Memory hierarchy optimizations have been studied by researchers in many areas, including compilers, numerical linear algebra, and theoretical computer science. However, the approaches taken by these communities are very different. The compiler community has invested considerable effort in inventing loop transformations such as loop permutation and tiling, and in developing simple analytical models to determine the values of numerical parameters, such as tile sizes, required by these transformations. Although the performance of compiler-generated code has improved steadily over the years, restructuring compilers remain difficult to retarget because the analytical models must be developed manually for each new platform.
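To make the tiling transformation concrete, the sketch below (in C, not taken from any particular compiler's output) shows a tiled version of the standard triply nested matrix multiplication loop. The matrix size N and the tile size T are illustrative values chosen so that T divides N; T is precisely the kind of numerical parameter whose value an analytical model or a search must determine for each platform.

```c
/* A minimal sketch of loop tiling (blocking) for C += A * B.
 * The outer three loops enumerate T x T tiles; the inner three loops
 * multiply one tile of A by one tile of B, so the working set of the
 * inner loops is roughly 3*T*T doubles and can be sized to fit in a
 * given cache level. The j-innermost order keeps accesses to B and C
 * unit-stride. N and T are illustrative, with T dividing N. */
#define N 1024
#define T 64

void matmul_tiled(double C[N][N], double A[N][N], double B[N][N]) {
    for (int ii = 0; ii < N; ii += T)
        for (int kk = 0; kk < N; kk += T)
            for (int jj = 0; jj < N; jj += T)
                /* multiply the tile A[ii..][kk..] by B[kk..][jj..] into C[ii..][jj..] */
                for (int i = ii; i < ii + T; i++)
                    for (int k = kk; k < kk + T; k++)
                        for (int j = jj; j < jj + T; j++)
                            C[i][j] += A[i][k] * B[k][j];
}
```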

The search for performance portability has led to the development of self-optimizing software systems. One approach to self-optimizing software is the generate-and-test approach, which has been used to produce high-performance BLAS and FFT libraries. Another approach to portable memory hierarchy optimization is to implement cache-oblivious algorithms using divide-and-conquer: each step of the recursion generates sub-problems of smaller size, and once the working set of a sub-problem fits in some level of the memory hierarchy, that sub-problem can be executed without capacity misses at that level.

Although all three approaches have been studied extensively, there are few experimental studies that compare them. How well does the code produced by current self-optimizing systems perform compared to hand-tuned code? Is empirical search essential to the generate-and-test approach, or can analytical models with platform-specific parameters be used to reduce the size of the search space? The cache-oblivious approach uses divide-and-conquer to perform approximate blocking; how well does approximate blocking perform compared to precise blocking? This paper addresses these questions for matrix multiplication, the most important dense linear algebra kernel.
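To make the divide-and-conquer strategy of the cache-oblivious approach concrete, the following sketch (an illustrative C formulation, not the code evaluated in this paper) recursively halves the largest dimension of C += A*B until a small base case is reached. The names mm_rec, LD, and BASE are hypothetical; the essential point is that no cache size or tile size appears anywhere in the code, which is what makes the resulting blocking approximate rather than precise.

```c
/* A minimal sketch of cache-oblivious matrix multiplication, C += A*B.
 * Matrices are stored row-major with an assumed leading dimension LD,
 * and sub-matrices are addressed by pointer offsets. The recursion
 * always halves the largest of the three dimensions; once a
 * sub-problem's working set fits in some cache level, all deeper
 * calls run without capacity misses at that level. BASE is an
 * illustrative cutoff at which plain loops take over. */
#define LD   1024
#define BASE 32

static void mm_base(int m, int n, int k,
                    const double *A, const double *B, double *C) {
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++)
            for (int p = 0; p < k; p++)
                C[i*LD + j] += A[i*LD + p] * B[p*LD + j];
}

void mm_rec(int m, int n, int k,
            const double *A, const double *B, double *C) {
    if (m <= BASE && n <= BASE && k <= BASE) {
        mm_base(m, n, k, A, B, C);
    } else if (m >= n && m >= k) {      /* split the rows of A and C */
        mm_rec(m/2,     n, k, A,              B, C);
        mm_rec(m - m/2, n, k, A + (m/2)*LD,   B, C + (m/2)*LD);
    } else if (n >= m && n >= k) {      /* split the columns of B and C */
        mm_rec(m, n/2,     k, A, B,        C);
        mm_rec(m, n - n/2, k, A, B + n/2,  C + n/2);
    } else {                            /* split the shared k dimension */
        mm_rec(m, n, k/2,     A,        B,              C);
        mm_rec(m, n, k - k/2, A + k/2,  B + (k/2)*LD,   C);
    }
}
```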