Memory hierarchy optimizations have been studied by researchers in many areas including compilers, numerical linear algebra, and theoretical computer science. However, the approaches taken by these communities are very different. The compiler community has invested considerable effort in inventing loop transformations like loop permutation and tiling, and in the development of simple analytical models to determine the values of numerical parameters such as tile sizes required by these transformations. Although the performance of compiler-generated code has improved steadily over the years, it is difficult to retarget restructuring compilers to new platforms because these analytical models must be developed manually for each platform.
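To make the tiling transformation concrete, the following sketch shows a blocked matrix multiplication of the kind such compilers generate. The tile size `T` is a hypothetical fixed value; in practice it is exactly the parameter an analytical model (or empirical search) must choose so that the working tiles fit in a target cache level.

```c
#include <stddef.h>

/* Hypothetical tile size; an analytical model would choose T so that
   the tiles of A, B, and C touched together fit in a cache level. */
#define T 32

/* Tiled (blocked) matrix multiplication C += A * B for n x n
   row-major matrices: each of the i/k/j loops is split into a tile
   loop (ii/kk/jj) and an intra-tile loop, so each T x T tile is
   reused from cache instead of being refetched from memory. */
static void matmul_tiled(size_t n, const double *A, const double *B,
                         double *C)
{
    for (size_t ii = 0; ii < n; ii += T)
        for (size_t kk = 0; kk < n; kk += T)
            for (size_t jj = 0; jj < n; jj += T)
                /* multiply one pair of tiles; the min-style bounds
                   handle matrices whose size is not a multiple of T */
                for (size_t i = ii; i < ii + T && i < n; i++)
                    for (size_t k = kk; k < kk + T && k < n; k++) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < jj + T && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

Note that the transformation changes only the traversal order, not the result; correctness is independent of `T`, but performance is not, which is why choosing `T` per platform is the hard part.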

The search for performance portability has led to the development
of *self-optimizing software systems*. One approach to
self-optimizing software is the *generate-and-test* approach,
which has been used by the dense numerical linear algebra
community to produce high-performance BLAS and FFT libraries.
Another approach to portable memory hierarchy optimization is to
use the divide-and-conquer approach to implementing
*cache-oblivious* algorithms. Each step of divide-and-conquer
generates sub-problems of smaller size; once the working set of a
sub-problem fits in some level of the memory hierarchy, that
sub-problem can be executed without capacity misses at that level.
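The divide-and-conquer scheme described above can be sketched as a recursive matrix multiplication. No cache parameter appears anywhere in the code; the base-case cut-off `BASE` below is a hypothetical choice for limiting recursion overhead, not a tuned blocking factor, and for simplicity the sketch assumes the matrix dimension is a power of two.

```c
#include <stddef.h>

/* Cut-off below which we fall back to a plain loop nest; this limits
   recursion overhead and is NOT a cache-derived blocking parameter. */
#define BASE 16

/* Cache-oblivious matrix multiplication C += A * B on n x n
   row-major sub-matrices embedded in arrays of leading dimension ld.
   Each recursion level quarters the matrices; once the sub-matrices
   fit in a given cache level, everything below that point in the
   recursion runs without capacity misses at that level. */
static void matmul_rec(size_t n, size_t ld,
                       const double *A, const double *B, double *C)
{
    if (n <= BASE) {
        for (size_t i = 0; i < n; i++)
            for (size_t k = 0; k < n; k++)
                for (size_t j = 0; j < n; j++)
                    C[i * ld + j] += A[i * ld + k] * B[k * ld + j];
        return;
    }
    size_t h = n / 2; /* assumes n is a power of two */
    const double *A11 = A, *A12 = A + h,
                 *A21 = A + h * ld, *A22 = A + h * ld + h;
    const double *B11 = B, *B12 = B + h,
                 *B21 = B + h * ld, *B22 = B + h * ld + h;
    double *C11 = C, *C12 = C + h,
           *C21 = C + h * ld, *C22 = C + h * ld + h;

    /* C11 += A11*B11 + A12*B21, and similarly for the other quadrants */
    matmul_rec(h, ld, A11, B11, C11); matmul_rec(h, ld, A12, B21, C11);
    matmul_rec(h, ld, A11, B12, C12); matmul_rec(h, ld, A12, B22, C12);
    matmul_rec(h, ld, A21, B11, C21); matmul_rec(h, ld, A22, B21, C21);
    matmul_rec(h, ld, A21, B12, C22); matmul_rec(h, ld, A22, B22, C22);
}
```

Because the recursion halves the problem at every level, there is always *some* level whose sub-matrices fit in a given cache; this is the "approximate blocking" that the questions below compare against the precise blocking of tiled code.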
Although all three approaches have been studied extensively, few
experimental studies have compared them directly.
How well does the code produced by current self-optimizing systems
perform compared to hand-tuned code? Is empirical search essential
to the generate-and-test approach, or is it possible to use
analytical models with platform-specific parameters to reduce the
size of the search space? The cache-oblivious approach uses
divide-and-conquer to perform approximate blocking; how well does
approximate blocking perform compared to precise blocking? This
paper addresses such questions for matrix multiplication, which is
the most important dense linear algebra kernel.