Cacheminer: a runtime approach to
exploit cache locality on SMP 

Y. Yan, X. Zhang, and Z. Zhang 

IEEE Transactions on Parallel and Distributed Systems, 
Vol. 11, No. 4, 2000, pp. 357-374. 

Exploiting cache locality of parallel programs
at runtime is a complementary approach
to a compiler optimization. This is particularly important
for those applications with dynamic memory access patterns.
We propose a memory-layout oriented technique to exploit
cache locality of parallel loops at runtime on Symmetric
Multi-Processor (SMP) systems.
Guided by application-dependent and targeted architecture-dependent hints,
our system, called Cacheminer, reorganizes and partitions a parallel loop
using the memory-access space of its execution.
Through effective runtime transformations, our system maximizes the data reuse
in each partitioned data region assigned in a cache, and minimizes
the data sharing among the partitioned data regions assigned to all caches.
The executions of tasks in the
partitions are scheduled in an adaptive and locality-preserved way
to minimize the execution time of programs
by trading off load balance and locality.

We have implemented the Cacheminer runtime library on two commercial
SMP servers and an SimOS simulated SMP.
Our simulation and measurement results show that our runtime
approach can achieve comparable performance with the compiler
optimizations for programs with regular computation and
memory-access patterns, whose load balance and cache locality
can be well optimized by the tiling and other program transformations.
However, our experimental results show that our approach is
able to significantly improve the memory performance
for the applications with irregular computation and
dynamic memory access patterns.
This type of programs is usually hard to be optimized
by static compiler optimizations.