TR-98-1.ps.Z

Y. Yan  

Exploiting Cache Locality on Symmetric Multiprocessors: A Run-Time Approach 

Ph.D. Dissertation, Department of Computer Science, 
Ohio State University, May 1998. 
 
Abstract
--------

With the widening gap between processor and memory speeds,
memory access has become a major
performance bottleneck in modern computer systems.
Recently, Symmetric Multi-Processor (SMP) systems have emerged as a
major class of high-performance platforms. On these SMP systems,
the efficiency of memory access in an application is critical
to its overall execution performance.

Optimizing the cache locality of
a parallel application is an effective way to reduce
the memory bottleneck
and improve the performance of a parallel computation.
For applications with static memory-access patterns, many effective
techniques, such as compile-time locality optimizations, have been
proposed. However, improving the memory performance of
applications with dynamic memory-access patterns remains a hard
problem in parallel computing. The solution to this problem
is critical to the success of parallel computing
because dynamic memory-access patterns occur
in many real-world applications.
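
To make the distinction concrete (this example is ours, not the
dissertation's), the C sketch below contrasts a loop whose affine
access pattern a compiler can analyze and tile at compile time with an
irregular loop whose pattern is driven by an index array and is
therefore known only at run time:

    /* Illustrative only: static vs. dynamic memory-access patterns. */

    #define N 1024

    double a[N][N], b[N][N];
    double x[N], y[N];
    int    idx[N];               /* indirection array, filled at run time */

    /* Static pattern: subscripts are affine functions of the loop indices,
     * so a compiler can tile or interchange the loops at compile time.   */
    void static_pattern(void)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] += b[j][i];
    }

    /* Dynamic pattern: which elements of y are touched depends on idx[],
     * so the locality of the loop cannot be determined until run time.   */
    void dynamic_pattern(void)
    {
        for (int i = 0; i < N; i++)
            x[i] += y[idx[i]];
    }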

This dissertation is aimed at solving the above problem.
Based on a rigorous analysis of cache-locality optimization,
we propose a memory-layout oriented run-time technique to exploit
the cache locality of parallel loops on SMP systems.
The proposed technique consists of four components: (1)
a method to estimate and abstract memory-access patterns
of applications, (2) a memory-layout based method to reorganize
tasks to maximize data reuse in caches, (3) a heuristic task
partitioning algorithm to minimize both data sharing and load
imbalance, and (4) an adaptive, locality-preserving
scheduling algorithm to minimize the parallel execution time
of an application.
These schemes have been integrated and implemented in a
run-time system.
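
A minimal sketch of the memory-layout-oriented idea is given below. It
assumes a hypothetical setting in which each loop iteration touches one
element selected by a run-time index array; iterations are keyed by the
memory region they access and sorted so that iterations reusing the
same cache lines become adjacent and can be assigned to the same
processor. The names, region size, and partitioning scheme are
illustrative, not the dissertation's actual interface.

    /* Hypothetical sketch: group loop iterations by the memory region they
     * touch so that iterations reusing the same cache lines end up on the
     * same processor.  All names and parameters are illustrative.         */

    #include <stdlib.h>

    #define N       (1 << 16)
    #define NPROCS  4
    #define REGION  (1 << 12)          /* bytes covered by one layout bucket */

    double y[N];
    int    idx[N];                     /* run-time indirection array */

    typedef struct { int iter; size_t region; } task_t;

    static int by_region(const void *a, const void *b)
    {
        const task_t *ta = a, *tb = b;
        return (ta->region > tb->region) - (ta->region < tb->region);
    }

    /* Build one task per iteration, keyed by the region it touches, sort so
     * that tasks sharing a region become adjacent, then hand contiguous
     * chunks of the sorted task list to the processors.                    */
    void reorder_and_partition(task_t tasks[N], int owner[N])
    {
        for (int i = 0; i < N; i++) {
            tasks[i].iter   = i;
            tasks[i].region = (size_t)&y[idx[i]] / REGION;
        }
        qsort(tasks, N, sizeof(task_t), by_region);

        for (int i = 0; i < N; i++)    /* simple block partition of tasks */
            owner[i] = i / (N / NPROCS);
    }

In practice such a reorganization pays off when the loop is executed
many times, so its run-time cost can be amortized across executions.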

To provide an in-depth analysis of our run-time system,
we built a detailed SMP simulator.
Using simulation and measurement, we have shown that our run-time
approach achieves performance comparable to compile-time
optimizations for regular applications, whose load balance and
cache locality can be well optimized by tiling and other program
transformations. For applications with dynamic memory-access patterns,
however, our approach significantly improves memory performance. Such
applications are usually hard to optimize with static compiler techniques.

The major contributions of this dissertation are:

1) We present models for the cache-locality optimization problems
in uniprocessor systems and SMP systems. These models characterize
the complexity of the problems and provide a solution framework for
optimizing cache locality.

2) We present an effective internal representation for the memory-access
pattern of a parallel loop to support efficient locality optimizations and
information integration.

3) We present a memory-layout oriented run-time technique for
locality optimization.

4) We present efficient scheduling algorithms that trade off locality
against load imbalance (sketched after this list).

5) We provide a detailed performance evaluation of the run-time
optimization technique at the architecture level and at the execution level.
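
As a purely illustrative note on contribution 4 (not the dissertation's
algorithm), one simple way to trade off locality against load imbalance
is to keep each locality group of tasks intact, preserving cache reuse,
while greedily assigning whole groups to the currently least-loaded
processor:

    /* Hypothetical greedy heuristic: keep each locality group on a single
     * processor (preserving cache reuse) and always give the next group to
     * the least-loaded processor (bounding load imbalance).               */

    #define NPROCS   4
    #define NGROUPS  64

    /* work[g] is an estimated cost of locality group g */
    void assign_groups(const double work[NGROUPS], int owner[NGROUPS])
    {
        double load[NPROCS] = { 0.0 };

        for (int g = 0; g < NGROUPS; g++) {
            int best = 0;
            for (int p = 1; p < NPROCS; p++)   /* find least-loaded processor */
                if (load[p] < load[best])
                    best = p;
            owner[g]    = best;
            load[best] += work[g];
        }
    }

Processing groups in decreasing order of estimated work
(longest-processing-time first) would tighten the balance further
without giving up any locality within a group.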