"Design and optimization of large size and low overhead off-chip caches"

Zhao Zhang, Zhichun Zhu, and Xiaodong Zhang 

IEEE Transactions on Computers, Vol. 53, No. 7, 2004, pp. 843-855. 


Large off-chip L3 caches can significantly improve the performance of
memory-intensive applications.  However, conventional L3 SRAM caches
face two issues as these applications require increasingly large
caches.  First, an SRAM cache has a limited size due to the low density
and high cost of SRAM, and thus cannot hold the working sets of many
memory-intensive applications.  Second, since the tag checking overhead
of large caches is non-trivial, the existence of L3 caches increases the
cache miss penalty and may even harm the performance of some
memory-intensive applications.  To address these two issues, we present
a new memory hierarchy design that uses cached DRAM to construct a
large, low-overhead off-chip cache.  The high-density DRAM portion in
the cached DRAM can hold large working sets, while the small SRAM
portion exploits the spatial locality appearing in L2 miss streams to
reduce the access latency.  The L3 tag array is placed off-chip with the
data array, minimizing the processor area devoted to the L3 cache, while
a small tag cache is placed on-chip, effectively removing the off-chip
tag access overhead.  A prediction technique accurately
predicts the hit/miss status of an access to the cached DRAM, further
reducing the access latency.
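
The role of the on-chip tag cache can be illustrated with a toy model.
The sketch below is not the simulator used in the paper: the
direct-mapped organization, the 128-byte line size, the LRU tag cache,
and all cycle counts are assumptions chosen only for illustration, and
the names TagCache, split, and access are hypothetical.  It models only
the tag cache; the SRAM portion of the cached DRAM and the hit/miss
predictor are left out.

```python
from collections import OrderedDict

# All constants below are illustrative assumptions, not values from the paper.
LINE_BYTES = 128                           # assumed L3 line size
NUM_SETS = (64 * 1024 * 1024) // LINE_BYTES  # 64MB, assumed direct-mapped
OFFCHIP_TAG_CYCLES = 20                    # assumed cost of reading a tag off-chip
L3_DATA_CYCLES = 40                        # assumed cost of an L3 data access
MEMORY_CYCLES = 200                        # assumed cost of going to main memory


class TagCache:
    """Small LRU cache of L3 tag entries held on the processor chip."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # set index -> tag, oldest first

    def lookup(self, set_index):
        """Return (found, tag); found is False if the entry is not on-chip."""
        if set_index not in self.entries:
            return False, None
        self.entries.move_to_end(set_index)  # mark as most recently used
        return True, self.entries[set_index]

    def fill(self, set_index, tag):
        """Record the current tag of a set, evicting the LRU entry if full."""
        self.entries[set_index] = tag
        self.entries.move_to_end(set_index)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)


def split(addr):
    """Split a byte address into (set index, tag) for the assumed L3 layout."""
    line = addr // LINE_BYTES
    return line % NUM_SETS, line // NUM_SETS


def access(addr, l3_tags, tag_cache):
    """Return (hit, latency) for one L2 miss that reaches the off-chip cache."""
    set_index, tag = split(addr)
    found, stored = tag_cache.lookup(set_index)
    latency = 0
    if not found:
        # Tag entry is not on-chip: pay for the off-chip tag array access.
        latency += OFFCHIP_TAG_CYCLES
        stored = l3_tags.get(set_index)
    hit = stored == tag
    latency += L3_DATA_CYCLES if hit else MEMORY_CYCLES
    if not hit:
        l3_tags[set_index] = tag  # the fetched line replaces the old one
    tag_cache.fill(set_index, tag)
    return hit, latency
```

In this model, an access whose tag entry is already on-chip pays only
the data or memory latency, while an access that misses the tag cache
pays the additional off-chip tag read.  That extra read is the overhead
the on-chip tag cache is meant to hide.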

Using execution-driven simulations of a 2GHz, 4-way issue processor
running eleven memory-intensive programs from the SPEC 2000 benchmark
suite, we show that a system using a cached DRAM (64MB of DRAM and a
128KB on-chip SRAM cache) as its off-chip cache outperforms the same
system with an 8MB SRAM L3 off-chip cache by up to 78%, measured by
total execution time.
The average speedup of the system with the cached-DRAM off-chip cache is
25% over the system with the L3 SRAM cache.