Mitigating the Impact of Decompression Latency in L1 Compressed Data Caches via Prefetching

Abstract
Expanding cache size is a common approach to reducing cache miss rates and increasing processor performance. This approach, however, comes at the cost of increased static and dynamic power consumption in the cache. Static power scales with the number of transistors in the design, while dynamic power grows with the number of transistors being switched and with the effective operating frequency of the cache.
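For reference, these two scaling behaviors are captured by the standard first-order CMOS power model (a textbook approximation, not specific to this design):

    P_{\text{static}} \propto N_{\text{transistors}} \cdot V_{dd} \cdot I_{\text{leak}}
    P_{\text{dynamic}} \approx \alpha \cdot C_{\text{switched}} \cdot V_{dd}^{2} \cdot f

where N_transistors is the transistor count, I_leak the per-transistor leakage current, alpha the activity factor, C_switched the switched capacitance, V_dd the supply voltage, and f the operating frequency.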
Cache compression is a technique that can increase the effective capacity of a cache without a commensurate increase in static and dynamic power consumption. Alternatively, compression can reduce the physical size, and therefore the static and dynamic energy usage, of the cache while maintaining a reasonable effective capacity. A drawback of compression is the delay, or decompression latency, incurred when accessing compressed data, which lies on the critical execution path of the processor. This latency can noticeably degrade processor performance, especially when compression is implemented in first-level caches.
Cache prefetching techniques have been used to hide the latency of lower-level memory accesses. This work investigates combining current prefetching techniques with cache compression to reduce the effect of decompression latency and thereby improve the feasibility of power reduction via compression in high-level caches.
We propose an architecture that combines L1 data cache compression with table-based prefetching to predict which cache lines will require decompression. The architecture then performs decompression in parallel with execution, moving the decompression delay off the critical path of the processor. The architecture is evaluated using 90 nm CMOS technology simulations in a new branch of SimpleScalar, with Wattch as a baseline and cache model inputs from CACTI. The compression and decompression hardware is synthesized using the 90 nm Cadence GPDK and verified at the register-transfer level.
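To make the predictor concrete, the following is a minimal software sketch of a table-based prediction scheme of the kind described above: a PC-indexed table records the last line touched by each load and the stride between consecutive accesses, so the predicted next line can be decompressed early. The table size, indexing, and field names here are illustrative assumptions, not the paper's exact hardware design; note that a stride of zero degenerates to last outcome prediction.

    /* Sketch of a table-based decompression predictor (illustrative only). */
    #include <stdint.h>
    #include <inttypes.h>
    #include <stdio.h>

    #define TABLE_SETS 1024   /* assumed 1K-set table, as in the evaluation */
    #define LINE_SHIFT 6      /* assumed 64-byte cache lines */

    typedef struct {
        uint64_t tag;         /* load PC that owns this entry */
        uint64_t last_line;   /* line address of the previous access */
        int64_t  stride;      /* last observed line-to-line stride */
    } pred_entry_t;

    static pred_entry_t table[TABLE_SETS];

    /* Called when a load issues: returns the line predicted to be needed
     * next by this PC so its decompression can begin in parallel.
     * Returns 0 when no prediction is available yet. */
    uint64_t predict_next_line(uint64_t pc, uint64_t addr)
    {
        uint64_t line = addr >> LINE_SHIFT;
        pred_entry_t *e = &table[pc % TABLE_SETS];
        uint64_t predicted = 0;

        if (e->tag == pc) {
            /* Stride prediction; stride 0 is last outcome prediction. */
            e->stride = (int64_t)line - (int64_t)e->last_line;
            predicted = line + e->stride;
        } else {
            e->tag = pc;      /* allocate entry, evicting previous owner */
            e->stride = 0;
        }
        e->last_line = line;
        return predicted;
    }

    int main(void)
    {
        /* Toy trace: one load PC streaming through consecutive lines. */
        for (uint64_t a = 0x1000; a < 0x1200; a += 64)
            printf("access 0x%" PRIx64 " -> predecompress line 0x%" PRIx64 "\n",
                   a, predict_next_line(0x400123, a));
        return 0;
    }

In hardware, the predicted line's decompression would be initiated while the pipeline proceeds, hiding the latency rather than eliminating it.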
Our results demonstrate that base-delta-immediate (BΔI) compression, combined with the last outcome (LO), stride (S), or two-level (2L) prefetch methods, or with hybrid combinations of these methods (S/LO or 2L/S), improves performance over BΔI compression alone in the L1 data cache. On average across the SPEC CPU2000 benchmarks tested, BΔI compression results in a slowdown of 3.6%. Adding a 1K-set last outcome prefetch mechanism reduces this slowdown to 2.1% and reduces the energy consumption of the L1 data cache by 21% relative to a baseline scheme with no compression.
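For concreteness, the kind of test BΔI hardware applies to a cache line can be sketched as below. This illustrates only one of BΔI's encodings (8-byte base with 1-byte deltas) and assumes a 32-byte line; the actual BΔI design evaluates several base and delta widths in parallel and also handles all-zero and repeated-value lines.

    /* Check for one BΔI encoding, base8-delta1 (illustrative sketch). */
    #include <stdint.h>
    #include <stdbool.h>

    #define WORDS_PER_LINE 4  /* assumed 32-byte line as 4 x 8-byte words */

    /* Returns true if every word is within an 8-bit signed delta of the
     * first word, so the line fits in 8 (base) + 4 (deltas) = 12 bytes
     * instead of 32. */
    bool compressible_base8_delta1(const uint64_t line[WORDS_PER_LINE])
    {
        uint64_t base = line[0];
        for (int i = 1; i < WORDS_PER_LINE; i++) {
            int64_t delta = (int64_t)(line[i] - base);
            if (delta < INT8_MIN || delta > INT8_MAX)
                return false;  /* delta too wide for this encoding */
        }
        return true;
    }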