Leveraging Memory Level Parallelism Using Dynamic Warp Subdivision
| Content Provider | CiteSeerX |
|---|---|
| Author | Meng, Jiayuan; Skadron, Kevin; Tarjan, David |
| Abstract | SIMD organizations have been shown to deliver high throughput for data-parallel applications. They operate multiple datapaths under the same instruction sequencer; the set of operations executing in lockstep is sometimes referred to as a warp, and a single lane as a thread. However, because SIMD hardware can gather from disparate addresses rather than only aligned vectors, lanes can hit or miss the cache independently, and a single long-latency memory access suspends the entire warp until it completes. This under-utilizes computation resources and sacrifices memory-level parallelism, because threads that hit cannot proceed and issue further memory requests; eventually the pipeline may stall and performance is penalized. We therefore propose dynamic warp subdivision techniques that construct run-ahead "warp-splits" from threads that hit the cache, so that they can run ahead and prefetch cache lines that may be used by the threads that fall behind. Several optimization strategies are investigated, and we evaluate the techniques on two types of memory systems: a bulk-synchronous cache organization and a coherent cache hierarchy. The former has private caches communicating with main memory, with coherence enforced by global barriers; the latter has private caches coherently sharing an inclusive, on-chip last-level cache (LLC). Experiments with eight data-parallel benchmarks show that our technique improves performance on average by 15% on the bulk-synchronous cache organization, with a maximum speedup of 1.6X, and by 17% on the coherent cache hierarchy, with a maximum speedup of 1.9X. This is achieved with an area overhead of less than 2%. *(A simplified sketch of the warp-split mechanism follows this table.)* |
| Access Restriction | Open |
| Subject Keyword | Instruction Sequencer, Disparate Address, High Throughput, Coherent Cache Hierarchy, Memory System, Single Long Latency Memory Access, Area Overhead, Cache Line, Sacrifice Memory Level Parallelism, Run-ahead Warp-splits, Computation Resource, Main Memory, Private Cache, Memory Request, SIMD Organization, Single Lane, On-chip Last Level Cache, Global Barrier, Entire Warp, Maximum Speedup, Bulk-synchronous Cache Organization, Several Optimization Strategy, Data-parallel Application, Data-parallel Benchmark, Multiple Datapaths, Aligned Vector |
| Content Type | Text |
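
The abstract describes subdividing a warp when its lanes diverge on a memory access: lanes that hit the cache are split off to run ahead while lanes that miss block, recovering memory-level parallelism. As a rough illustration only, not the authors' microarchitecture, the following Python sketch models one way a scheduler might perform that subdivision; `WarpSplit`, `on_memory_access`, and the `cache`/`scheduler` objects are all hypothetical names introduced here.

```python
from dataclasses import dataclass

@dataclass
class WarpSplit:
    thread_ids: list        # SIMD lanes currently grouped under one PC
    pc: int = 0             # shared program counter for these lanes
    blocked: bool = False   # True while waiting on an outstanding miss

def on_memory_access(split, cache, addresses, scheduler):
    """Handle one gather by a warp-split; subdivide on memory divergence.

    `cache` and `scheduler` are assumed, duck-typed objects: `cache.lookup`
    returns whether an address hits, and `scheduler.add` registers a new
    warp-split for later scheduling.
    """
    hits = [t for t in split.thread_ids if cache.lookup(addresses[t])]
    misses = [t for t in split.thread_ids if t not in hits]

    if not misses:       # every lane hit: the whole split proceeds in lockstep
        split.pc += 1
    elif not hits:       # every lane missed: the whole split blocks on memory
        split.blocked = True
    else:
        # Memory divergence: lanes that missed are carved off into a new
        # split that blocks on memory, while lanes that hit run ahead.
        # The run-ahead split keeps issuing accesses, effectively
        # prefetching lines the blocked lanes are likely to need soon.
        scheduler.add(WarpSplit(thread_ids=misses, pc=split.pc, blocked=True))
        split.thread_ids = hits
        split.pc += 1
```

A full design would presumably also re-converge splits once their misses return, and decide when subdivision is worthwhile; policy choices of that kind are what the abstract refers to as the several optimization strategies the paper investigates.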