Reducing off-chip memory accesses of wavefront parallel programs in Graphics Processing Units

Ranasinghe, Waruna, author; Rajopadhye, Sanjay, advisor; Bohm, Wim, committee member; Oprea, Iuliana, committee member

Files

Ranasinghe_colostate_0053N_12785.pdf (477.5 KB)

Date

2014

Authors

Ranasinghe, Waruna, author

Rajopadhye, Sanjay, advisor

Bohm, Wim, committee member

Oprea, Iuliana, committee member

Abstract

The power wall is one of the major barriers standing in the way of exascale computing. To break through it, overall system power and energy must be reduced without sacrificing performance. Energy consumption can be decreased by designing power-efficient hardware, software, or both. In this thesis, we present a software approach to lowering the energy consumption of programs targeted at Graphics Processing Units (GPUs). The main idea is to reduce energy consumption by minimizing the number of off-chip (global) memory accesses, which can be achieved by improving the hit rate of the last-level (L2) cache. A wavefront is a set of data/tiles that can be processed concurrently. A kernel is a function that is executed on the GPU. We propose a novel approach to implementing wavefront parallel programs on GPUs. Instead of using one kernel call per wavefront, as in the traditional implementation, we use one kernel call for the whole program and order the computations so that L2 cache reuse is achieved. A strip of wavefronts (or a pass) is a collection of partial wavefronts. We exploit the non-preemptive behavior of the thread block scheduler to process a strip of wavefronts (i.e., a pass) instead of a complete wavefront at a time. The data transferred by a partial wavefront in a pass is small enough to fit in the L2 cache, so successive partial wavefronts in the pass reuse the data already held there. Hence, the number of off-chip memory accesses is significantly reduced. We also introduce a technique for communicating and synchronizing between two thread blocks without limiting the number of thread blocks per kernel or per streaming multiprocessor (SM). This technique is used to maintain the order of wavefronts. We have shown analytically, and validated experimentally, the reduction in off-chip memory accesses achieved by our approach. Off-chip memory reads and writes decrease by factors of 45 and 3, respectively. We have also shown that if GPUs incorporated an L2 cache with a write-back policy, off-chip memory writes would likewise be reduced by a factor of 45. Our approach achieves 98% L2 cache read hits and 74% total cache hits, whereas the traditional approach achieves only 2% and 1%, respectively.
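
The difference between the two processing orders described in the abstract can be illustrated with a small host-side sketch. It is not the thesis's implementation. It assumes a rectangular grid of tiles in which each tile depends on its north, west and north-west neighbours (as in Smith-Waterman), and it assumes that a strip is a band of consecutive tile rows of fixed height. The function names and the `strip_height` parameter are hypothetical, and the sketch does not model the GPU thread block scheduler, the L2 cache, or inter-block synchronization.

```python
def wavefront_order(rows, cols):
    """Traditional order: one complete anti-diagonal (wavefront) at a time."""
    return [
        [(i, w - i) for i in range(rows) if 0 <= w - i < cols]
        for w in range(rows + cols - 1)
    ]


def pass_order(rows, cols, strip_height):
    """Assumed strip order: process every partial wavefront of one band of
    rows (a pass) before moving on to the next band."""
    passes = []
    for top in range(0, rows, strip_height):
        bottom = min(top + strip_height, rows)
        partial_wavefronts = [
            [(i, w - i) for i in range(top, bottom) if 0 <= w - i < cols]
            for w in range(top, bottom + cols - 1)
        ]
        passes.append(partial_wavefronts)
    return passes


def respects_dependencies(groups):
    """Return True if the north, west and north-west neighbours of every tile
    were computed in an earlier group."""
    done = set()
    for group in groups:
        for i, j in group:
            for di, dj in ((-1, 0), (0, -1), (-1, -1)):
                ni, nj = i + di, j + dj
                if ni >= 0 and nj >= 0 and (ni, nj) not in done:
                    return False
        done.update(group)
    return True


if __name__ == "__main__":
    full = wavefront_order(4, 6)
    passes = pass_order(4, 6, strip_height=2)
    flat = [group for one_pass in passes for group in one_pass]
    print("largest full wavefront:", max(len(g) for g in full))
    print("largest partial wavefront:", max(len(g) for g in flat))
    print("both orders valid:",
          respects_dependencies(full) and respects_dependencies(flat))
```

Under these assumptions, a 4 x 6 tile grid with a strip height of 2 has full wavefronts of up to 4 tiles but partial wavefronts of at most 2 tiles, and both orders satisfy the dependencies. The smaller group of tiles handled at each step is what makes it plausible for the data shared between successive partial wavefronts to stay in the L2 cache.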

Subject

energy

GPGPU

power

CUDA

Smith-Waterman

synchronization

URI

http://hdl.handle.net/10217/88551
https://doi.org/10.25675/3.021944

Collections

2000-2019
Theses and Dissertations
