Reducing off-chip memory accesses of wavefront parallel programs in Graphics Processing Units

dc.contributor.author: Ranasinghe, Waruna, author
dc.contributor.author: Rajopadhye, Sanjay, advisor
dc.contributor.author: Bohm, Wim, committee member
dc.contributor.author: Oprea, Iuliana, committee member
dc.date.accessioned: 2007-01-03T06:23:24Z
dc.date.available: 2007-01-03T06:23:24Z
dc.date.issued: 2014
dc.description.abstract: The power wall is one of the major barriers standing in the way of exascale computing. To break through it, overall system power and energy consumption must be reduced without degrading performance. Energy consumption can be decreased by designing power-efficient hardware, software, or both. In this thesis, we present a software approach to lowering the energy consumption of programs targeted at Graphics Processing Units (GPUs). The main idea is to reduce energy consumption by minimizing the number of off-chip (global) memory accesses, which can be achieved by improving the hit rate of the last-level (L2) cache. A wavefront is a set of data/tiles that can be processed concurrently. A kernel is a function that is executed on the GPU. We propose a novel approach to implementing wavefront parallel programs on GPUs. Instead of using one kernel call per wavefront, as in the traditional implementation, we use one kernel call for the whole program and order the computations so that data is reused in the L2 cache. A strip of wavefronts (or a pass) is a collection of partial wavefronts. We exploit the non-preemptive behavior of the thread block scheduler to process a strip of wavefronts (i.e., a pass) rather than a complete wavefront at a time. The data transferred by a partial wavefront in a pass is small enough to fit in the L2 cache, so successive partial wavefronts in the pass reuse the data held there. Hence the number of off-chip memory accesses is significantly reduced. We also introduce a technique for communication and synchronization between two thread blocks that does not limit the number of thread blocks per kernel or per streaming multiprocessor (SM). This technique is used to maintain the order of wavefronts. We have shown analytically and validated experimentally the reduction in off-chip memory accesses achieved by our approach. Off-chip memory reads and writes are decreased by factors of 45 and 3, respectively. We have also shown that if GPUs incorporated an L2 cache with a write-back policy, off-chip memory writes would also be reduced by a factor of 45. Our approach achieves 98% L2 cache read hits and 74% total cache hits, whereas the traditional approach achieves only 2% and 1%, respectively.
dc.format.medium: born digital
dc.format.medium: masters theses
dc.identifier: Ranasinghe_colostate_0053N_12785.pdf
dc.identifier.uri: http://hdl.handle.net/10217/88551
dc.language: English
dc.language.iso: eng
dc.publisher: Colorado State University. Libraries
dc.relation.ispartof: 2000-2019
dc.rights: Copyright and other restrictions may apply. User is responsible for compliance with all applicable laws. For information about copyright law, please see https://libguides.colostate.edu/copyright.
dc.subject: energy
dc.subject: GPGPU
dc.subject: power
dc.subject: CUDA
dc.subject: Smith-Waterman
dc.subject: synchronization
dc.title: Reducing off-chip memory accesses of wavefront parallel programs in Graphics Processing Units
dc.type: Text
dcterms.rights.dpla: This Item is protected by copyright and/or related rights (https://rightsstatements.org/vocab/InC/1.0/). You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).
thesis.degree.discipline: Computer Science
thesis.degree.grantor: Colorado State University
thesis.degree.level: Masters
thesis.degree.name: Master of Science (M.S.)
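
The ordering idea described in the abstract (processing a strip of partial wavefronts per pass instead of one complete wavefront at a time) can be illustrated with a small host-side sketch. This is not the thesis's CUDA implementation: it only enumerates tile schedules. The function names, the horizontal strip shape, and the Smith-Waterman-style dependence of each tile on its north, west, and north-west neighbours are assumptions made for illustration.

```python
def wavefront_order(rows, cols):
    """Traditional order: one complete anti-diagonal (wavefront) of tiles per step."""
    return [
        [(i, d - i) for i in range(max(0, d - cols + 1), min(rows, d + 1))]
        for d in range(rows + cols - 1)
    ]


def strip_order(rows, cols, strip_height):
    """Pass order (assumed shape): split tile rows into horizontal strips and
    sweep the partial wavefronts of one strip before moving to the next."""
    schedule = []
    for top in range(0, rows, strip_height):
        height = min(strip_height, rows - top)
        for partial in wavefront_order(height, cols):
            schedule.append([(top + i, j) for i, j in partial])
    return schedule


def respects_dependencies(schedule):
    """Return True if every tile's north, west and north-west neighbours
    (assumed dependence pattern) were completed in an earlier step."""
    done = set()
    for step in schedule:
        for i, j in step:
            for di, dj in ((1, 0), (0, 1), (1, 1)):
                pi, pj = i - di, j - dj
                if pi >= 0 and pj >= 0 and (pi, pj) not in done:
                    return False
        done.update(step)
    return True


if __name__ == "__main__":
    full = wavefront_order(8, 8)
    strips = strip_order(8, 8, 2)
    print("widest full wavefront:", max(len(s) for s in full))
    print("widest partial wavefront:", max(len(s) for s in strips))
    print("strip schedule valid:", respects_dependencies(strips))
```

For an 8 x 8 tile grid with strips two tile rows high, each step of the strip schedule touches at most 2 tiles, compared with up to 8 for a complete wavefront. A smaller per-step working set is what would allow successive partial wavefronts to find their inputs still resident in the L2 cache.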

Files

Original bundle
Name: Ranasinghe_colostate_0053N_12785.pdf
Size: 477.5 KB
Format: Adobe Portable Document Format