Extending and validating the stencil processing unit

Rajasree, Revathy, author; Rajopadhye, Sanjay, advisor; Pasricha, Sudeep, committee member; Anderson, Charles W., committee member

dc.contributor.author	Rajasree, Revathy, author
dc.contributor.author	Rajopadhye, Sanjay, advisor
dc.contributor.author	Pasricha, Sudeep, committee member
dc.contributor.author	Anderson, Charles W., committee member
dc.date.accessioned	2016-08-18T23:10:19Z
dc.date.available	2016-08-18T23:10:19Z
dc.date.issued	2016
dc.description.abstract	Stencils are an important class of programs that appear in the core of many scientific and general-purpose applications. These compute-intensive kernels can benefit heavily from the massive compute power of accelerators like the GPGPU. However, due to the absence of any form of on-chip communication between the coarse-grain processors on a GPU, any data transfer/synchronization between the dependent tiles in stencil computations has to happen through the off-chip (global) memory, which is quite energy-expensive. On the road to exascale computing, energy is becoming an important cost metric, and the need for hardware and software that work together to reduce a system's energy consumption is growing. To make the execution of dense stencils more energy efficient, Rajopadhye et al. proposed a GPGPU-based accelerator called the Stencil Processing Unit (SPU), which introduces simple neighbor-to-neighbor communication between the Streaming Multiprocessors (SMs) on the GPU, thereby allowing some restricted data sharing between consecutive threadblocks. The SPU includes special storage units, called Communication Buffers, to orchestrate this data transfer and also provides an explicit mechanism for inter-threadblock synchronization by way of a special instruction. It is claimed to be more energy-efficient than GPUs because it reduces the number of off-chip accesses in stencils, which in turn reduces the dynamic energy overhead. Uguen developed a cycle-accurate performance simulator for the SPU, called SPU-Sim, and evaluated it using a matrix multiplication kernel, which was not suitable for this accelerator. This work focuses on extending SPU-Sim and evaluating the SPU architecture using a more insightful benchmark.
We introduce a producer-consumer based inter-block synchronization approach on the SPU, which is more efficient than the previous global synchronization, and an overlapped multi-pass execution model in the SPU runtime system. These optimizations have been implemented in SPU-Sim. Furthermore, the existing GPUWattch power model in the simulator has been refined to provide better power estimates for the SPU architecture. The improved architecture has been evaluated using a simple 2-D stencil benchmark, and we observe average savings of 16% in dynamic energy on the SPU compared to a closely comparable GPU platform. Nonetheless, the total energy consumption on the SPU is still comparatively high due to the static energy component. This high static energy is a direct result of the platform's increased leakage power, which stems from the inclusion of special load/store (L/S) units. Our conservative estimates indicate that replacing the current design of these L/S units with DMA engines can bring about a 15% decrease in the current leakage power of the SPU, and this can help the SPU outperform the GPU in terms of energy.
dc.format.medium	born digital
dc.format.medium	masters theses
dc.identifier	Rajasree_colostate_0053N_13732.pdf
dc.identifier.uri	http://hdl.handle.net/10217/176694
dc.identifier.uri	https://doi.org/10.25675/3.022680
dc.language	English
dc.language.iso	eng
dc.publisher	Colorado State University. Libraries
dc.relation.ispartof	2000-2019
dc.rights	Copyright and other restrictions may apply. User is responsible for compliance with all applicable laws. For information about copyright law, please see https://libguides.colostate.edu/copyright.
dc.subject	CUDA
dc.subject	GPGPU
dc.subject	stencil
dc.subject	energy-efficiency
dc.subject	accelerator
dc.subject	multi-pass
dc.title	Extending and validating the stencil processing unit
dc.type	Text
dcterms.rights.dpla	This Item is protected by copyright and/or related rights (https://rightsstatements.org/vocab/InC/1.0/). You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).
thesis.degree.discipline	Electrical and Computer Engineering
thesis.degree.grantor	Colorado State University
thesis.degree.level	Masters
thesis.degree.name	Master of Science (M.S.)

Files

Original bundle

Name: Rajasree_colostate_0053N_13732.pdf
Size: 368.42 KB
Format: Adobe Portable Document Format

Collections

2000-2019
Theses and Dissertations