Repository logo

Extending and validating the stencil processing unit

dc.contributor.authorRajasree, Revathy, author
dc.contributor.authorRajopadhye, Sanjay, advisor
dc.contributor.authorPasricha, Sudeep, committee member
dc.contributor.authorAnderson, Charles W., committee member
dc.description.abstractStencils are an important class of programs that appear in the core of many scientific and general-purpose applications. These compute-intensive kernels can benefit heavily from the massive compute power of accelerators like the GPGPU. However, due to the absence of any form of on-chip communication between the coarse-grain processors on a GPU, any data transfer/synchronization between the dependent tiles in stencil computations has to happen through the off-chip (global) memory, which is quite energy-expensive. In the road to exascale computing, energy is becoming an important cost metric. The need for hardware and software that can collaboratively work towards reducing energy consumption of a system is becoming more and more important. To make the execution of dense stencils more energy efficient, Rajopadhye et al. proposed the GPGPU-based accelerator called Stencil Processing Unit that introduces a simple neighbor-to-neighbor communication between the Streaming Multiprocessors (SM) on the GPU, thereby allowing some restricted data sharing between consecutive threadblocks. The SPU includes special storage units, called Communication Buffers, to orchestrate this data transfer and also provides an explicit mechanism for inter-threadblock synchronization by way of a special instruction. It claims to achieve energy-efficiency, compared to GPUs, by reducing the number of off-chip accesses in stencils which in turn reduces the dynamic energy overhead. Uguen developed a cycle-accurate performance simulator for the SPU, called SPU-Sim, and evaluated it using a matrix multiplication kernel which was not suitable for this accelerator. This work focuses on extending the SPU-Sim and evaluating the SPU architecture using a more insightful benchmark. We introduce a producer-consumer based inter-block synchronization approach on the SPU, which is more efficient than the previous global synchronization, and an overlapped multi-pass execution model in the SPU runtime system. These optimizations have been implemented into SPU-Sim. Furthermore, the existing GPUWattch power model in the simulator has been refined to provide better power estimates for the SPU architecture. The improved architecture has been evaluated using a simple 2-D stencil benchmark and we observe an average of 16% savings in dynamic energy on SPU compared to a fairly close GPU platform. Nonetheless, the total energy consumption on SPU is still comparatively high due to the static energy component. This high static energy on SPU is a direct impact of the increased leakage power of the platform resulting from the inclusion of special load/store units. Our conservative estimates indicate that replacing the current design of these L/S units with DMA engines can bring about a 15% decrease in the current leakage power of the SPU and this can help SPU outperform GPU in terms of energy.
dc.format.mediumborn digital
dc.format.mediummasters theses
dc.publisherColorado State University. Libraries
dc.rightsCopyright and other restrictions may apply. User is responsible for compliance with all applicable laws. For information about copyright law, please see
dc.titleExtending and validating the stencil processing unit
dcterms.rights.dplaThis Item is protected by copyright and/or related rights ( You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s). and Computer Engineering State University of Science (M.S.)


Original bundle
Now showing 1 - 1 of 1
Thumbnail Image
368.42 KB
Adobe Portable Document Format