Extending and validating the stencil processing unit

Rajasree, Revathy, author; Rajopadhye, Sanjay, advisor; Pasricha, Sudeep, committee member; Anderson, Charles W., committee member

Extending and validating the stencil processing unit

Files

Rajasree_colostate_0053N_13732.pdf (368.42 KB)

Date

2016

Authors

Rajasree, Revathy, author

Rajopadhye, Sanjay, advisor

Pasricha, Sudeep, committee member

Anderson, Charles W., committee member

Abstract

Stencils are an important class of programs that appear in the core of many scientific and general-purpose applications. These compute-intensive kernels can benefit heavily from the massive compute power of accelerators like the GPGPU. However, due to the absence of any form of on-chip communication between the coarse-grain processors on a GPU, any data transfer/synchronization between the dependent tiles in stencil computations has to happen through the off-chip (global) memory, which is quite energy-expensive. In the road to exascale computing, energy is becoming an important cost metric. The need for hardware and software that can collaboratively work towards reducing energy consumption of a system is becoming more and more important. To make the execution of dense stencils more energy efficient, Rajopadhye et al. proposed the GPGPU-based accelerator called Stencil Processing Unit that introduces a simple neighbor-to-neighbor communication between the Streaming Multiprocessors (SM) on the GPU, thereby allowing some restricted data sharing between consecutive threadblocks. The SPU includes special storage units, called Communication Buffers, to orchestrate this data transfer and also provides an explicit mechanism for inter-threadblock synchronization by way of a special instruction. It claims to achieve energy-efficiency, compared to GPUs, by reducing the number of off-chip accesses in stencils which in turn reduces the dynamic energy overhead. Uguen developed a cycle-accurate performance simulator for the SPU, called SPU-Sim, and evaluated it using a matrix multiplication kernel which was not suitable for this accelerator. This work focuses on extending the SPU-Sim and evaluating the SPU architecture using a more insightful benchmark. We introduce a producer-consumer based inter-block synchronization approach on the SPU, which is more efficient than the previous global synchronization, and an overlapped multi-pass execution model in the SPU runtime system. These optimizations have been implemented into SPU-Sim. Furthermore, the existing GPUWattch power model in the simulator has been refined to provide better power estimates for the SPU architecture. The improved architecture has been evaluated using a simple 2-D stencil benchmark and we observe an average of 16% savings in dynamic energy on SPU compared to a fairly close GPU platform. Nonetheless, the total energy consumption on SPU is still comparatively high due to the static energy component. This high static energy on SPU is a direct impact of the increased leakage power of the platform resulting from the inclusion of special load/store units. Our conservative estimates indicate that replacing the current design of these L/S units with DMA engines can bring about a 15% decrease in the current leakage power of the SPU and this can help SPU outperform GPU in terms of energy.

Subject

CUDA

GPGPU

stencil

energy-efficiency

accelerator

multi-pass

URI

http://hdl.handle.net/10217/176694
https://doi.org/10.25675/3.022680

Collections

2000-2019
Theses and Dissertations

Full item page

Extending and validating the stencil processing unit

Files

Date

Authors

Journal Title

Journal ISSN

Volume Title

Abstract

Description

Rights Access

Subject

Citation

URI

Collections

Endorsement

Review

Supplemented By

Referenced By