Towards automatic compilation for energy efficient iterative stencil computations
Date
2016
Authors
Zou, Yun, author
Rajopadhye, Sanjay, advisor
Strout, Michelle M., committee member
Anderson, Chuck W., committee member
Gao, Xinfeng, committee member
Abstract
Today, energy has become a critical concern in all aspects of computing. In this thesis, we address the energy efficiency of an important class of programs called "Stencil Computations", which occur frequently in a wide variety of scientific applications. We target compute-intensive stencil computations and seek to automatically produce code that minimizes energy consumption. Our work addresses the two main contributors to energy consumption -- dynamic memory energy and static energy -- which are proportional to the number of off-chip memory accesses and to the execution time, respectively.
We first target dynamic energy consumption and propose an energy-efficient tiling and parallelization strategy called Flattened Multi-Pass Parallelization (FMPP), which seeks to minimize the total number of off-chip memory accesses without sacrificing execution time. The strategy uses two-level tiling: it first partitions the iteration space into "passes", then tiles each pass, and executes the passes in a "non-synchronized", overlapped fashion. Producing such code is beyond the capability of current tiled code generators, because the schedules involved are polynomial and therefore more general than multidimensional schedules. We present a parametric tiled code generation algorithm for the FMPP strategy, targeting programs with parallelogram-shaped iteration spaces.
Next, we seek to reduce static energy consumption by further improving the performance of the generated code. We find that existing production compilers fail to vectorize parametrically tiled code efficiently, even though vectorization is critical to the compiled program's performance. We propose a compilation method for parametrically tiled stencil computations that systematically vectorizes the loops with short-vector intrinsics. Our method targets the non-boundary full tiles, trades off register loads against register-reorganization operations, enables vector register reuse within and across vectorized computations, and incorporates temporary buffering and memory padding to align memory accesses. We developed a semi-automatic code generation framework that supports both our memory-efficient strategy and our vectorizing compilation method. The framework exposes a number of configurable optimization choices (e.g., the trade-off between data-reorganization instructions and the number of aligned loads, the tiling and parallelization strategy, etc.).
We evaluate our strategy on several modern Intel architectures with a set of stencil benchmarks. Our experimental results show that the energy-efficient tiling and parallelization strategy significantly reduces dynamic memory energy consumption across platforms: by about 74% on an 8-core Xeon E5-2650 v2, 75% on a 6-core Xeon E5-2620 v2, and 67% on a 6-core Xeon E5-2620 v3. This reduces the total energy consumption of the program by 2% to 14%. Our vectorized code also shows significant performance improvement over existing compilers: an average of 34% improvement for Jacobi 1D across all platforms, and up to 40% for some 2D stencils. With the savings in both static energy and dynamic memory energy, we reduce total energy consumption by 20% on average for 2D stencils on the Xeon E5-2620 v3 platform. The tuning space for our experiments is fairly large (covering both optimization choices and tile sizes), and exhaustively searching it is extremely time-consuming.
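As a rough illustration of the loop structure described above (not the thesis's generated code), the sketch below arranges a Jacobi-1D stencil into an outer level that groups time steps into passes, an inner level that tiles the space dimension and runs tiles in parallel, and an innermost kernel vectorized with AVX intrinsics using unaligned neighbour loads. For brevity it synchronizes after every time step instead of skewing and overlapping tiles as FMPP does, and the names and sizes (N, T, PASS_T, TILE_W, sweep_avx) are hypothetical placeholders.

```c
#include <immintrin.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical problem and tiling parameters; all names are illustrative. */
#define N      (1L << 20)   /* spatial points                                  */
#define T      512          /* time steps                                      */
#define PASS_T 64           /* time steps per pass (first-level tiling)        */
#define TILE_W 4096         /* spatial tile width inside a pass (second level) */

/* One Jacobi-1D sweep over [lo, hi) using AVX intrinsics on doubles. */
static void sweep_avx(const double *in, double *out, long lo, long hi)
{
    const __m256d third = _mm256_set1_pd(1.0 / 3.0);
    long i = lo;
    for (; i + 4 <= hi; i += 4) {
        __m256d l = _mm256_loadu_pd(&in[i - 1]);   /* unaligned neighbour loads */
        __m256d c = _mm256_loadu_pd(&in[i]);
        __m256d r = _mm256_loadu_pd(&in[i + 1]);
        __m256d s = _mm256_mul_pd(_mm256_add_pd(_mm256_add_pd(l, c), r), third);
        _mm256_storeu_pd(&out[i], s);
    }
    for (; i < hi; i++)                            /* scalar remainder */
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0;
}

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    for (long i = 0; i < N; i++) a[i] = (double)i;
    memcpy(b, a, N * sizeof *b);                   /* keep boundary values fixed */

    /* First level: group time steps into passes.  In FMPP a pass is sized so its
       working set stays on chip, reducing off-chip memory accesses.             */
    for (long t0 = 0; t0 < T; t0 += PASS_T) {
        long tend = (t0 + PASS_T < T) ? t0 + PASS_T : T;
        for (long t = t0; t < tend; t++) {
            /* Second level: tiles inside the pass execute in parallel. */
            #pragma omp parallel for schedule(static)
            for (long x0 = 1; x0 < N - 1; x0 += TILE_W) {
                long xend = (x0 + TILE_W < N - 1) ? x0 + TILE_W : N - 1;
                sweep_avx(a, b, x0, xend);
            }
            double *tmp = a; a = b; b = tmp;       /* swap time planes */
        }
    }

    free(a); free(b);
    return 0;
}
```

Compiled with, e.g., `gcc -O3 -mavx -fopenmp`, this reproduces only the nesting; the actual FMPP code generator emits skewed, parameterized tile bounds so that tiles in consecutive passes can proceed without global synchronization.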
In our work, we also take a first step toward building an autotuner for our framework. We propose using Artificial Neural Networks to assist the tuning process and present a study of performance tuning with their assistance. Our results show that an Artificial Neural Network has great potential to accurately predict performance and can significantly reduce the search space.
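As a minimal sketch of how such a predictor could prune the tuning space, the fragment below evaluates a tiny one-hidden-layer network on normalized configuration features (pass height, tile width, a data-reorganization flag) and ranks candidate configurations, so that only the most promising ones would actually be compiled and timed. The topology, features, and weights are placeholders, not the thesis's trained model; in practice the weights would be fitted to measured runs.

```c
#include <stdio.h>
#include <math.h>

#define NFEAT 3   /* normalized features: pass height, tile width, reorg flag */
#define NHID  4   /* hidden units (placeholder size) */

/* Placeholder weights; a real autotuner would train these on measured runs. */
static const double W1[NHID][NFEAT] = {
    { 0.8, -0.3,  0.1}, {-0.5,  0.9,  0.2},
    { 0.4,  0.4, -0.6}, {-0.2, -0.7,  0.5}
};
static const double b1[NHID] = { 0.1, -0.1, 0.05, 0.0 };
static const double W2[NHID] = { 0.6, -0.4, 0.7, 0.3 };
static const double b2 = 0.2;

/* Forward pass of a one-hidden-layer MLP: returns a predicted performance score. */
static double predict(const double x[NFEAT])
{
    double y = b2;
    for (int h = 0; h < NHID; h++) {
        double a = b1[h];
        for (int f = 0; f < NFEAT; f++) a += W1[h][f] * x[f];
        y += W2[h] * fmax(a, 0.0);                 /* ReLU activation */
    }
    return y;
}

int main(void)
{
    /* Candidate (pass height, tile width, reorg) configurations, normalized to [0,1]. */
    const double cand[][NFEAT] = {
        {0.25, 0.50, 0.0}, {0.50, 0.25, 1.0}, {0.75, 0.75, 0.0}, {1.00, 0.50, 1.0}
    };
    int n = (int)(sizeof cand / sizeof cand[0]);

    /* Rank candidates by predicted score; only the best few would be run and timed. */
    for (int i = 0; i < n; i++)
        printf("candidate %d: predicted score %.3f\n", i, predict(cand[i]));
    return 0;
}
```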
Subject
automatic parallelization
multi-level tiling
vectorization
auto-tuning
artificial neural network
parametric tiling