Energy- and thermal-aware resource management for heterogeneous high-performance computing systems
Date
2016
Authors
Oxley, Mark, author
Siegel, H. J., advisor
Pasricha, Sudeep, advisor
Maciejewski, Anthony A., committee member
Whitley, Darrell, committee member
Abstract
Today's high-performance computing (HPC) systems face the issue of balancing electricity (energy) use and performance. Rising energy costs are forcing system operators to either operate within an energy budget or to reduce energy use as much as possible while still maintaining performance-based service agreements. Energy-aware resource management is one method for solving such problems. Resource management in the context of high-performance computing refers to the process of assigning and scheduling workloads to resources (e.g., compute nodes). Because the cooling systems in HPC facilities also consume a considerable amount of energy, it is important to consider the computer room air conditioning (CRAC) units as a controllable resource and to study the relationship (and energy consumption impact) between the computing and cooling systems. In this thesis, we present four primary contributing studies with differing environments and novel techniques designed for each of those environments. Each study proposes new ideas in the field of energy- and thermal-aware resource management for heterogeneous HPC systems. Our first contribution explores the problem of assigning a collection of independent tasks ("bag-of-tasks") to a heterogeneous HPC system in an energy-aware manner, where task execution times vary. We propose two new measures that consider these uncertainties with respect to makespan and energy: makespan-robustness and energy-robustness. We design resource management heuristics to either: (a) maximize makespan-robustness within an energy-robustness constraint, or (b) maximize energy-robustness within a makespan-robustness constraint. Our next contribution studies a rate-based environment where task execution rates are assigned to compute cores within the HPC facility. The performance measure in this study is the reward rate earned for executing tasks. 
We analyze the impact that co-location interference (i.e., the performance degradation experienced when tasks are simultaneously executing on cores that share memory resources) has on the reward rate. Novel heuristics are designed that maximize the reward rate under power and thermal constraints, considering the interactions between both computing and cooling systems. As part of the third contribution, we design new techniques for a geographical load distribution problem. That is, our proposed techniques intelligently distribute the workload to data centers located in different geographical regions that have varying energy prices and amounts of renewable energy available. The novel techniques we propose use knowledge of co-location interference, thermal models, varying energy prices, and available renewable energy at each data center to minimize monetary energy costs while ensuring all tasks in the workload are completed. Our final contribution is a new energy- and thermal-aware runtime framework designed to maximize reward earned from completing individual tasks by their deadlines within energy and thermal constraints. Thermal-aware resource management strategies often consult thermal models to determine which cores in the HPC facility should be assigned workloads. However, the time required to perform the thermal model calculations can be prohibitive in a runtime environment. Therefore, we propose a novel offline-assisted online resource management technique where the online resource manager uses information obtained from offline-generated solutions to help in its thermal-aware decision making.
Subject
datacenter
optimization
energy
cooling