Browsing by Author "Siegel, H. J., advisor"
Now showing 1 - 7 of 7
Item Open Access
A hierarchical framework for energy-efficient resource management in green data centers (Colorado State University. Libraries, 2015)
Jonardi, Eric, author; Pasricha, Sudeep, advisor; Siegel, H. J., advisor; Howe, Adele, committee member

Data centers and high performance computing systems are increasing in both size and number. The massive electricity consumption of these systems results in huge electricity costs, a trend that will become commercially unsustainable as systems grow even larger. Optimizations to improve energy efficiency and reduce electricity costs can be implemented at multiple system levels, and are explored in this thesis at the server node, data center, and geo-distributed data center levels. Frameworks are proposed for each level to improve energy efficiency and reduce electricity costs. As the core count in processors continues to rise, applications are increasingly experiencing performance degradation due to co-location interference arising from contention for shared resources. The first part of this thesis proposes a methodology for modeling these co-location interference effects to enable accurate predictions of execution time for co-located applications, reducing or even eliminating the need to over-provision server resources to meet quality-of-service requirements, and improving overall system efficiency. In the second part of this thesis, a thermal-, power-, and machine-heterogeneity-aware resource allocation framework is proposed for a single data center to reduce both total server power and the power required to cool the data center, while maximizing the reward of the executed workload in over-subscribed scenarios. The final part of this thesis explores the optimization of geo-distributed data centers, which are growing in number with the rise of cloud computing. A geographical load balancing framework with time-of-use pricing and integrated renewable power is designed, and it is demonstrated how increasing the detail of system knowledge and considering all system levels simultaneously can significantly improve electricity cost savings for geo-distributed systems.
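The first part of the Jonardi thesis, as summarized above, models co-location interference to predict execution times of co-located applications. The abstract does not specify the model itself; the sketch below is only a minimal illustration of the general idea, with invented application names, solo runtimes, and pairwise slowdown factors, and with the simplifying assumption that pairwise slowdowns compose multiplicatively.

```python
# Minimal sketch of co-location interference prediction (illustrative only;
# not the thesis's actual model). An application's runtime is predicted as
# its solo runtime scaled by slowdown factors measured for each co-runner
# sharing the node's resources.

# Hypothetical profiled data: solo runtimes (seconds) and pairwise slowdowns.
SOLO_RUNTIME = {"lbm": 120.0, "mcf": 200.0, "namd": 90.0}
PAIR_SLOWDOWN = {("lbm", "mcf"): 1.35, ("lbm", "namd"): 1.10, ("mcf", "namd"): 1.20}

def pair_slowdown(a, b):
    """Slowdown factor application `a` suffers when co-located with `b`."""
    return PAIR_SLOWDOWN.get(tuple(sorted((a, b))), 1.0)

def predict_runtime(app, co_runners):
    """Predict the runtime of `app` when sharing a node with `co_runners`,
    assuming pairwise slowdowns compose multiplicatively."""
    factor = 1.0
    for other in co_runners:
        factor *= pair_slowdown(app, other)
    return SOLO_RUNTIME[app] * factor

if __name__ == "__main__":
    # 120 * 1.35 * 1.10, i.e., about 178.2 seconds instead of 120 solo
    print(predict_runtime("lbm", ["mcf", "namd"]))
```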
Item Open Access
Dynamic resource management in heterogeneous systems: maximizing utility, value, and energy-efficiency (Colorado State University. Libraries, 2021)
Machovec, Dylan, author; Siegel, H. J., advisor; Maciejewski, Anthony A., committee member; Pasricha, Sudeep, committee member; Burns, Patrick, committee member

The need for high performance computing (HPC) resources is rapidly expanding throughout many technical fields, but there are finite resources available to meet this demand. To address this, it is important to effectively manage these resources to ensure that as much useful work as possible is completed. In this research, HPC systems executing parallel jobs are considered with and without energy constraints. Additionally, the case where preemption is available is considered for HPC systems executing only serial jobs. Dynamic resource management techniques are designed, evaluated, and compared in heterogeneous environments to assign jobs to HPC nodes. These techniques are evaluated based on system-wide performance measures (value or utility), which quantify the amount of useful work accomplished by the HPC system. Near real-time heuristics are designed to optimize performance in specific environments, and the best-performing techniques are combined using intelligent metaheuristics that dynamically switch between heuristics based on the characteristics of the current environment. Resource management techniques are also designed for the assignment of unmanned aerial vehicles (UAVs) to surveil targets, where performance is characterized by a value-based performance measure and each UAV is constrained in its total energy consumption.

Item Open Access
Energy- and thermal-aware resource management for heterogeneous high-performance computing systems (Colorado State University. Libraries, 2016)
Oxley, Mark, author; Siegel, H. J., advisor; Pasricha, Sudeep, advisor; Maciejewski, Anthony A., committee member; Whitley, Darrell, committee member

Today's high-performance computing (HPC) systems face the issue of balancing electricity (energy) use and performance. Rising energy costs are forcing system operators either to operate within an energy budget or to reduce energy use as much as possible while still maintaining performance-based service agreements. Energy-aware resource management is one method for solving such problems. Resource management in the context of high-performance computing refers to the process of assigning and scheduling workloads to resources (e.g., compute nodes). Because the cooling systems in HPC facilities also consume a considerable amount of energy, it is important to consider the computer room air conditioning (CRAC) units as a controllable resource and to study the relationship (and energy consumption impact) between the computing and cooling systems. In this thesis, we present four primary contributing studies with differing environments and novel techniques designed for each of those environments. Each study proposes new ideas in the field of energy- and thermal-aware resource management for heterogeneous HPC systems. Our first contribution explores the problem of assigning a collection of independent tasks (a "bag-of-tasks") to a heterogeneous HPC system in an energy-aware manner, where task execution times vary. We propose two new measures that consider these uncertainties with respect to makespan and energy: makespan-robustness and energy-robustness. We design resource management heuristics to either (a) maximize makespan-robustness within an energy-robustness constraint, or (b) maximize energy-robustness within a makespan-robustness constraint. Our next contribution studies a rate-based environment where task execution rates are assigned to compute cores within the HPC facility. The performance measure in this study is the reward rate earned for executing tasks. We analyze the impact that co-location interference (i.e., the performance degradation experienced when tasks are simultaneously executing on cores that share memory resources) has on the reward rate. Novel heuristics are designed that maximize the reward rate under power and thermal constraints, considering the interactions between the computing and cooling systems. As part of the third contribution, we design new techniques for a geographical load distribution problem. That is, our proposed techniques intelligently distribute the workload to data centers located in different geographical regions that have varying energy prices and varying amounts of available renewable energy. The novel techniques we propose use knowledge of co-location interference, thermal models, varying energy prices, and available renewable energy at each data center to minimize monetary energy costs while ensuring that all tasks in the workload are completed. Our final contribution is a new energy- and thermal-aware runtime framework designed to maximize the reward earned from completing individual tasks by their deadlines within energy and thermal constraints. Thermal-aware resource management strategies often consult thermal models to intelligently determine which cores in the HPC facility should be assigned workloads. However, the time required to perform the thermal model calculations can be prohibitive in a runtime environment. Therefore, we propose a novel offline-assisted online resource management technique in which the online resource manager uses information obtained from offline-generated solutions to help in its thermal-aware decision making.
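Oxley's first contribution introduces makespan-robustness and energy-robustness for bags of tasks with uncertain execution times. The abstract does not give their exact definitions; one natural reading of such measures is the probability that a fixed task-to-machine mapping stays within a makespan bound or an energy bound, which the hedged sketch below estimates by Monte Carlo sampling. The mapping, execution-time distributions, power values, and bounds are all invented for illustration.

```python
# Illustrative Monte Carlo estimate of makespan- and energy-robustness for a
# fixed mapping of independent tasks to heterogeneous machines. Distributions,
# power draws, and bounds are invented; this only shows the shape of such a
# robustness calculation, not the thesis's actual measures.

import random

random.seed(0)

# MAPPING[machine] = list of (mean_exec_time_s, stddev_s) for its tasks
MAPPING = {
    "m0": [(10.0, 2.0), (12.0, 3.0)],
    "m1": [(20.0, 4.0)],
}
POWER_W = {"m0": 150.0, "m1": 220.0}   # hypothetical per-machine power draw
MAKESPAN_BOUND_S = 30.0
ENERGY_BOUND_J = 9000.0

def robustness(trials=10000):
    """Estimate P(makespan <= bound) and P(energy <= bound) by sampling."""
    meet_makespan = meet_energy = 0
    for _ in range(trials):
        finish = {m: sum(max(0.0, random.gauss(mu, sd)) for mu, sd in tasks)
                  for m, tasks in MAPPING.items()}
        makespan = max(finish.values())
        energy = sum(POWER_W[m] * finish[m] for m in MAPPING)  # busy-time energy only
        meet_makespan += makespan <= MAKESPAN_BOUND_S
        meet_energy += energy <= ENERGY_BOUND_J
    return meet_makespan / trials, meet_energy / trials

if __name__ == "__main__":
    mk_rob, en_rob = robustness()
    print(f"makespan-robustness ~ {mk_rob:.3f}, energy-robustness ~ {en_rob:.3f}")
```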
Item Open Access
Heterogeneous computing environment characterization and thermal-aware scheduling strategies to optimize data center power consumption (Colorado State University. Libraries, 2012)
Al-Qawasmeh, Abdulla, author; Siegel, H. J., advisor; Maciejewski, Anthony A., advisor; Pasricha, Sudeep, committee member; Wang, Haonan, committee member

Many computing systems are heterogeneous, both in terms of the performance of their machines and in terms of the characteristics and computational complexity of the tasks that execute on them. Furthermore, different tasks are better suited to execute on specific types of machines. Optimally mapping tasks to machines in a heterogeneous computing (HC) system is, in general, an NP-complete problem. In most cases, heuristics are used to find near-optimal mappings. The performance of allocation heuristics can be affected significantly by factors such as task and machine heterogeneities. In this thesis, different measures are identified for quantifying the heterogeneity of HC systems, and the correlation between the performance of the heuristics and these measures is shown. The power consumption of data centers has been increasing at a rapid rate over the past few years. Motivated by the need to reduce the power consumption of data centers, many researchers have been investigating methods to increase energy efficiency in computing at different levels: chip, server, rack, and data center. Many of today's data centers experience physical limitations on the power needed to run the data center. The first problem studied in this thesis is maximizing the performance of a data center that is subject to total power consumption and thermal constraints. A power model for a data center that includes the power consumed in both Computer Room Air Conditioning (CRAC) units and compute nodes is considered. The approach in this thesis quantifies the performance of the data center as the total reward collected from completing tasks in a workload by their individual deadlines. The second problem studied in this research is how to minimize the power consumption in a data center while guaranteeing that the overall performance does not drop below a specified threshold. For both problems, novel optimization techniques are developed for assigning the performance states (P-states) of compute cores at the data center level to optimize the operation of the data center. The assignment techniques are divided into two stages. The first stage assigns the P-states of cores, the desired number of tasks per unit time allocated to each core, and the CRAC outlet temperatures. The second stage assigns individual tasks, as they arrive at the data center, to cores so that the actual number of tasks per unit time allocated to a core approaches the desired number set by the first stage.
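The Al-Qawasmeh abstract describes a two-stage assignment in which the first stage fixes P-states, desired per-core task rates, and CRAC outlet temperatures, and the second stage steers arriving tasks so that observed per-core rates approach those targets. The dispatching rule itself is not given in the abstract; the sketch below shows one plausible rule (send each arriving task to the core whose observed rate lags its target the most), with all target rates and the arrival pattern invented.

```python
# Illustrative second-stage dispatcher: stage one (not shown) has fixed each
# core's P-state and desired task rate; each arriving task is sent to the core
# whose observed dispatch rate falls furthest below its stage-one target.
# Targets and arrivals below are invented for illustration.

DESIRED_RATE = {"core0": 2.0, "core1": 1.0, "core2": 0.5}  # tasks per second
dispatched = {c: 0 for c in DESIRED_RATE}                  # tasks sent so far

def dispatch(now_s):
    """Pick the core with the largest deficit between its desired task rate
    and the rate actually observed so far, then record the assignment."""
    def deficit(core):
        observed = dispatched[core] / now_s if now_s > 0 else 0.0
        return DESIRED_RATE[core] - observed
    core = max(DESIRED_RATE, key=deficit)
    dispatched[core] += 1
    return core

if __name__ == "__main__":
    for t in range(1, 11):          # one task arriving each second
        print(t, dispatch(float(t)))
```

Chasing the largest rate deficit keeps the realized per-core rates close to the stage-one targets without re-solving the stage-one optimization on every arrival, which matches the spirit, though not necessarily the letter, of the two-stage split described above.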
Item Open Access
Resource management for extreme scale high performance computing systems in the presence of failures (Colorado State University. Libraries, 2018)
Dauwe, Daniel, author; Pasricha, Sudeep, advisor; Siegel, H. J., advisor; Maciejewski, Anthony A., committee member; Burns, Patrick J., committee member

High performance computing (HPC) systems, such as data centers and supercomputers, coordinate the execution of large-scale computation of applications over tens or hundreds of thousands of multicore processors. Unfortunately, as the size of HPC systems continues to grow towards exascale complexities, these systems experience an exponential growth in the number of failures occurring in the system. These failures reduce performance and increase energy use, reducing the efficiency and effectiveness of emerging extreme-scale HPC systems. Applications executing in parallel on individual multicore processors also suffer from decreased performance and increased energy use as a result of being forced to share resources; in particular, contention from multiple application threads sharing the last-level cache causes performance degradation. These challenges make it increasingly important to characterize and optimize the performance and behavior of applications that execute in these systems. To address these challenges, in this dissertation we propose a framework for intelligently characterizing and managing extreme-scale HPC system resources. We devise various techniques to mitigate the negative effects of failures and resource contention in HPC systems. In particular, we develop new HPC resource management techniques for intelligently utilizing system resources through (a) the optimal scheduling of applications to HPC nodes and (b) the optimal configuration of fault resilience protocols. These resource management techniques employ information obtained from historical analysis as well as theoretical and machine learning methods for prediction. We use these data to characterize system performance, energy use, and application behavior when operating under the uncertainty of performance degradation from both system failures and resource contention. We investigate how to better characterize and model the negative effects of system failures, as well as of application co-location, on large-scale HPC computing systems. Our analysis of application and system behavior also investigates: the interrelated effects of application network usage and fault resilience protocols; checkpoint interval selection and its sensitivity to system parameters for various checkpoint-based fault resilience protocols; and performance comparisons of various promising strategies for fault resilience in exascale-sized systems.
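Dauwe's abstract mentions checkpoint interval selection and its sensitivity to system parameters for checkpoint-based fault resilience protocols. The dissertation's own analysis is not reproduced here; as a point of reference, the sketch below computes the classic Young first-order approximation to the optimal checkpoint interval, a common baseline in studies of this kind. The checkpoint cost and mean time between failures are invented values.

```python
# Illustrative checkpoint-interval calculation using Young's first-order
# approximation, often used as a baseline when studying checkpoint-based
# resilience protocols. The parameter values below are invented.

import math

def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's approximation: the compute time between checkpoints that
    roughly minimizes expected lost work plus checkpoint overhead."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

if __name__ == "__main__":
    C = 600.0              # hypothetical cost of writing one checkpoint (10 min)
    MTBF = 24 * 3600.0     # hypothetical system mean time between failures (1 day)
    print(f"checkpoint every ~{young_interval(C, MTBF) / 3600.0:.2f} hours")
```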
Item Open Access
Resource management in heterogeneous computing systems with tasks of varying importance (Colorado State University. Libraries, 2014)
Khemka, Bhavesh, author; Maciejewski, Anthony A., advisor; Siegel, H. J., advisor; Pasricha, Sudeep, committee member; Koenig, Gregory A., committee member; Burns, Patrick J., committee member

The problem of efficiently assigning tasks to machines in heterogeneous computing environments, where different tasks can have different levels of importance (or value) to the computing system, is a challenging one. The goal of this work is to study this problem in a variety of environments. One part of the study considers a computing system and its corresponding workload based on expectations for future environments of interest to the Department of Energy and the Department of Defense. We design heuristics to maximize a performance metric created using utility functions. We also create a framework to analyze the trade-offs between performance and energy consumption. We design techniques to maximize performance in a dynamic environment that has a constraint on energy consumption. Another part of the study explores environments that have uncertainty in the availability of the compute resources. For this part, we design heuristics and compare their performance in different types of environments.

Item Open Access
Robust resource allocation in heterogeneous parallel and distributed computing systems (Colorado State University. Libraries, 2008)
Smith, James T., II, author; Siegel, H. J., advisor; Maciejewski, A. A., advisor

In a heterogeneous distributed computing environment, it is often advantageous to allocate system resources in a manner that optimizes a given system performance measure. However, this optimization is often dependent on system parameters whose values are subject to uncertainty. Thus, an important research problem arises when system resources must be allocated given uncertainty in system parameters. Robustness can be defined as the degree to which a system can function correctly in the presence of parameter values different from those assumed. In this research, we define mathematical models of robustness in both static and dynamic stochastic environments. In addition, we model dynamic environments where estimates of system parameter values are provided as point estimates that are known to deviate substantially from their actual values. The main contributions of this research are (1) mathematical models of robustness suitable for dynamic environments based on single estimates of system parameters, (2) a mathematical model of robustness applicable to environments where the uncertainty in system parameters can be modeled stochastically, (3) a demonstration of the use of this metric to design resource allocation heuristics in a static environment, (4) a mathematical model of robustness in a stochastic dynamic environment, (5) a demonstration of the utility of this dynamic robustness metric through the design of resource allocation heuristics, and (6) the derivation of a robustness metric for evaluating resource allocation decisions in an overlay network, along with a near-optimal resource allocation technique suitable for this environment.
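Smith's abstract defines robustness as the degree to which a system can function correctly despite parameter values differing from those assumed. The sketch below illustrates one deterministic formulation that appears in related robustness literature, a "robustness radius": the smallest l2-norm deviation of the estimated task execution times that could push some machine's finishing time past an acceptable makespan bound. It is illustrative only and not necessarily the metric derived in this dissertation; the mapping and bound are invented.

```python
# Illustrative "robustness radius" for a task-to-machine mapping: the smallest
# l2-norm change in estimated task execution times that could make some
# machine's finishing time exceed the makespan bound tau. A sketch of a
# formulation from related literature, not necessarily this dissertation's
# metric; mapping and bound are invented, and each machine's estimated
# finishing time is assumed to be below tau.

import math

MAPPING = {                    # machine -> estimated task execution times (s)
    "m0": [10.0, 12.0, 8.0],
    "m1": [25.0],
}
TAU = 40.0                     # largest makespan considered acceptable

def robustness_radius(mapping, tau):
    radii = []
    for machine, times in mapping.items():
        finish = sum(times)
        # Under an l2 perturbation of this machine's task times, the nearest
        # point at which its finishing time reaches tau lies at distance
        # (tau - finish) / sqrt(number of tasks on the machine).
        radii.append((tau - finish) / math.sqrt(len(times)))
    return min(radii)          # the allocation is only as robust as its weakest machine

if __name__ == "__main__":
    print(f"robustness radius ~ {robustness_radius(MAPPING, TAU):.2f} seconds")
```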