Resource management for extreme scale high performance computing systems in the presence of failures

Dauwe, Daniel, author; Pasricha, Sudeep, advisor; Siegel, H. J., advisor; Maciejewski, Anthony A., committee member; Burns, Patrick J., committee member

dc.contributor.author	Dauwe, Daniel, author
dc.contributor.author	Pasricha, Sudeep, advisor
dc.contributor.author	Siegel, H. J., advisor
dc.contributor.author	Maciejewski, Anthony A., committee member
dc.contributor.author	Burns, Patrick J., committee member
dc.date.accessioned	2018-09-10T20:05:51Z
dc.date.available	2018-09-10T20:05:51Z
dc.date.issued	2018
dc.description.abstract	High performance computing (HPC) systems, such as data centers and supercomputers, coordinate the execution of large-scale applications across tens or hundreds of thousands of multicore processors. Unfortunately, as HPC systems continue to grow toward exascale, they experience exponential growth in the number of failures. These failures reduce performance and increase energy use, lowering the efficiency and effectiveness of emerging extreme-scale HPC systems. Applications executing in parallel on individual multicore processors also suffer decreased performance and increased energy use because they are forced to share resources; in particular, contention among multiple application threads sharing the last-level cache degrades performance. These challenges make it increasingly important to characterize and optimize the performance and behavior of applications that execute in these systems. To address these challenges, in this dissertation we propose a framework for intelligently characterizing and managing extreme-scale HPC system resources. We devise various techniques to mitigate the negative effects of failures and resource contention in HPC systems. In particular, we develop new HPC resource management techniques that intelligently utilize system resources through (a) the optimal scheduling of applications to HPC nodes and (b) the optimal configuration of fault resilience protocols. These resource management techniques employ information obtained from historical analysis as well as theoretical and machine learning methods for predictions. We use these data to characterize system performance, energy use, and application behavior when operating under the uncertainty of performance degradation from both system failures and resource contention.
We investigate how to better characterize and model the negative effects of system failures and application co-location on large-scale HPC systems. Our analysis of application and system behavior also examines the interrelated effects of applications' network usage and fault resilience protocols; checkpoint interval selection and its sensitivity to system parameters for various checkpoint-based fault resilience protocols; and performance comparisons of promising fault resilience strategies for exascale-sized systems.
dc.format.medium	born digital
dc.format.medium	doctoral dissertations
dc.identifier	Dauwe_colostate_0053A_15091.pdf
dc.identifier.uri	https://hdl.handle.net/10217/191485
dc.identifier.uri	https://doi.org/10.25675/3.021726
dc.language	English
dc.language.iso	eng
dc.publisher	Colorado State University. Libraries
dc.relation.ispartof	2000-2019
dc.rights	Copyright and other restrictions may apply. User is responsible for compliance with all applicable laws. For information about copyright law, please see https://libguides.colostate.edu/copyright.
dc.subject	high performance computing
dc.subject	HPC resilience
dc.subject	application performance modeling
dc.subject	resource management
dc.subject	HPC networking
dc.title	Resource management for extreme scale high performance computing systems in the presence of failures
dc.type	Text
dcterms.rights.dpla	This Item is protected by copyright and/or related rights (https://rightsstatements.org/vocab/InC/1.0/). You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).
thesis.degree.discipline	Electrical and Computer Engineering
thesis.degree.grantor	Colorado State University
thesis.degree.level	Doctoral
thesis.degree.name	Doctor of Philosophy (Ph.D.)

Files

Original bundle

Name: Dauwe_colostate_0053A_15091.pdf
Size: 7.64 MB
Format: Adobe Portable Document Format

Collections

2000-2019
Theses and Dissertations