Throughput optimization techniques for heterogeneous architectures

Derumigny, Nicolas, author
Pouchet, Louis-Noël, advisor
Rastello, Fabrice, advisor
Hack, Sebastian, committee member
Rohou, Erven, committee member
Malaiya, Yashwant, committee member
Ortega, Francisco, committee member
Pétrot, Frédéric, committee member
Wilson, James, committee member
Zaks, Ayal, committee member
Colorado State University. Libraries
2024-05-27
2024
eng
Text
https://hdl.handle.net/10217/238465
https://doi.org/10.25675/3.02989
born digital
doctoral dissertations
Copyright and other restrictions may apply. User is responsible for compliance with all applicable laws. For information about copyright law, please see https://libguides.colostate.edu/copyright.
Abstract in English and French.

Over the past 40 years, Moore's Law has enabled an exponential increase in the transistor density of integrated circuits. As a result, computing devices ranging from general-purpose processors to dedicated accelerators have become increasingly complex, owing to the specialization and multiplication of their compute units. Both low-level program optimization (e.g., assembly-level programming and code generation) and accelerator design must therefore solve the problem of efficiently mapping the computations of an input program onto the various capabilities of the chip. However, the blueprints of real-world chips are not openly accessible in practice, and their documentation is often incomplete. Given the diversity of available CPUs (Intel, AMD, and Arm microarchitectures), this manuscript tackles the problem of automatically inferring a performance model applicable to fine-grain throughput optimization of regular programs. Furthermore, when performance gains of an order of magnitude over generic accelerators are needed, domain-specific accelerators must be considered, which raises the same question of how many dedicated units to provide and what functionality they should have.
To address this issue, we present two complementary approaches: on the one hand, the study of single-application specialized accelerators with an emphasis on hardware reuse, and, on the other hand, the generation of semi-specialized designs suited to a user-defined set of applications.