Browsing by Author "Pasricha, Sudeep, advisor"
Now showing 1 - 20 of 26
Item Open Access A hierarchical framework for energy-efficient resource management in green data centers (Colorado State University. Libraries, 2015)
Jonardi, Eric, author; Pasricha, Sudeep, advisor; Siegel, H. J., advisor; Howe, Adele, committee member

Data centers and high performance computing systems are increasing in both size and number. The massive electricity consumption of these systems results in huge electricity costs, a trend that will become commercially unsustainable as systems grow even larger. Optimizations to improve energy-efficiency and reduce electricity costs can be implemented at multiple system levels, and are explored in this thesis at the server node, data center, and geo-distributed data center levels. Frameworks are proposed for each level to improve energy-efficiency and reduce electricity costs. As the core count in processors continues to rise, applications are increasingly experiencing performance degradation due to co-location interference arising from contention for shared resources. The first part of this thesis proposes a methodology for modeling these co-location interference effects to enable accurate predictions of execution time for co-located applications, reducing or even eliminating the need to over-provision server resources to meet quality of service requirements, and improving overall system efficiency. In the second part of this thesis a thermal-, power-, and machine-heterogeneity-aware resource allocation framework is proposed for a single data center to reduce both total server power and the power required to cool the data center, while maximizing the reward of the executed workload in over-subscribed scenarios. The final part of this thesis explores the optimization of geo-distributed data centers, which are growing in number with the rise of cloud computing. A geographical load balancing framework with time-of-use pricing and integrated renewable power is designed, and it is demonstrated how increasing the detail of system knowledge and considering all system levels simultaneously can significantly improve electricity cost savings for geo-distributed systems.
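As a rough illustration of the co-location interference modeling idea in the first part of this thesis, the sketch below predicts a co-located execution time from a task's solo runtime and the pressure its co-runners place on shared resources. The linear slowdown model, the single aggregate-pressure feature, and the coefficient are illustrative assumptions, not the methodology actually proposed.

    # Illustrative sketch only: predict a task's co-located execution time from
    # its solo runtime and the shared-resource pressure of its co-runners.
    # The linear model and the sensitivity coefficient are assumptions.

    def predict_colocated_runtime(solo_runtime_s, corunner_pressures, sensitivity=0.35):
        """corunner_pressures: per-co-runner shared cache/bandwidth pressure in [0, 1]."""
        aggregate = min(sum(corunner_pressures), 1.0)   # saturate total contention
        slowdown = 1.0 + sensitivity * aggregate        # assumed linear slowdown
        return solo_runtime_s * slowdown

    # A 120 s task sharing a node with two memory-intensive co-runners.
    print(predict_colocated_runtime(120.0, [0.6, 0.4]))  # -> 162.0 under these assumptions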
Item Open Access A semi-dynamic resource management framework for multicore embedded systems with energy harvesting (Colorado State University. Libraries, 2015)
Xiang, Yi, author; Pasricha, Sudeep, advisor; Jayasumana, Anura, committee member; Siegel, H. J., committee member; Strout, Michelle Mills, committee member

Semiconductor technology has been evolving rapidly over the past several decades, introducing a new breed of embedded systems that are tiny, efficient, and pervasive. These embedded systems are the backbone of the ubiquitous and pervasive computing revolution, embedding intelligence all around us. Often, such embedded intelligence for pervasive computing must be deployed at remote locations, for purposes of environment sensing, data processing, information transmission, etc. Compared to current mobile devices, which are mostly supported by rechargeable and exchangeable batteries, emerging embedded systems for pervasive computing favor a self-sustainable energy supply, as their remote and mass deployment makes it impractical to change or charge their batteries. The ability to sustain systems by scavenging energy from ambient sources is called energy harvesting, which is gaining momentum for its potential to enable energy autonomy in the era of pervasive computing.

Among various energy harvesting techniques, solar energy harvesting has attracted the most attention due to its high power density and availability. Another impact of semiconductor technology scaling into the deep submicron level is the shifting of design focus from performance to energy efficiency, as power dissipation on a chip cannot increase indefinitely. Due to unacceptable power consumption at high clock rates, it is desirable for computing systems to distribute workload on multiple cores with reduced execution frequencies so that overall system energy efficiency improves while meeting performance goals. Thus, it is necessary to adopt the design paradigm of multiprocessing for low-power embedded systems due to the ever-increasing demands for application performance and stringent limitations on power dissipation. In this dissertation, we focus on the problem of resource management for multicore embedded systems powered by solar energy harvesting. We have conducted a substantial amount of research on this topic, which has led to the design of a semi-dynamic resource management framework with emphasis on efficiency and flexibility that can be applied to energy harvesting-powered systems with a variety of functionality, performance, energy, and reliability goals. The capability and flexibility of the proposed semi-dynamic framework are demonstrated by the range of issues we have addressed with it, including: (i) minimizing miss rate/miss penalty of systems with energy harvesting, (ii) run-time thermal control, (iii) coping with process variation induced core-to-core heterogeneity, (iv) management of hybrid energy storage, (v) scheduling of task graphs with inter-node dependencies, (vi) addressing soft errors during execution, (vii) mitigating aging effects across the chip over time, and (viii) supporting mixed-criticality scheduling on heterogeneous processors.

Item Open Access An integrated variation-aware mapping framework for FinFET based irregular 2D MPSoCs in the dark silicon era (Colorado State University. Libraries, 2016)
Rajkrishna, Pramit, author; Pasricha, Sudeep, advisor; Jayasumana, Anura, committee member; Burns, Patrick, committee member

In the deep submicron era, process variations and dark silicon considerations have become prominent focus areas for early stage networks-on-chip (NoC) design synthesis. Additionally, FinFETs have been implemented as promising alternatives to bulk CMOS implementations for 22nm and below technology nodes to mitigate leakage power. While overall system power in a dark silicon paradigm is governed by a limitation on active cores and inter-core communication patterns, it has also become imperative to consider process variations in a holistic context for irregular 2D NoCs. Furthermore, manufacturing defects induce link failures, with resultant irregularity in the NoC topology, rendering conventional minimal routing schemes for regular topologies inoperable. In this thesis, we propose a holistic process variation-aware design-time synthesis framework (HERMES) that performs computation and communication mapping while minimizing energy consumption and maximizing Power Performance Yield (PPY). The framework targets a 22nm FinFET based homogeneous NoC implementation with design-time link failures in the NoC fabric, a dark silicon based power constraint, and system bandwidth constraints for performance guarantees, while preserving connectivity and deadlock freedom in the NoC fabric.
Our experimental results show that HERMES performs 1.32x better in energy, 1.29x better in simulation execution time, and 58.44% better in PPY statistics over other state-of-the-art mapping techniques for various SPLASH2 and PARSEC parallel benchmarks.

Item Open Access An intelligent, mobile network aware middleware framework for energy efficient offloading in smartphones (Colorado State University. Libraries, 2017)
Khune, Aditya Dilip, author; Pasricha, Sudeep, advisor; Jayasumana, Anura P., committee member; Gesumaria, Bob, committee member

Offloading mobile computations is an innovative technique that is being explored by researchers for reducing energy consumption in mobile devices and for achieving better application response time. Offloading refers to the act of transferring computations from a mobile device to servers in the cloud. There are many challenges in this domain that are not dealt with effectively yet, and thus offloading is far from being adopted in the design of current mobile architectures. We believe that there is a need to verify the effectiveness of computation offloading in terms of both response time and energy consumption, to highlight its potential in real smartphone applications. The effect of varying network technologies such as 3G, 4G, and Wi-Fi on the performance of offloading systems is also a major concern that needs to be addressed. In this thesis, we study the behavior of a set of real smartphone applications, in both local and offload processing modes. Our experiments identify the advantages and disadvantages of offloading for various mobile networks. Further, we propose a middleware framework that uses Reinforcement Learning to make reward-based offloading decisions effectively. Our framework allows a smartphone to consider suitable contextual information to determine when it makes sense to offload, and to select between available networks (3G, 4G, or Wi-Fi) when offloading mode is active. We tested our framework in both simulated and real environments, across various applications, to demonstrate how energy consumption can be minimized in mobile systems that are capable of supporting offloading.
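As a minimal illustration of how a reward-based offloading policy of this kind can be structured, the sketch below uses tabular Q-learning to choose between local execution and offloading over 3G, 4G, or Wi-Fi. The state encoding, reward weighting, and learning constants are illustrative assumptions rather than the framework's actual design.

    import random
    from collections import defaultdict

    # Tabular Q-learning sketch of a reward-based offloading policy; the state
    # features, reward definition, and constants are assumptions for illustration.
    ACTIONS = ["local", "offload_3g", "offload_4g", "offload_wifi"]
    q_table = defaultdict(float)                 # (state, action) -> estimated value
    alpha, gamma, epsilon = 0.1, 0.9, 0.1

    def choose_action(state):
        if random.random() < epsilon:            # occasionally explore
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: q_table[(state, a)])

    def update(state, action, reward, next_state):
        best_next = max(q_table[(next_state, a)] for a in ACTIONS)
        q_table[(state, action)] += alpha * (reward + gamma * best_next
                                             - q_table[(state, action)])

    def reward_from(energy_joules, latency_s, w_energy=1.0, w_latency=0.5):
        return -(w_energy * energy_joules + w_latency * latency_s)  # lower cost, higher reward

    state = ("wifi_available", "large_input")
    action = choose_action(state)
    update(state, action, reward_from(2.1, 0.8), ("wifi_available", "small_input"))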
Item Open Access Anomaly detection with machine learning for automotive cyber-physical systems (Colorado State University. Libraries, 2022)
Thiruloga, Sooryaa Vignesh, author; Pasricha, Sudeep, advisor; Kim, Ryan, committee member; Ray, Indrakshi, committee member

Today's automotive systems are evolving at a rapid pace and there has been a seismic shift in automotive technology in the past few years. Automakers are racing to redefine the automobile as a fully autonomous and connected system. As a result, new technologies such as advanced driver assistance systems (ADAS), vehicle-to-vehicle (V2V), 5G vehicle to infrastructure (V2I), and vehicle to everything (V2X), etc. have emerged in recent years. These advances have resulted in increased responsibilities for the electronic control units (ECUs) in the vehicles, requiring a more sophisticated in-vehicle network to address the growing communication needs of ECUs with each other and external subsystems. This in turn has transformed modern vehicles into a complex distributed cyber-physical system. The ever-growing connectivity to external systems in such vehicles is introducing new challenges related to the increasing vulnerability of such vehicles to various cyber-attacks.

A malicious actor can use various access points in a vehicle, e.g., Bluetooth and USB ports, telematic systems, and OBD-II ports, to gain unauthorized access to the in-vehicle network; together, these access points form the vehicle's attack surface. After gaining access to the in-vehicle network through this attack surface, a malicious actor can inject or alter messages on the network to try to take control of the vehicle. Traditional security mechanisms such as firewalls can only detect simple attacks and lack the ability to detect more complex ones. With the increasing complexity of vehicles, the attack surface grows, paving the way for more complex and novel attacks in the future. Thus, there is a need for an advanced attack detection solution that can actively monitor the in-vehicle network and detect complex cyber-attacks. One of the many approaches to achieve this is by using an intrusion detection system (IDS). Many state-of-the-art IDSs employ machine learning algorithms because of their ability to detect both previously observed as well as novel attack patterns. Moreover, the large availability of in-vehicle network data and the increasing computational power of ECUs to handle emerging complex automotive tasks facilitate the use of machine learning models. Therefore, due to their broad attack coverage and ability to detect complex attack patterns, we propose two novel machine learning based IDS frameworks (LATTE and TENET) for in-vehicle network anomaly detection. Our proposed LATTE framework uses sequence models, such as LSTMs, in an unsupervised setting to learn the normal system behavior. LATTE leverages the learned information at runtime to detect anomalies by observing for any deviations from the learned normal behavior. Our proposed LATTE framework aims to maximize the anomaly detection accuracy, precision, and recall while minimizing the false-positive rate. The increased complexity of automotive systems has resulted in very long-term dependencies between messages, which cannot be effectively captured by LSTMs. Hence, to overcome this problem, we propose a novel IDS framework called TENET. TENET employs a novel temporal convolutional neural attention (TCNA) based architecture to effectively learn very long-term dependencies between messages in an in-vehicle network during the training phase, and leverages the learned information in combination with a decision tree classifier to detect anomalous messages. Our work aims to efficiently detect a multitude of attacks in the in-vehicle network with low memory and computational overhead on the ECU.
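The deviation-based detection rule at the heart of such unsupervised IDS approaches can be sketched as follows: a model trained only on normal traffic predicts the next message values, and a message is flagged when the prediction error exceeds a threshold calibrated on normal data. The last-value-hold predictor and the mean-plus-k-sigma threshold below are placeholders for illustration, not the actual LATTE model.

    import numpy as np

    # Sketch of the deviation-based detection rule used by unsupervised IDS
    # approaches: predict the next message values from a window of past ones and
    # flag the message if the prediction error exceeds a threshold calibrated on
    # normal traffic. The predictor and threshold rule are placeholders.

    def calibrate_threshold(normal_errors, k=3.0):
        return float(np.mean(normal_errors) + k * np.std(normal_errors))

    def is_anomalous(window, observed_next, threshold):
        predicted_next = window[-1]              # stand-in for a trained sequence model
        error = float(np.linalg.norm(observed_next - predicted_next))
        return error > threshold

    threshold = calibrate_threshold(np.array([0.10, 0.12, 0.09, 0.11]))
    window = np.array([[0.20, 0.50], [0.21, 0.52]])
    print(is_anomalous(window, np.array([0.90, 0.10]), threshold))   # -> True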
Item Open Access Cloud Computing cost and energy optimization through Federated Cloud SoS (Colorado State University. Libraries, 2017)
Biran, Yahav, author; Collins, George J., advisor; Pasricha, Sudeep, advisor; Young, Peter, committee member; Borky, John M., committee member; Zimmerle, Daniel J., committee member

The two most significant differentiators amongst contemporary Cloud Computing service providers are increased green energy use and datacenter resource utilization. This work addresses these two issues from a system's architectural optimization viewpoint.

The proposed approach herein allows multiple cloud providers to utilize their individual computing resources in three ways: (1) cutting the number of datacenters needed, (2) scheduling available datacenter grid energy via aggregators to reduce costs and power outages, and (3) utilizing, where appropriate, more renewable and carbon-free energy sources. Altogether, our proposed approach creates an alternative paradigm for a Federated Cloud SoS. The proposed paradigm employs a novel control methodology that is tuned to obtain both financial and environmental advantages. It also supports dynamic expansion and contraction of computing capabilities for handling sudden variations in service demand as well as for maximizing usage of time-varying green energy supplies. Herein we analyze the core SoS requirements, concept synthesis, and functional architecture with an eye on avoiding inadvertent cascading conditions, and suggest a physical architecture that simulates the primary SoS emergent behavior to diminish unwanted outcomes while encouraging desirable results. In our approach, the constituent cloud services retain their independent ownership, objectives, funding, and sustainability means. The work also analyzes optimal computing generation methods and optimal energy utilization for computing, as well as a procedure for building optimal datacenters using a unique hardware computing system design based on the openCompute community as an illustrative collaboration platform. Finally, the research concludes with the security features a cloud federation must support to protect its constituents, its constituents' tenants, and itself from security risks.

Item Open Access Design and optimization of emerging interconnection and memory subsystems for future manycore architectures (Colorado State University. Libraries, 2018)
Thakkar, Ishan G., author; Pasricha, Sudeep, advisor; Bohm, Wim, committee member; Jayasumana, Anura, committee member; Lear, Kevin, committee member

With ever-increasing core counts and the growing performance demand of modern data-centric applications (e.g., big data and internet-of-things (IoT) applications), energy-efficient and low-latency memory accesses and data communications (on and off the chip) are becoming essential for emerging manycore computing systems. Unfortunately, due to their poor scalability, state-of-the-art electrical interconnects and DRAM-based main memories are projected to exacerbate the latency and energy costs of memory accesses and data communications. Recent advances in silicon photonics, 3D stacking, and non-volatile memory technologies have enabled the use of cutting-edge interconnection and memory subsystems, such as photonic interconnects, 3D-stacked DRAM, and phase change memory. These innovations have the potential to enhance the performance and energy-efficiency of future manycore systems. However, despite the benefits in performance and energy-efficiency, these emerging interconnection and memory subsystems still face many technology-specific challenges along with process, environment, and workload variabilities, which negatively impact their reliability overheads and implementation feasibility.
For instance, with recent advances in silicon photonics, photonic networks-on-chip (PNoCs) and core-to-memory photonic interfaces have emerged as scalable communication fabrics to enable high-bandwidth, energy-efficient, and low-latency data communications in emerging manycore systems. However, these interconnection subsystems still face many challenges due to thermal and process variations, crosstalk noise, aging, data-snooping Hardware Trojans (HTs), and high overheads of laser power generation, coupling, and distribution, all of which negatively impact reliability, security, and energy-efficiency. Along the same lines, with the advent of through-silicon via (TSV) technology, 3D-stacked DRAM architectures have emerged as small-footprint main memory solutions with relatively low per-access latency and energy costs. However, the full potential of the 3D-stacked DRAM technology remains untapped due to thermal- and scaling-induced data instability, high leakage, and high refresh rate problems along with other challenges related to 3D floorplanning and power integrity. Recent advances have also enabled Phase Change Memory (PCM) as a leading technology that can alleviate the leakage and scalability shortcomings of DRAM. But asymmetric write latency and low endurance of PCM are major challenges for its widespread adoption as main memory in future manycore systems. My research has contributed several solutions that overcome a multitude of these challenges and improve the performance, energy-efficiency, security, and reliability of manycore systems integrating photonic interconnects and emerging memory (3D-stacked DRAM and phase change memory) subsystems. The main contribution of my thesis is a framework for the design and optimization of emerging interconnection and memory subsystems for future manycore computing systems. The proposed framework synergistically integrates layer-specific enhancements towards the design and optimization of emerging main memory, PNoC, and inter-chip photonic interface subsystems. In addition to subsystem-specific enhancements, we also combine enhancements across subsystems to more aggressively improve the performance, energy-efficiency, and reliability of future manycore architectures.

Item Open Access Design and synthesis of hybrid nanophotonic-electric network-on-chip architectures (Colorado State University. Libraries, 2014)
Bahirat, Shirish, author; Pasricha, Sudeep, advisor; Bohm, Wim, committee member; Chen, T. W., committee member; Siegel, H. J., committee member

With increasing application complexity and improvements in CMOS process technology, chip multiprocessors (CMPs) with tens to hundreds of cores on a chip are today becoming a reality. Networks on Chip (NoCs) have emerged as a scalable communication fabric that can support high bandwidth communications in such massively parallel multi-core systems. However, traditional electrical NoC implementations today face significant challenges due to high data transfer latencies, low throughput, and high power dissipation. Silicon nanophotonics on a chip has recently been proposed to overcome the limitations of electrical wires. However, designing and optimizing hybrid electro-photonic NoCs requires complex trade-offs and overcoming many design challenges such as thermal tuning, power, and crossing loss overheads. In this thesis, these challenges are addressed by proposing novel hybrid electro-photonic NoC architectures and novel hybrid NoC synthesis frameworks for emerging CMPs.
The proposed hybrid electro-photonic NoC architectures are designed for waveguide-based and free-space-based silicon nanophotonics implementations. These architectures are optimized for low cost, power, and area overhead, support dynamic reconfiguration to adapt to changing runtime traffic requirements, and have been adapted for both 2D and 3D CMPs. The proposed synthesis frameworks utilize various optimization algorithms such as evolutionary techniques, linear programming, and custom heuristics to perform rapid design space exploration of hybrid electro-photonic (2D and 3D) NoC architectures and trade off performance and power objectives. Experimental results indicate a strong motivation to consider the proposed architectures for future CMPs, with several orders of magnitude reduction in power consumption and improvements in network throughput and access latencies, compared to traditional electrical 2D and 3D NoC architectures. Compared to other previously proposed hybrid electro-photonic NoC architectures, the proposed architectures are also shown to have lower photonic area overhead, power consumption, and energy-delay product, while maintaining competitive throughput and latency. Unlike any prior work to date, our synthesis frameworks allow further tuning and customization of our proposed architectures to meet designer-specific goals. Together, the architectural and synthesis framework contributions bring the promise of silicon nanophotonics in future massively parallel CMPs closer to reality.

Item Open Access Energy- and thermal-aware resource management for heterogeneous high-performance computing systems (Colorado State University. Libraries, 2016)
Oxley, Mark, author; Siegel, H. J., advisor; Pasricha, Sudeep, advisor; Maciejewski, Anthony A., committee member; Whitley, Darrell, committee member

Today's high-performance computing (HPC) systems face the issue of balancing electricity (energy) use and performance. Rising energy costs are forcing system operators to either operate within an energy budget or to reduce energy use as much as possible while still maintaining performance-based service agreements. Energy-aware resource management is one method for solving such problems. Resource management in the context of high-performance computing refers to the process of assigning and scheduling workloads to resources (e.g., compute nodes). Because the cooling systems in HPC facilities also consume a considerable amount of energy, it is important to consider the computer room air conditioning (CRAC) units as a controllable resource and to study the relationship (and energy consumption impact) between the computing and cooling systems. In this thesis, we present four primary contributing studies with differing environments and novel techniques designed for each of those environments. Each study proposes new ideas in the field of energy- and thermal-aware resource management for heterogeneous HPC systems. Our first contribution explores the problem of assigning a collection of independent tasks ("bag-of-tasks") to a heterogeneous HPC system in an energy-aware manner, where task execution times vary. We propose two new measures that consider these uncertainties with respect to makespan and energy: makespan-robustness and energy-robustness. We design resource management heuristics to either: (a) maximize makespan-robustness within an energy-robustness constraint, or (b) maximize energy-robustness within a makespan-robustness constraint.
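One way to read a measure such as makespan-robustness is as the probability that a resource allocation still meets a makespan target despite uncertain task execution times. The Monte Carlo sketch below estimates such a probability under an assumed uniform execution-time uncertainty; the sampling model and numbers are illustrative, not the thesis's exact definitions.

    import random

    # Monte Carlo estimate of a makespan-robustness-style probability: how often
    # a fixed task-to-machine assignment meets a makespan target when each task's
    # execution time is perturbed. The uniform +/-20% uncertainty model and the
    # numbers are illustrative assumptions.

    def makespan_robustness(machine_loads, makespan_target, trials=10_000):
        hits = 0
        for _ in range(trials):
            sampled = [sum(t * random.uniform(0.8, 1.2) for t in tasks)
                       for tasks in machine_loads]           # perturb each task time
            hits += max(sampled) <= makespan_target
        return hits / trials

    # machine_loads: per-machine lists of nominal task execution times (seconds).
    print(makespan_robustness([[10, 12], [9, 9, 5], [20]], makespan_target=24.0))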
Our next contribution studies a rate-based environment where task execution rates are assigned to compute cores within the HPC facility. The performance measure in this study is the reward rate earned for executing tasks. We analyze the impact that co-location interference (i.e., the performance degradation experienced when tasks are simultaneously executing on cores that share memory resources) has on the reward rate. Novel heuristics are designed that maximize the reward rate under power and thermal constraints, considering the interactions between both computing and cooling systems. As part of the third contribution, we design new techniques for a geographical load distribution problem. That is, our proposed techniques intelligently distribute the workload to data centers located in different geographical regions that have varying energy prices and amounts of renewable energy available. The novel techniques we propose use knowledge of co-location interference, thermal models, varying energy prices, and available renewable energy at each data center to minimize monetary energy costs while ensuring all tasks in the workload are completed. Our final contribution is a new energy- and thermal-aware runtime framework designed to maximize the reward earned from completing individual tasks by their deadlines within energy and thermal constraints. Thermal-aware resource management strategies often consult thermal models to intelligently determine which cores in the HPC facility to assign workloads to. However, the time required to perform the thermal model calculations can be prohibitive in a runtime environment. Therefore, we propose a novel offline-assisted online resource management technique where the online resource manager uses information obtained from offline-generated solutions to help in its thermal-aware decision making.

Item Embargo Energy-aware workload management for geographically distributed data centers (Colorado State University. Libraries, 2023)
Hogade, Ninad, author; Pasricha, Sudeep, advisor; Siegel, Howard Jay, committee member; Maciejewski, Anthony, committee member; Anderson, Chuck, committee member

Cloud service providers are distributing data centers globally to reduce operating costs while also improving the quality of service by using intelligent cloud management strategies. The development of time-of-use electricity pricing and renewable energy source models has provided the means to reduce high cloud operating costs through intelligent geographical workload distribution. However, neglecting essential considerations such as data center cooling power, interference effects from workload co-location in servers, net-metering, peak demand pricing of electricity, data transfer costs, and data center queueing delay has led to sub-optimal results in prior work, because these factors have a significant impact on cloud operating costs, performance, and carbon emissions. This dissertation presents a series of critical research studies addressing the vital issues of energy efficiency, carbon emissions reductions, and operating cost optimization in geographically distributed data centers. It scrutinizes different approaches to workload management, considering the diverse, dynamic, and complex nature of these environments. Starting from an exploration of energy cost minimization through sophisticated workload management techniques, the research extends to integrate network awareness into the problem, acknowledging data transfer costs and queuing delays.
These works employ mathematical and game theoretic optimization to find effective solutions. Subsequently, a comprehensive survey of state-of-the-art Machine Learning (ML) techniques utilized in cloud management is discussed. Then, the dissertation turns to Deep Reinforcement Learning (DRL) based optimization for efficient management of cloud resources and workloads. Finally, the study culminates in a novel game-theoretic DRL method, incorporating non-cooperative game theory principles to optimize the distribution of AI workloads, considering energy costs, data transfer costs, and carbon footprints. The dissertation holds significant implications for sustainable and cost-effective cloud data center workload management.

Item Open Access GPU accelerated cone based shooting bouncing ray tracing (Colorado State University. Libraries, 2019)
Troksa, Blake A., author; Notaros, Branislav, advisor; Pasricha, Sudeep, advisor; Chitsaz, Hamidreza, committee member

Ray tracing can be used as an alternative method to solve complex Computational Electromagnetics (CEM) problems that would require significant time using traditional full-wave CEM solvers. Ray tracing is considered a high-frequency asymptotic solver, sacrificing accuracy for speed via approximation. Two prominent categories of ray tracing exist today: image theory techniques and ray launching techniques. Image theory involves the calculation of image points for each continuous plane within a structure. Ray launching ray tracing consists of spawning rays in numerous directions and tracking the intersections these rays have with the environment. While image theory ray tracing typically provides more accurate solutions than ray launching techniques, due to more exact computations, image theory is much slower than ray launching techniques because of the exponential time complexity of the algorithm. This work discusses a ray launching technique called shooting bouncing rays (SBR) ray tracing that applies NVIDIA graphics processing units (GPUs) to achieve significant performance benefits for solving CEM problems. The GPUs are used as a tool to parallelize the core ray tracing algorithm and also to provide access to the NVIDIA OptiX ray tracing application programming interface (API), which efficiently traces rays within complex structures. The algorithm presented enables quick and efficient simulations to optimize the placement of communication nodes within complex structures. The processes and techniques used in developing the solver are presented, along with demonstrations of its validation, its application to various structures, and its comparison to commercially available ray tracing software.

Item Open Access Hardware-software codesign of silicon photonic AI accelerators (Colorado State University. Libraries, 2024)
Sunny, Febin P., author; Pasricha, Sudeep, advisor; Nikdast, Mahdi, advisor; Chen, Haonen, committee member; Malaiya, Yashwant K., committee member

Machine learning applications have become increasingly prevalent over the past decade across many real-world use cases, from smart consumer electronics to automotive, healthcare, cybersecurity, and language processing. This prevalence has been fueled by the emergence of powerful machine learning models, such as Deep Neural Networks (DNNs), Convolutional Neural Networks (CNNs), and Recurrent Neural Networks (RNNs).
As researchers explore deeper models with higher connectivity, the computing power and memory requirements necessary to train and utilize them also increase. Such increasing complexity also necessitates that the underlying hardware platform consistently deliver better performance while satisfying strict power constraints. Unfortunately, the limited performance-per-watt of today's computing platforms, such as general-purpose CPUs, GPUs, and electronic neural network (NN) accelerators, creates significant challenges for the growth of new deep learning and AI applications. These electronic computing platforms face fundamental limits in the post-Moore's Law era due to increased ohmic losses and capacitance-induced latencies in interconnects, as well as power inefficiencies and reliability concerns that reduce yields and increase costs with semiconductor-technology scaling. A solution to improving performance-per-watt for AI model processing is to explore more efficient hardware NN accelerator platforms. Silicon photonics has shown promise in terms of achievable energy efficiency and latency for data transfers. It is also possible to use photonic components to perform computation, e.g., matrix-vector multiplication. Such photonics-based AI accelerators can not only address the fan-in and fan-out problem with linear algebra processors, but their operational bandwidth can approach the photodetection rate (typically in the hundreds of GHz), which is orders of magnitude higher than electronic systems today that operate at a clock rate of a few GHz. A solution to the data-movement bottleneck can be the use of silicon photonics technology for photonic networks-on-chip (PNoCs), which can enable ultra-high bandwidth, low latency, and energy-efficient communication.

However, to ensure reliable, efficient, and high throughput communication and computation using photonics, several challenges must be addressed first. Photonic computation is performed in the analog domain, which makes it susceptible to various noise sources and drives down the achievable resolution for representing NN model parameters. To increase the reliability of silicon photonic AI accelerators, fabrication-process variation (FPV), which is the change in physical dimensions and characteristics of devices due to imperfections in fabrication, must be addressed. FPVs induce resonant wavelength shifts that must be compensated in order for the microring resonators (MRs), the fundamental devices used to realize photonic computation and communication in our proposed accelerator architectures, to operate correctly. Without this correction, FPVs will cause increased crosstalk and data corruption during photonic communication and can also lead to errors during photonic computation. Accordingly, the correction for FPVs is an essential part of reliable computation in silicon photonic-based AI accelerators. Even with FPV-resilient silicon photonic devices, the tuning latency incurred by thermo-optic (TO) tuning and the thermal crosstalk it can induce are significant. The latency, which can be in the microsecond range, impacts the overall throughput of the accelerator, and the thermal crosstalk impacts its reliable operation. At the architectural level, it is also necessary to ensure that NN processing is done efficiently while making use of the available photonic resources in terms of wavelengths, and NN model-aware decisions in terms of device deployment, arrangement, and multiply-and-accumulate (MAC) unit design have to be made.
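A toy numerical model helps convey why FPV-induced resonance shifts matter for MR-based analog computation. The Lorentzian transmission profile, the weight-to-detuning mapping, and all numbers below are simplified assumptions for illustration only; they are not the device models used in this thesis.

    import numpy as np

    # Toy model only: assumed Lorentzian MR transmission and an assumed
    # weight-to-detuning mapping, used to show how FPV-induced resonance shifts
    # perturb an MR-encoded dot product away from its ideal value.

    def mr_transmission(detune_nm, fwhm_nm=0.1):
        return 1.0 / (1.0 + (2.0 * detune_nm / fwhm_nm) ** 2)

    def detune_for_weight(w, fwhm_nm=0.1):
        # Detuning at which the ideal transmission equals the weight w in (0, 1].
        return 0.5 * fwhm_nm * np.sqrt(1.0 / np.clip(w, 1e-3, 1.0) - 1.0)

    def photonic_dot(weights, inputs, fpv_sigma_nm=0.0, seed=0):
        rng = np.random.default_rng(seed)
        detune = detune_for_weight(np.asarray(weights, dtype=float))
        detune = detune + rng.normal(0.0, fpv_sigma_nm, detune.shape)  # FPV shift
        return float(np.dot(mr_transmission(detune), inputs))

    w, x = [0.9, 0.5, 0.1], [1.0, 1.0, 1.0]
    print(photonic_dot(w, x), photonic_dot(w, x, fpv_sigma_nm=0.02))   # ideal vs. shifted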
To address these challenges, the major contributions of this thesis are focused on proposing a hardware-software co-design framework to enable high throughput, low latency, and energy-efficient AI acceleration across various neural network models, using silicon photonics. At the architectural level, we have proposed wavelength reuse schemes, vector decomposition, and NN-aware MAC unit designs for increased efficiency in laser power consumption. In terms of NN-aware designs, we have proposed layer-specific acceleration units, photonic batch normalization folding, and fine-grained sparse NN acceleration units. To tackle the reliability challenges introduced by FPV, we have performed device-level design-space exploration and optimization to design MRs that are more tolerant to FPVs than the state-of-the-art efforts in this area. We also adapt Thermal Eigen-mode decomposition and have devised various novel techniques to manage thermal and spectral crosstalk sources, allowing our silicon photonic-based AI accelerators to reach up to 16-bit parameter resolution per MR, which enables high accuracy for most NN models.

Item Open Access Heterogeneous prioritization for network-on-chip based multi-core systems (Colorado State University. Libraries, 2013)
Pimpalkhute, Tejasi, author; Pasricha, Sudeep, advisor; Bohm, Wim, committee member; Jayasumana, Anura, committee member

In chip multi-processor (CMP) systems, communication and memory access both play an important role in influencing the performance achievable by the system. The manner in which network packets (on-chip cache requests/responses) and off-chip memory-bound packets are handled, in a multi-core environment with several applications executing in parallel, determines end-to-end latencies across the network and memory. Several techniques have been proposed in the past that schedule packets in either an application-aware manner or memory requests in a DRAM row/bank locality-aware manner. Prioritization of memory requests is a major factor in increasing the overall system throughput. Moreover, with the increasing diversity in CMP systems, applying the same prioritization rules to all packets traversing the NoC, as is done in current implementations, may no longer be a viable approach. In this thesis, a holistic framework is proposed that integrates novel prioritization techniques for both network and memory accesses and operates cohesively in an application-aware and memory-aware manner to optimize overall system performance. The application-aware technique performs fine-grained classification of applications with a newly proposed ranking scheme. Two novel memory-prioritization algorithms are also proposed, one of which is specifically tuned for high-speed memories. Upon analyzing the fairness issues that arise in a multi-core environment, a novel strategy is proposed and employed system-wide to ensure fairness in the system. The proposed heterogeneous prioritization framework is validated using a detailed cycle-accurate full system event-driven simulator and shows significant improvement over Round Robin and other recently proposed network and memory prioritization techniques.
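As a minimal illustration of application-aware packet prioritization at a router arbiter, the sketch below ranks applications and then picks the highest-priority waiting packet. The specific ranking criterion (favoring applications that inject fewer packets) is an assumption borrowed from the general application-aware prioritization literature, not necessarily the ranking scheme proposed in this thesis.

    # Sketch of application-aware arbitration at an NoC router. The ranking rule
    # (favor applications with lower packet injection rates) is an assumption for
    # illustration, not necessarily the thesis's proposed ranking scheme.

    def rank_applications(injection_rates):
        # Lower injection rate -> higher priority (smaller rank value).
        order = sorted(injection_rates, key=injection_rates.get)
        return {app: rank for rank, app in enumerate(order)}

    def arbitrate(waiting_packets, app_rank):
        # waiting_packets: list of (app_id, enqueue_cycle); ties broken by age.
        return min(waiting_packets, key=lambda p: (app_rank[p[0]], p[1]))

    ranks = rank_applications({"A": 0.02, "B": 0.30, "C": 0.10})
    print(arbitrate([("B", 5), ("A", 9), ("C", 7)], ranks))            # -> ('A', 9)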
Item Open Access High performance and energy efficient shared hybrid last level cache architecture in multicore systems (Colorado State University. Libraries, 2018)
Bhosale, Swapnil, author; Pasricha, Sudeep, advisor; Roy, Sourajeet, committee member; Bohm, Wim, committee member

As the performance gap between the CPU and main memory continues to increase, it poses a significant roadblock to exascale computing. Memory performance has not kept up with CPU performance and is becoming a bottleneck today, particularly due to the advent of data-intensive applications. To accommodate the vast amount of data required by these applications, the emerging non-volatile memory technology STTRAM (Spin-Transfer Torque Random Access Memory) is a good candidate to replace or augment SRAM in last-level cache (LLC) memory because of its high capacity, good scalability, and low power consumption. However, its expensive write operations prevent it from becoming a universal memory candidate. In this thesis, we propose an SRAM-STTRAM hybrid last level cache (LLC) architecture that consumes less energy and performs better than SRAM-only and STTRAM-only LLCs. We design an algorithm to reduce write operations to the STTRAM region of the hybrid LLC and consequently minimize the write energy of STTRAM. Compared to two prior state-of-the-art techniques, our proposed technique achieves 29.23% and 5.94% total LLC energy savings and 6.863% and 0.407% performance improvement for various SPLASH2 and PARSEC parallel benchmarks.
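The general idea of steering write traffic away from STTRAM can be sketched as a write-intensity-aware placement policy: blocks that accumulate writes beyond a threshold are kept in the small SRAM region. The threshold and the demotion rule below are illustrative assumptions, not the algorithm proposed in the thesis.

    # Sketch of a write-intensity-aware placement policy for a hybrid SRAM/STTRAM
    # LLC: blocks that accumulate writes beyond a threshold are kept in the small
    # SRAM region so that costly STTRAM writes are avoided. The threshold and the
    # demotion rule are illustrative assumptions.

    WRITE_THRESHOLD = 4

    class HybridLLC:
        def __init__(self, sram_ways=2):
            self.sram_ways = sram_ways
            self.sram, self.sttram, self.write_counts = set(), set(), {}

        def on_write(self, block):
            self.write_counts[block] = self.write_counts.get(block, 0) + 1
            if block not in self.sram and self.write_counts[block] >= WRITE_THRESHOLD:
                if len(self.sram) >= self.sram_ways:          # demote the coldest SRAM block
                    victim = min(self.sram, key=lambda b: self.write_counts.get(b, 0))
                    self.sram.remove(victim)
                    self.sttram.add(victim)
                self.sttram.discard(block)
                self.sram.add(block)
            elif block not in self.sram:
                self.sttram.add(block)

    llc = HybridLLC()
    for _ in range(5):
        llc.on_write("hot_block")
    print("hot_block" in llc.sram)                            # -> True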
Item Open Access Indoor positioning with deep learning for mobile IoT systems (Colorado State University. Libraries, 2022)
Wang, Liping, author; Pasricha, Sudeep, advisor; Kim, Ryan, committee member; Zhao, Jianguo, committee member

The development of human-centric services with mobile devices in the era of the Internet of Things (IoT) has opened the possibility of merging indoor positioning technologies with various mobile applications to deliver stable and responsive indoor navigation and localization functionalities that can enhance user experience within increasingly complex indoor environments. But as GPS signals cannot easily penetrate modern building structures, it is challenging to build reliable indoor positioning systems (IPS). Currently, Wi-Fi sensing based indoor localization techniques are gaining in popularity as a means to build accurate IPS, benefiting from the prevalence of the 802.11 family. Wi-Fi fingerprinting based indoor localization has shown remarkable performance over geometric mapping in complex indoor environments by taking advantage of pattern matching techniques. Today, the two main types of information extracted from Wi-Fi signals to form fingerprints are Received Signal Strength Index (RSSI) and Channel State Information (CSI) with Orthogonal Frequency-Division Multiplexing (OFDM) modulation; the former can provide average localization errors of around 10 meters or less with low hardware and software requirements, while the latter is more likely to estimate locations with ultra-low distance errors but demands more resources from chipsets, firmware/software environments, etc. This thesis makes two novel contributions towards realizing viable IPS on mobile devices using RSSI and CSI information, and deep machine learning based fingerprinting. Due to the larger quantity of data and more sophisticated signal patterns needed to create fingerprints in complex indoor environments, conventional machine learning algorithms that need carefully engineered features suffer from the challenges of identifying features from very high dimensional data.

Hence, the abilities of approximation functions generated from conventional machine learning models to estimate locations are limited. Deep machine learning based approaches can overcome these challenges to realize scalable feature pattern matching approaches such as fingerprinting. However, deep machine learning models generally require a considerable memory footprint, and this creates a significant issue on resource-constrained devices such as mobile IoT devices, wearables, smartphones, etc. Developing efficient deep learning models is critical to lowering energy consumption and accelerating inference time on such resource-constrained mobile IoT devices. To address this issue, our first contribution proposes the CHISEL framework, which is a Wi-Fi RSSI-based IPS that incorporates data augmentation and compression-aware two-dimensional convolutional autoencoder neural networks (2D CAECNNs) with different pruning and quantization options. The proposed model compression techniques help reduce model deployment overheads in the IPS. Unlike RSSI, CSI takes advantage of multipath signals to potentially help indoor localization algorithms achieve a higher level of localization accuracy. The compensations for magnitude attenuation and phase shifting during wireless propagation generate different patterns that can be utilized to define the uniqueness of different locations of signal reception. However, all prior work in this domain constrains the experimental space to relatively small and rectangular rooms where the complexity of building interiors and dynamic noise from human activities, etc., are seldom considered. As part of our second contribution, we propose an end-to-end deep learning based framework called CSILoc for Wi-Fi CSI-based IPS on mobile IoT devices. The framework includes CSI data collection, clustering, denoising, calibration, and classification, and is the first study to verify the feasibility of using CSI for floor-level indoor localization with minimal knowledge of Wi-Fi access points (APs), thus avoiding security concerns during the offline data collection process.
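A minimal sketch of the kind of model used for RSSI fingerprint classification appears below: a small CNN maps an RSSI vector, reshaped into a 2D grid, to one of N reference locations. The layer sizes and input shape are assumptions, and the autoencoder-based augmentation, pruning, and quantization steps used by CHISEL are omitted.

    import torch
    import torch.nn as nn

    # Minimal sketch (assumed input shape and layer sizes) of a CNN that
    # classifies a Wi-Fi RSSI fingerprint, reshaped into a small 2D grid, into
    # one of N reference locations.

    class RSSIFingerprintCNN(nn.Module):
        def __init__(self, num_locations=100):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
            )
            self.classifier = nn.Linear(16 * 8 * 8, num_locations)

        def forward(self, x):                      # x: (batch, 1, 16, 16) RSSI grid
            return self.classifier(self.features(x).flatten(1))

    logits = RSSIFingerprintCNN()(torch.randn(4, 1, 16, 16))
    print(logits.shape)                            # torch.Size([4, 100])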
Item Open Access Machine learning techniques for energy optimization in mobile embedded systems (Colorado State University. Libraries, 2012)
Donohoo, Brad Kyoshi, author; Pasricha, Sudeep, advisor; Anderson, Charles, committee member; Jayasumana, Anura P., committee member

Mobile smartphones and other portable battery-operated embedded systems (PDAs, tablets) are pervasive computing devices that have emerged in recent years as essential instruments for communication, business, and social interactions. While performance, capabilities, and design are all important considerations when purchasing a mobile device, a long battery lifetime is one of the most desirable attributes. Battery technology and capacity have improved over the years, but they still cannot keep pace with the power consumption demands of today's mobile devices. This key limiter has led to a strong research emphasis on extending battery lifetime by minimizing energy consumption, primarily using software optimizations. This thesis presents two strategies that attempt to optimize mobile device energy consumption with negligible impact on user perception and quality of service (QoS). The first strategy proposes an application and user interaction aware middleware framework that takes advantage of user idle time between interaction events of the foreground application to optimize CPU and screen backlight energy consumption.

The framework dynamically classifies mobile device applications based on their received interaction patterns, then invokes a number of different power management algorithms to adjust processor frequency and screen backlight levels accordingly. The second strategy proposes using machine learning techniques to learn a user's mobile device usage pattern pertaining to spatiotemporal and device contexts, and then predict energy-optimal data and location interface configurations. By learning where and when a mobile device user uses certain power-hungry interfaces (3G, WiFi, and GPS), the techniques, which include variants of linear discriminant analysis, linear logistic regression, non-linear logistic regression, and k-nearest neighbor, are able to dynamically turn off unnecessary interfaces at runtime in order to save energy.
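Of those techniques, the k-nearest-neighbor variant is the simplest to sketch: given a history of (location, time-of-day) samples labeled with the interface configuration that turned out to be energy-optimal, the predicted configuration for a new context is the majority label among its nearest neighbors. The feature encoding, distance scaling, and k value below are illustrative assumptions.

    import math
    from collections import Counter

    # k-nearest-neighbor sketch for predicting an energy-optimal interface
    # configuration from spatiotemporal context; the feature encoding, distance
    # scaling, and k value are assumptions for illustration.

    def knn_predict(history, query, k=5):
        """history: list of ((lat, lon, hour_of_day), config_label) samples."""
        def dist(a, b):
            return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
                             + ((a[2] - b[2]) / 24.0) ** 2)   # crude hour scaling
        nearest = sorted(history, key=lambda s: dist(s[0], query))[:k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]

    history = [((40.57, -105.08, 9), "wifi_on_gps_off")] * 3 + \
              [((40.41, -105.10, 18), "wifi_off_gps_on")] * 2
    print(knn_predict(history, (40.56, -105.08, 10)))          # -> wifi_on_gps_off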
Item Open Access Minimizing energy costs for geographically distributed heterogeneous data centers (Colorado State University. Libraries, 2018)
Hogade, Ninad, author; Pasricha, Sudeep, advisor; Siegel, Howard Jay, advisor; Burns, Patrick J., committee member

The recent proliferation and associated high electricity costs of distributed data centers have motivated researchers to study energy-cost minimization at the geo-distributed level. The development of time-of-use (TOU) electricity pricing models and renewable energy source models has provided the means for researchers to reduce these high energy costs through intelligent geographical workload distribution. However, neglecting important considerations such as data center cooling power, interference effects from task co-location in servers, net-metering, and peak demand pricing of electricity has led to sub-optimal results in prior work, because these factors have a significant impact on energy costs and performance. In this thesis, we propose a set of workload management techniques that take a holistic approach to the energy minimization problem for geo-distributed data centers. Our approach considers detailed data center cooling power, co-location interference, TOU electricity pricing, renewable energy, net metering, and peak demand pricing distribution models. We demonstrate the value of utilizing such information by comparing against geo-distributed workload management techniques that possess varying amounts of system information. Our simulation results indicate that our best proposed technique is able to achieve a 61% (on average) cost reduction compared to state-of-the-art prior work.

Item Open Access Perception architecture exploration for automotive cyber-physical systems (Colorado State University. Libraries, 2022)
Dey, Joydeep, author; Pasricha, Sudeep, advisor; Jayasumana, Anura, committee member; Wyndom, Brett, committee member

In emerging autonomous and semi-autonomous vehicles, accurate environmental perception by automotive cyber-physical platforms is critical for achieving safety and driving performance goals. An efficient perception solution capable of high fidelity environment modeling can improve Advanced Driver Assistance System (ADAS) performance and reduce the number of lives lost to traffic accidents as a result of human driving errors. Enabling robust perception for vehicles with ADAS requires solving multiple complex problems related to the selection and placement of sensors, object detection, and sensor fusion. Current methods address these problems in isolation, which leads to inefficient solutions.

For instance, there is an inherent accuracy versus latency trade-off between one-stage and two-stage object detectors, which makes selecting an enhanced object detector from a diverse range of choices difficult. Further, even if a perception architecture were equipped with an ideal object detector performing high-accuracy and low-latency inference, the relative position and orientation of the selected sensors (e.g., cameras, radars, lidars) determine whether static or dynamic targets are inside the field of view of each sensor or in the combined field of view of the sensor configuration. If the combined field of view is too small or contains redundant overlap between individual sensors, important events and obstacles can go undetected. Conversely, if the combined field of view is too large, the number of false positive detections will be high in real time, and appropriate sensor fusion algorithms are required for filtering. Sensor fusion algorithms also enable tracking of non-ego vehicles in situations where traffic is highly dynamic or there are many obstacles on the road. Position and velocity estimation using sensor fusion algorithms has a lower margin for error when trajectories of other vehicles in traffic are in the vicinity of the ego vehicle, as incorrect measurements can cause accidents. Due to the various complex inter-dependencies between design decisions, constraints, and optimization goals, building a framework capable of synthesizing perception solutions for automotive cyber-physical platforms is not trivial. We present a novel perception architecture exploration framework for automotive cyber-physical platforms capable of global co-optimization of deep learning and sensing infrastructure. The framework is capable of exploring the synthesis of heterogeneous sensor configurations towards achieving vehicle autonomy goals. As our first contribution, we propose a novel optimization framework called VESPA that explores the design space of sensor placement locations and orientations to find the optimal sensor configuration for a vehicle. We demonstrate how our framework can obtain optimal sensor configurations for heterogeneous sensors deployed across two contemporary real vehicles. We then utilize VESPA to create a comprehensive perception architecture synthesis framework called PASTA. This framework enables robust perception for vehicles with ADAS by jointly addressing the selection and placement of sensors, object detection, and sensor fusion. Experimental results with the Audi-TT and BMW Minicooper vehicles show how PASTA can intelligently traverse the perception design space to find robust, vehicle-specific solutions.
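A simplified 2D sketch of how a candidate sensor configuration can be scored during such a design-space search is shown below: the score is the fraction of sampled points around the ego vehicle that fall inside at least one sensor's field of view. The sensor parameters, sampling grid, and planar geometry are illustrative assumptions, not the models used by VESPA or PASTA.

    import math

    # Simplified 2D sketch of scoring a candidate sensor configuration by the
    # fraction of sampled points around the ego vehicle covered by at least one
    # sensor's field of view; sensor parameters and geometry are assumptions.

    def covered(point, sensor):
        sx, sy, yaw_deg, fov_deg, range_m = sensor
        dx, dy = point[0] - sx, point[1] - sy
        bearing = math.degrees(math.atan2(dy, dx)) - yaw_deg
        bearing = (bearing + 180.0) % 360.0 - 180.0            # wrap to [-180, 180]
        return math.hypot(dx, dy) <= range_m and abs(bearing) <= fov_deg / 2.0

    def coverage_score(sensors, samples):
        return sum(any(covered(p, s) for s in sensors) for p in samples) / len(samples)

    sensors = [(2.0, 0.0, 0.0, 120.0, 60.0),      # front camera: x, y, yaw, FoV, range
               (-2.0, 0.0, 180.0, 150.0, 20.0)]   # rear radar
    samples = [(x, y) for x in range(-30, 31, 5) for y in range(-15, 16, 5)]
    print(coverage_score(sensors, samples))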
Item Open Access RELAX: cross-layer resource management for reliable NoC-based 2D and 3D manycore architectures in the dark silicon era (Colorado State University. Libraries, 2019)
Raparti, Venkata Yaswanth, author; Pasricha, Sudeep, advisor; Jayasumana, Anura, committee member; Bohm, Willem, committee member; Kim, Ryan, committee member

Emerging 2D and 3D chip-multiprocessors (CMPs) are facing numerous challenges due to technology scaling that impact their reliability, power dissipation, performance, and security. With growing parallelism in applications and increasing core counts, traditional resource management frameworks and critical on-chip components such as networks-on-chip (NoC) and memory controllers (MCs) do not scale well enough to efficiently cope with this new and complex CMP design space.

Several phenomena are affecting the reliability of CMPs. For instance, device-level phenomena such as Bias Temperature Instability (BTI) and Electromigration (EM) lead to permanent faults due to aging in CMOS logic and memory cells in the computing cores and NoC routers of CMPs. Simultaneously, alpha particle strikes (soft errors) and power supply noise (PSN) impacts lead to transient faults across CMP components. There have been several attempts to address these challenges at the circuit and micro-architectural levels, such as guard-banding and over-provisioning of resources in the CMP. However, with increasing complexity in the architecture of today's CMPs, mechanisms to overcome these challenges at the circuit and microarchitectural levels alone incur large overheads in power and performance. Hence, there is a need for a system-level solution that utilizes control knobs from different layers and manages CMP reliability at runtime to efficiently minimize the adverse effects of these failure mechanisms while meeting performance and power constraints. The network-on-chip (NoC) has become the de facto communication fabric in CMP architectures. There are different types of NoC topologies and architectures that are tailored for different CMP platforms based on their communication demands. The most used topology is the 2D/3D mesh-based NoC with a deadlock-free turn-model based routing scheme, as it has been demonstrated to scale well with increasing core counts. However, with unprecedented reliability and security challenges in CMPs designed at advanced nanometer-scale technology nodes, basic turn-model routing has proven insufficient to provide seamless communication between cores and other on-chip components. This demands a more reliable NoC solution in 2D and 3D CMPs. Another critical criterion while designing a CMP is NoC throughput and power consumption in CMPs with integrated manycore accelerators. Manycore accelerator platforms operate on thousands of threads, with hundreds of thread blocks executing several kernels simultaneously. The volume of core-to-memory data generated in accelerators is very high compared to a traditional CPU processor. This leads to congestion at memory controllers that demands a high-bandwidth NoC with high power and area overheads, which is not scalable as the number of cores in the accelerator increases. High volumes of read reply data in manycore accelerator platforms necessitate intelligent memory scheduling along with a low-latency NoC to resolve the memory bottleneck issue. Mechanisms to overcome these challenges require complex architectures across the CMP interconnection fabric that are designed and integrated at various global locations. Unfortunately, such global fabrication of CMP processors makes them vulnerable to security threats due to hardware Trojans that may be inserted in third-party (3PIP) NoCs. We address these issues by designing a cross-layer resource management framework called RELAX that enhances the performance and security of NoC-based 2D and 3D CMPs, while meeting a diverse set of platform constraints related to the lifetime of the CMP, dark silicon power, fault tolerance, thermal limits, and real-time application performance. At the OS level, we have developed several techniques such as a lifetime-aware application mapping heuristic, adaptive application degree of parallelism (DoP), slack-aware checkpointing, and aging-aware NoC path allocation.
At the system level, we propose dynamic voltage scheduling (DVS) and a low-power checkpointing mechanism to meet the dark silicon power and application deadline constraints. At the architectural level, we introduce several novel upgrades to the architectures of NoC routers, memory controllers (MCs), and network interfaces (NIs) to improve the performance of NoC-based CMPs while minimizing power dissipation and mitigating security threats from hardware Trojans.

Item Open Access Reliability-aware and energy-efficient system level design for networks-on-chip (Colorado State University. Libraries, 2015)
Zou, Yong, author; Pasricha, Sudeep, advisor; Roy, Sourajeet, committee member; Chen, Tom, committee member; Bohm, Wim, committee member

With CMOS technology aggressively scaling into the ultra-deep sub-micron (UDSM) regime and application complexity growing rapidly in recent years, processors today are being driven to integrate multiple cores on a chip. Such chip multiprocessor (CMP) architectures offer unprecedented levels of computing performance for highly parallel emerging applications in the era of digital convergence. However, a major challenge facing the designers of these emerging multicore architectures is the increased likelihood of failure due to the rise in transient, permanent, and intermittent faults caused by a variety of factors that are becoming more and more prevalent with technology scaling. On-chip interconnect architectures are particularly susceptible to faults that can corrupt transmitted data or prevent it from reaching its destination. Reliability concerns in UDSM nodes have in part contributed to the shift from traditional bus-based communication fabrics to network-on-chip (NoC) architectures that provide better scalability, performance, and utilization than buses. In this thesis, to overcome potential faults in NoCs, my research began by exploring fault-tolerant routing algorithms. Under the constraint of deadlock freedom, we make use of the inherent redundancy in NoCs due to multiple paths between packet sources and sinks, and propose different fault-tolerant routing schemes to achieve much better fault tolerance capabilities than possible with traditional routing schemes. The proposed schemes also use replication opportunistically to optimize the balance between energy overhead and arrival rate. As 3D integrated circuit (3D-IC) technology with wafer-to-wafer bonding has recently been proposed as a promising candidate for future CMPs, we also propose a fault-tolerant routing scheme for 3D NoCs which outperforms the existing popular routing schemes in terms of energy consumption, performance, and reliability. To quantify reliability and provide different levels of intelligent protection, for the first time, we propose the network vulnerability factor (NVF) metric to characterize the vulnerability of NoC components to faults. NVF determines the probabilities that faults in NoC components manifest as errors in the final program output of the CMP system. With NVF-aware partial protection for NoC components, almost 50% of the energy cost can be saved compared to the traditional approach of comprehensively protecting all NoC components. Lastly, we focus on the problem of fault-tolerant NoC design, which involves many NP-hard sub-problems such as core mapping, fault-tolerant routing, and fault-tolerant router configuration.
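A vulnerability factor of this kind can be estimated in the spirit of fault-injection sampling, as sketched below: a fault is injected into a randomly chosen cycle of a component's activity trace and counted if it lands on architecturally live data. The trace format and the all-or-nothing masking model are placeholder assumptions, not the NVF methodology of the thesis.

    import random

    # Sketch of estimating an NVF-style vulnerability factor by fault-injection
    # sampling: inject a fault at a random cycle of a component's activity trace
    # and count it when it lands on architecturally live data. The trace format
    # and masking model are placeholder assumptions.

    def estimate_nvf(activity_trace, trials=10_000, rng=random.Random(0)):
        """activity_trace: per-cycle flags marking cycles in which the component
        carries live data (a fault injected then is assumed to propagate)."""
        propagated = sum(activity_trace[rng.randrange(len(activity_trace))]
                         for _ in range(trials))
        return propagated / trials

    # A router buffer holding live packets 30% of the time -> NVF near 0.3.
    print(estimate_nvf([True] * 30 + [False] * 70))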
We propose a novel design-time (RESYN) and a hybrid design-time and runtime (HEFT) synthesis framework to trade off energy consumption and reliability in the NoC fabric at the system level for CMPs. Together, our research in fault-tolerant NoC routing, reliability modeling, and reliability-aware NoC synthesis substantially enhances NoC reliability and energy-efficiency beyond what is possible with traditional approaches and state-of-the-art strategies from prior work.