Browsing by Author "Rajopadhye, Sanjay, advisor"
Now showing 1 - 20 of 23
Item Open Access A GPU accelerated RNA-RNA interaction program (Colorado State University. Libraries, 2021) Gildemaster, Brandon, author; Rajopadhye, Sanjay, advisor; Chitsaz, Hamidreza, committee member; Abdo, Zaid, committee member
RNA-RNA interaction (RRI) is important in processes like gene regulation, and is known to play roles in diseases including cancer and Alzheimer's. Large RRI computations run for days, weeks or even months, because the algorithms have time and space complexity of, respectively, O(N³M³) and O(N²M²) for sequences of length N and M, and there is a need for high-throughput RRI tools. GPU parallelization of such algorithms is a challenge. We first show that the most computationally expensive part of base pair maximization (BPM) algorithms comprises O(N³) instances of upper banded tropical matrix products. We develop the first GPU library for this computation, attaining close to the theoretical machine peak (TMP). We next optimize the other (fifth-degree polynomial) terms in the computation and develop the first GPU implementation of the complete BPMax algorithm. We attain 12% of GPU TMP, a significant speedup over the original parallel CPU implementation, which attains less than 1% of CPU TMP. We also perform a large-scale study of three small viral RNAs, hypothesized to be relevant to COVID-19.

Item Open Access Accelerating the BPMax algorithm for RNA-RNA interaction (Colorado State University. Libraries, 2021) Mondal, Chiranjeb, author; Rajopadhye, Sanjay, advisor; Pouchet, Louis-Noel, committee member; Betten, Anton, committee member
RNA-RNA interactions (RRIs) are essential in many biological processes, including gene transcription, translation, and localization. They play a critical role in diseases such as cancer and Alzheimer's. An RNA-RNA interaction program uses dynamic programming to predict secondary structure, and suffers very high computation times. Its high complexity (Θ(N³M³) in time and Θ(N²M²) in space) makes parallelization both essential and challenging. RRI programs are mostly developed and optimized by hand, which is error-prone and costly to develop and maintain. This thesis presents the parallelization of an RRI program, BPMax, on a single shared-memory CPU platform. From a mathematical specification of the dynamic programming algorithm, we generate highly optimized code that achieves over 100× speedup over the baseline program that uses a standard 'diagonal-by-diagonal' execution order. We achieve 100 GFLOPS, about a fourth of our platform's peak theoretical single-precision performance for max-plus computation. The main kernel in the algorithm, whose complexity is Θ(N³M³), attains 186 GFLOPS. We do this with a polyhedral code generation tool, AlphaZ, which takes user-specified mapping directives and automatically generates optimized C code that enhances parallelism and locality. AlphaZ allows the user to explore various schedules, memory maps, parallelization approaches, and tilings of the most dominant part of the computation.
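The base pair maximization recurrence that these two theses accelerate is easiest to see in its classic single-sequence form (the Nussinov algorithm), which has the same max-plus, diagonal-by-diagonal structure. The sketch below is that simplified analogue in C, not BPMax itself (which works on two interacting sequences); all names in it are illustrative.

```c
#include <stdio.h>
#include <string.h>

#define N 16
static int S[N][N];   /* S[i][j]: max base pairs in subsequence i..j */

static int can_pair(char a, char b) {
    return (a=='A'&&b=='U') || (a=='U'&&b=='A') ||
           (a=='G'&&b=='C') || (a=='C'&&b=='G');
}
static int max2(int a, int b) { return a > b ? a : b; }

int main(void) {
    const char *rna = "GGGAAAUCC";
    int n = (int)strlen(rna);
    /* Diagonal-by-diagonal order: interval length grows outward.
       The inner max-plus combination of subinterval scores is what
       generalizes to the tropical matrix products mentioned above. */
    for (int len = 2; len <= n; ++len)
        for (int i = 0; i + len - 1 < n; ++i) {
            int j = i + len - 1;
            int best = S[i][j-1];                    /* j left unpaired */
            for (int k = i; k < j; ++k)              /* j pairs with k  */
                if (can_pair(rna[k], rna[j]))
                    best = max2(best, (k > i ? S[i][k-1] : 0)
                                      + S[k+1][j-1] + 1);
            S[i][j] = best;
        }
    printf("max base pairs: %d\n", S[0][n-1]);
    return 0;
}
```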
Item Open Access Analytical cost metrics: days of future past (Colorado State University. Libraries, 2019) Prajapati, Nirmal, author; Rajopadhye, Sanjay, advisor; Böhm, Wim, committee member; Chong, Edwin, committee member; Pouchet, Louis-Noël, committee member
Future exascale high-performance computing (HPC) systems are expected to be increasingly heterogeneous, consisting of several multi-core CPUs and a large number of accelerators, special-purpose hardware that will increase the computing power of the system in a very energy-efficient way. Specialized, energy-efficient accelerators are also an important component in many diverse systems beyond HPC: gaming machines, general-purpose workstations, tablets, phones, and other media devices. With Moore's law driving the evolution of hardware platforms towards exascale, the dominant performance metric (time efficiency) has now expanded to also incorporate power/energy efficiency. This work builds analytical cost models for metrics such as time, energy, memory accesses, and silicon area. These models are used to predict application performance, to guide performance tuning, and to inform chip design. The idea is to work with domain-specific accelerators, where analytical cost models can be used accurately for performance optimization. The performance optimization problems are formulated as mathematical optimization problems. This work explores the analytical cost modeling and mathematical optimization approach in a few ways. For stencil applications and GPU architectures, analytical cost models are developed for execution time as well as energy. The models are used for performance tuning over existing architectures, and are coupled with silicon area models of GPU architectures to generate highly efficient architecture configurations. For matrix chain products, analytical closed-form solutions for off-chip data movement are built and used to minimize the total data movement cost of a minimum-op-count tree.

Item Open Access Automatic creation of tile size selection models using neural networks (Colorado State University. Libraries, 2010) Yuki, Tomofumi, author; Rajopadhye, Sanjay, advisor; Anderson, Charles, committee member; Casterella, Gretchen, committee member; Strout, Michelle, committee member
Tiling is a widely used loop transformation for exposing and exploiting parallelism and data locality. Effective use of tiling requires selection and tuning of the tile sizes. This is usually achieved by hand-crafting tile size selection (TSS) models that characterize the performance of the tiled program as a function of tile sizes. The best tile sizes are selected either by directly using the TSS model or by using the TSS model together with an empirical search. Hand-crafting accurate TSS models is hard, and adapting them to a different architecture/compiler, or even keeping them up to date with respect to the evolution of a single compiler, is often just as hard. Instead of hand-crafting TSS models, can we automatically learn or create them? In this work, we show that for a specific class of programs, fairly accurate TSS models can be created automatically using a combination of simple program features, synthetic kernels, and standard machine learning techniques. The automatic TSS model generation scheme can also be used directly for adapting the model and/or keeping it up to date. We evaluate our scheme on six different architecture-compiler combinations (chosen from three different architectures and four different compilers). The models learned by our method have consistently shown near-optimal performance (within 5% of the optimal on average) across the tested architecture-compiler combinations.
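To make the "model plus empirical search" idea of tile size selection concrete, here is a minimal hand-written sketch (not the authors' tooling): it times a tiled matrix multiplication for a few candidate tile sizes and keeps the fastest, which is the brute-force baseline that an analytical or learned TSS model tries to replace or prune.

```c
#include <stdio.h>
#include <time.h>

#define N 512
static double A[N][N], B[N][N], C[N][N];
static int min2(int a, int b) { return a < b ? a : b; }

/* Run one candidate configuration: square tiles of size T. */
static double run_tiled(int T) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) { C[i][j] = 0.0; A[i][j] = 1.0; B[i][j] = 2.0; }
    clock_t t0 = clock();
    for (int ii = 0; ii < N; ii += T)
        for (int kk = 0; kk < N; kk += T)
            for (int jj = 0; jj < N; jj += T)
                for (int i = ii; i < min2(ii + T, N); ++i)
                    for (int k = kk; k < min2(kk + T, N); ++k)
                        for (int j = jj; j < min2(jj + T, N); ++j)
                            C[i][j] += A[i][k] * B[k][j];
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void) {
    int candidates[] = {8, 16, 32, 64, 128};
    int best = candidates[0];
    double best_t = 1e30;
    for (int c = 0; c < 5; ++c) {
        double t = run_tiled(candidates[c]);
        printf("T=%3d: %.3f s\n", candidates[c], t);
        if (t < best_t) { best_t = t; best = candidates[c]; }
    }
    printf("selected tile size: %d\n", best);
    return 0;
}
```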
Item Open Access Automatic parallelization of "inherently" sequential nested loop programs (Colorado State University. Libraries, 2011) Zou, Yun, author; Rajopadhye, Sanjay, advisor; Strout, Michelle, committee member; Bohm, A. P. Willem, committee member; Breidt, F. Jay, committee member
Most automatic parallelizers are based on the detection of independent operations, and most of them can do nothing when there is a true dependence between operations. However, there exists a class of programs for which this can be surmounted based on the nature of the operations. The standard and obvious cases are reductions and scans, which normally occur within loops. Existing work that deals with complicated reductions and scans normally focuses on the formalism, not the implementation. To help close the gap between formalism and implementation, we present a method for automatically parallelizing such "inherently" sequential programs. Our method is based on exact dependence analysis in the polyhedral model, and we formulate the problem as detecting whether the loop body performs a computation that is equivalent to a matrix multiplication over a semiring. It handles both single loops and arbitrarily nested loops, and also deals with mutually dependent variables in the loop. Our scan detection is implemented in a polyhedral program transformation and code generation system (AlphaZ) and used to generate OpenMP code. We also present optimization strategies to help improve the performance of the generated code. Experiments on examples demonstrate the scalability of programs parallelized by our implementation.
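To see why a true dependence need not prevent parallelization: the recurrence x[i] = a[i]*x[i-1] + b[i] looks inherently sequential, but the pairs (a[i], b[i]) compose associatively, so the whole loop is a scan that a two-pass blocked schedule parallelizes. Below is a hand-written OpenMP sketch of that idea (my illustration, not the AlphaZ-generated code).

```c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* x[i] = a[i]*x[i-1] + b[i]. Composing (a1,b1) then (a2,b2) gives
   (a2*a1, a2*b1 + b2), an associative operation, so the recurrence
   is a scan over these pairs and parallelizes in two passes. */
int main(void) {
    int n = 1 << 20;
    double *a = malloc(n * sizeof *a), *b = malloc(n * sizeof *b);
    double *x = malloc(n * sizeof *x);
    for (int i = 0; i < n; ++i) { a[i] = 0.999; b[i] = 1.0; }

    int P = omp_get_max_threads();
    double *blkA = malloc(P * sizeof *blkA);  /* per-block composed a */
    double *blkB = malloc(P * sizeof *blkB);  /* per-block composed b */
    int chunk = (n + P - 1) / P;

    #pragma omp parallel num_threads(P)
    {
        int p = omp_get_thread_num();
        int lo = p * chunk, hi = lo + chunk > n ? n : lo + chunk;
        /* Pass 1: compose this block's transformation of the incoming x. */
        double ca = 1.0, cb = 0.0;
        for (int i = lo; i < hi; ++i) { cb = a[i]*cb + b[i]; ca *= a[i]; }
        blkA[p] = ca; blkB[p] = cb;
        #pragma omp barrier
        #pragma omp single
        {   /* Sequential exclusive scan over the P block summaries. */
            double pa = 1.0, pb = 0.0;
            for (int q = 0; q < P; ++q) {
                double ta = blkA[q], tb = blkB[q];
                blkA[q] = pa; blkB[q] = pb;   /* transform before block q */
                pb = ta*pb + tb; pa *= ta;
            }
        }   /* implicit barrier */
        /* Pass 2: replay the block with the correct incoming value.
           Starting from x = 0, the value entering block p is blkB[p]. */
        double xin = blkB[p];
        for (int i = lo; i < hi; ++i) { xin = a[i]*xin + b[i]; x[i] = xin; }
    }
    printf("x[n-1] = %f\n", x[n-1]);
    free(a); free(b); free(x); free(blkA); free(blkB);
    return 0;
}
```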
Item Open Access Automating the derivation of memory allocations for acceleration of polyhedral programs (Colorado State University. Libraries, 2024) Ferry, Corentin, author; Rajopadhye, Sanjay, advisor; Derrien, Steven, advisor; Wilson, Jesse, committee member; Pasricha, Sudeep, committee member; McClurg, Jedidiah, committee member; Sadayappan, Ponnuswamy, committee member; de Dinechin, Florent, committee member; Collange, Caroline, committee member
As processors' compute power keeps increasing, so do their demands on memory: some computations require higher bandwidth and exhibit regular memory access patterns, while others require lower access latency and exhibit random access patterns. To cope with all of these demands, memory technologies are becoming diverse. It is then necessary to adapt both programs and hardware accelerators to the memory technology they use. Notably, memory access patterns and memory layouts have to be optimized. Manual optimization can be extremely tedious and does not scale to a large number of processors and memories, so automation becomes necessary. In this Ph.D. dissertation, we propose several automated methods to derive data layouts from programs, notably for FPGA accelerators. We focus on getting the best throughput from high-latency, high-bandwidth memories and, for all kinds of memories, the lowest redundancy while preserving contiguity. To this effect, we introduce mathematical analyses to partition the data flow of a program with uniform and affine dependence patterns, and propose memory layouts and automation techniques to obtain optimized FPGA accelerators.

Item Open Access Beyond shared memory loop parallelism in the polyhedral model (Colorado State University. Libraries, 2013) Yuki, Tomofumi, author; Rajopadhye, Sanjay, advisor; Böhm, Wim, committee member; Strout, Michelle M., committee member; Chong, Edwin, committee member
With the introduction of multi-core processors, motivated by power and energy concerns, parallel processing has become mainstream. Parallel programming is much more difficult because of its non-deterministic nature and the bugs that arise from non-determinacy. One solution is automatic parallelization, where it is entirely up to the compiler to efficiently parallelize sequential programs. However, automatic parallelization is very difficult, and only a handful of successful techniques are available, even after decades of research. Automatic parallelization for distributed memory architectures is even more problematic, in that it requires explicit handling of data partitioning and communication. Since data must be partitioned among multiple nodes that do not share memory, the original memory allocation of sequential programs cannot be directly used. One of the main contributions of this dissertation is the development of techniques for generating distributed memory parallel code with parametric tiling. Our approach builds on important contributions to the polyhedral model, a mathematical framework for reasoning about program transformations. We show that many affine control programs can be uniformized using only simple techniques. Being able to assume uniform dependences significantly simplifies distributed memory code generation, and also enables parametric tiling. Our approach is implemented in the AlphaZ system, a system for prototyping analyses, transformations, and code generators in the polyhedral model. The key features of AlphaZ are memory re-allocation and explicit representation of reductions. We evaluate our approach on a collection of polyhedral kernels from the PolyBench suite, and show that our approach scales as well as PLuTo, a state-of-the-art shared memory automatic parallelizer using the polyhedral model. Automatic parallelization is only one approach to dealing with the non-deterministic nature of parallel programming, one that leaves the difficulty entirely to the compiler. Another approach is to develop novel parallel programming languages. These languages, such as X10, aim to provide a highly productive parallel programming environment by including parallelism in the language design. However, even in these languages, parallel bugs remain an important issue that hinders programmer productivity. Another contribution of this dissertation is to extend array dataflow analysis to handle a subset of X10 programs. We apply the results of the dataflow analysis to statically guarantee determinism. Providing static guarantees can significantly increase programmer productivity by catching questionable implementations at compile time, or even while programming.
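A toy illustration of what uniformization means here (my example, not from the dissertation): a broadcast read of b[j] by every iteration i is an affine, non-uniform dependence; routing the value through a propagation array turns every dependence into a nearest-neighbor one, which distributed tiled execution can satisfy with purely local communication.

```c
#include <stdio.h>

#define N 8
int main(void) {
    double b[N], B[N][N], C1[N][N], C2[N][N];
    for (int j = 0; j < N; ++j) b[j] = (double)j;

    /* Version 1: affine (broadcast) dependence, every i reads b[j]. */
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            C1[i][j] = (i > 0 ? C1[i-1][j] : 0.0) + b[j];

    /* Version 2: uniformized. B[i][j] carries b[j] from row to row,
       so every read is from the immediate neighbor (i-1, j). */
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            B[i][j] = (i == 0) ? b[j] : B[i-1][j];   /* propagation */
            C2[i][j] = (i > 0 ? C2[i-1][j] : 0.0) + B[i][j];
        }

    /* Both versions compute the same values. */
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            if (C1[i][j] != C2[i][j]) { printf("mismatch\n"); return 1; }
    printf("uniformized version matches broadcast version\n");
    return 0;
}
```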
Item Open Access Detection of linear algebra operations in polyhedral programs (Colorado State University. Libraries, 2016) Iooss, Guillaume, author; Rajopadhye, Sanjay, advisor; Alias, Christophe, advisor; Darte, Alain, advisor; Clauss, Philippe, committee member; Sankaranarayanan, Sriram, committee member; Thomassé, Stephan, committee member; Chitsaz, Hamid, committee member; Mueller, Jennifer, committee member
Writing code that uses an architecture to its full capability has become increasingly difficult in recent years. For some key operations, a dedicated accelerator or a finely tuned implementation exists and delivers the best performance. Thus, when compiling a program, identifying these operations and issuing calls to their high-performance implementations is attractive. In this dissertation, we focus on the problem of detecting these operations. We propose a framework that detects linear algebra subcomputations within a polyhedral program. The main idea of this framework is to partition the computation in order to isolate different subcomputations in a regular manner; we then consider each portion of the computation and try to recognize it as a combination of linear algebra operations. We perform the partitioning of the computation by using a program transformation called monoparametric tiling. This transformation partitions the computation into blocks whose shape is some homothetic scaling of a fixed-size partitioning. We show that the tiled program remains polyhedral while allowing a limited amount of parametrization: a single size parameter. This is an improvement over previous work on tiling, which forced us to choose between these two properties. Then, in order to recognize computations, we introduce a template recognition algorithm, built on a state-of-the-art program equivalence algorithm. We also propose several extensions to handle certain semantic properties. Finally, we combine these two contributions into a framework that detects linear algebra subcomputations. Part of this framework is a library of templates based on the BLAS specification. We demonstrate our framework on several applications.
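The payoff of such recognition, sketched by hand (assuming a CBLAS installation; link with -lcblas or -lopenblas): a triple loop whose body matches the GEMM template is replaced by a call to the tuned library routine.

```c
#include <stdio.h>
#include <cblas.h>

#define N 64
static double A[N*N], B[N*N], C[N*N];

int main(void) {
    for (int i = 0; i < N*N; ++i) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    /* The loop nest a recognizer would match against a GEMM template:
     *   for (i) for (j) for (k) C[i][j] += A[i][k] * B[k][j];
     * Once matched, it is replaced by the equivalent tuned call: */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N, 1.0, A, N, B, N, 1.0, C, N);

    printf("C[0][0] = %f\n", C[0]);   /* expect 2*N = 128 */
    return 0;
}
```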
Item Open Access Extending and validating the stencil processing unit (Colorado State University. Libraries, 2016) Rajasree, Revathy, author; Rajopadhye, Sanjay, advisor; Pasricha, Sudeep, committee member; Anderson, Charles W., committee member
Stencils are an important class of programs that appear in the core of many scientific and general-purpose applications. These compute-intensive kernels can benefit heavily from the massive compute power of accelerators like the GPGPU. However, due to the absence of any form of on-chip communication between the coarse-grain processors on a GPU, any data transfer or synchronization between dependent tiles in stencil computations has to happen through off-chip (global) memory, which is quite energy-expensive. On the road to exascale computing, energy is becoming an important cost metric. The need for hardware and software that can work collaboratively to reduce a system's energy consumption is becoming increasingly important. To make the execution of dense stencils more energy-efficient, Rajopadhye et al. proposed the Stencil Processing Unit (SPU), a GPGPU-based accelerator that introduces simple neighbor-to-neighbor communication between the Streaming Multiprocessors (SMs) on the GPU, thereby allowing some restricted data sharing between consecutive thread blocks. The SPU includes special storage units, called Communication Buffers, to orchestrate this data transfer, and also provides an explicit mechanism for inter-thread-block synchronization by way of a special instruction. It claims to achieve energy efficiency, compared to GPUs, by reducing the number of off-chip accesses in stencils, which in turn reduces the dynamic energy overhead. Uguen developed a cycle-accurate performance simulator for the SPU, called SPU-Sim, and evaluated it using a matrix multiplication kernel, which was not well suited to this accelerator. This work focuses on extending SPU-Sim and evaluating the SPU architecture using a more insightful benchmark. We introduce a producer-consumer based inter-block synchronization approach on the SPU, which is more efficient than the previous global synchronization, and an overlapped multi-pass execution model in the SPU runtime system. These optimizations have been implemented in SPU-Sim. Furthermore, the existing GPUWattch power model in the simulator has been refined to provide better power estimates for the SPU architecture. The improved architecture has been evaluated using a simple 2-D stencil benchmark, and we observe an average of 16% savings in dynamic energy on the SPU compared to a closely matched GPU platform. Nonetheless, the total energy consumption on the SPU is still comparatively high due to the static energy component. This high static energy is a direct consequence of the increased leakage power resulting from the inclusion of the special load/store units. Our conservative estimates indicate that replacing the current design of these load/store units with DMA engines can bring about a 15% decrease in the SPU's leakage power, which can help the SPU outperform the GPU in terms of energy.

Item Open Access Hardware implementation and design space exploration for Wave 2D and Jacobi 2D stencil computations (Colorado State University. Libraries, 2017) Chandramohan, Rajbharath, author; Rajopadhye, Sanjay, advisor; Pinaud, Oliver, committee member; Pasricha, Sudeep, committee member
Hardware accelerators are highly optimized functional blocks designed to offload specific tasks from the CPU and perform them at higher performance. We developed a hardware accelerator for the Jacobi 2D and Wave 2D algorithms, two computations with a stencil pattern. They are used in many scientific applications in acoustics, electromagnetics, and fluid dynamics. These computations have large problem sizes, and memory limitations and bandwidth constraints result in long run times on large problems. Hence, an approach that increases performance while reducing the bandwidth requirement is necessary. We developed analytical performance, bandwidth, and area models for the Wave 2D and Jacobi 2D algorithms, and solved them for the optimal solution using posynomials and the positivity property, in MATLAB and with the Excel Solver. We split the computation into two levels of tiling. The first level, called passes, consists of rectangular prisms that run through the 3-D iteration space. Each pass is mapped to a grid of processing elements (PEs) in the hardware accelerator. The second level of tiling splits the vertical prism into smaller prisms executed by a single PE. These optimizations are implemented in Verilog using Altera Quartus and simulated using ModelSim. Results from ModelSim provide an accurate model and an experimental verification of the design. We also achieved improved performance and lower bandwidth requirements.
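For reference, the Jacobi 2D computation targeted by both of these accelerator designs is, at its core, the following stencil; this is a plain C baseline of one common variant (4-neighbor average, double-buffered), before any time-space tiling or hardware mapping.

```c
#include <stdio.h>
#include <string.h>

#define N 128
#define T 100
static double A[2][N][N];   /* double buffer: time steps t and t+1 */

int main(void) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            A[0][i][j] = (i == 0) ? 100.0 : 0.0;   /* hot top edge */
    memcpy(A[1], A[0], sizeof A[0]);               /* boundaries fixed */

    for (int t = 0; t < T; ++t) {
        int in = t & 1, out = 1 - in;
        /* Jacobi update over the interior points (3-D iteration
           space: time x 2-D data space). */
        for (int i = 1; i < N-1; ++i)
            for (int j = 1; j < N-1; ++j)
                A[out][i][j] = 0.25 * (A[in][i-1][j] + A[in][i+1][j]
                                     + A[in][i][j-1] + A[in][i][j+1]);
    }
    printf("center value after %d steps: %f\n", T, A[T & 1][N/2][N/2]);
    return 0;
}
```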
Item Open Access Localized anomaly detection via hierarchical integrated activity discovery (Colorado State University. Libraries, 2014) Chockalingam, Thiyagarajan, author; Rajopadhye, Sanjay, advisor; Anderson, Chuck, advisor; Pasricha, Sudeep, committee member; Bohm, Wim, committee member
With the increasing number and variety of camera installations, unsupervised methods that learn typical activities have become popular for anomaly detection. In this thesis, we consider recent methods based on temporal probabilistic models and improve them in multiple ways. Our contributions are the following: (i) we integrate the low-level processing and the temporal activity modeling, showing how this feedback improves the overall quality of the captured information; (ii) we show how the same approach can be taken to do hierarchical multi-camera processing; (iii) we use spatial analysis of the anomalies both to perform local anomaly detection and to automatically frame the detected anomalies. We illustrate the approach on both traffic data and videos from a metro station. We also investigate the application of topic models in brain-computer interfaces for mental task classification, observing a classification accuracy of up to 68% for four mental tasks on individual subjects.

Item Open Access Max-plus matrix multiplication library for GPUs - MPMML (Colorado State University. Libraries, 2019) Ghalsasi, Prerana Prakash, author; Rajopadhye, Sanjay, advisor; Bohm, Wim, committee member; Pasricha, Sudeep, committee member
Max-plus algebra finds applications in discrete event simulations, dynamic programming, biological sequence comparisons, etc. Although there exist highly tuned libraries like the CUDA Basic Linear Algebra Subprograms (cuBLAS) [1] for matrix operations, they implement the standard (multiply-add) matrix multiplication for floating-point values. We found no standard library for max-plus matrix multiplication (MPMM) on integers. Hence, we developed a highly tuned, parallelized MPMM library kernel. We chose GPUs as the hardware platform for this work because they offer significantly more parallelism and arithmetic functional units than CPUs. We designed this kernel to be portable across three successive Nvidia GPU architectures, and it achieves performance in the range of 3065-3631 GOPS on all of these architectures. We closely followed the benchmarking approach described by Volkov et al. [2] when they contributed to cuBLAS. This MPMM kernel can be part of a max-plus algebra library for GPUs and can help speed up biological sequence comparison applications like BPMax.
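Semantically, the operation MPMML implements is just the following: GEMM with (+, *) replaced by (max, +). A naive CPU rendering in C with OpenMP is shown below for clarity; the GPU library version is heavily tiled and tuned, which is where the difficulty lies.

```c
#include <stdio.h>
#include <limits.h>
#include <omp.h>

#define N 256
#define NEG_INF (INT_MIN / 2)   /* "minus infinity" with headroom to
                                   avoid overflow when adding */
static int A[N][N], B[N][N], C[N][N];
static int max2(int a, int b) { return a > b ? a : b; }

int main(void) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) { A[i][j] = i + j; B[i][j] = i - j; }

    /* Max-plus product: C[i][j] = max_k (A[i][k] + B[k][j]).
       Same loop structure as GEMM, different semiring. */
    #pragma omp parallel for
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            int best = NEG_INF;
            for (int k = 0; k < N; ++k)
                best = max2(best, A[i][k] + B[k][j]);
            C[i][j] = best;
        }
    printf("C[0][0] = %d\n", C[0][0]);   /* max_k (k + k) = 2(N-1) */
    return 0;
}
```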
Item Open Access Optimal design space exploration for FPGA-based accelerators: a case study on 1-D FDTD (Colorado State University. Libraries, 2015) Puranik, Mugdha, author; Rajopadhye, Sanjay, advisor; Pasricha, Sudeep, committee member; Malaiya, Yashwant, committee member
Hardware accelerators are optimized functional blocks designed to offload specific tasks from the CPU, speed them up, and reduce their dynamic power consumption. It is important to develop a methodology for implementing critical algorithms efficiently on the hardware accelerator and performing systematic design space exploration to identify optimal designs. In this thesis, we design, as a case study, a hardware accelerator for the 1-D Finite Difference Time Domain (FDTD) algorithm, a compute-intensive technique for modeling electromagnetic behavior. Memory limitations and bandwidth constraints result in long run times on large problems. Hence, an approach that increases the speed of the FDTD method and reduces the bandwidth requirement is necessary. To achieve this, we design an FPGA-based hardware accelerator. We implement the accelerator based on time-space tiling. In our design, p processing elements (PEs) execute p parallelogram-shaped tiles in parallel, each of which constitutes one tile pass. Our design uses a small amount of redundant computation to enable all PEs to start "nearly" concurrently, thereby fully exploiting the available parallelism. A further optimization allows us to reduce the main memory data transfers of this design by a factor of two. These optimizations are integrated in hardware and implemented in Verilog in Altera's Quartus II, yielding a PE that delivers a throughput of one iteration (i.e., two results) per cycle. To explore the feasible design space systematically, we formulate an optimization problem with the objective of minimizing the total execution time under given resource constraints. We solve the optimization problem analytically, and therefore have a provably optimal design in the feasible space. We also observe that, for different problem sizes, the optimal design may not always match common-sense intuition.

Item Open Access Polyhedral optimizations of RNA-RNA interaction computations (Colorado State University. Libraries, 2017) Varadarajan, Swetha, author; Rajopadhye, Sanjay, advisor; Bohm, Wim, committee member; Wilson, Jesse, committee member
Studying RNA-RNA interaction has led to major successes in the treatment of some cancers, including colon, breast, and pancreatic cancer, by suppressing the gene expression involved in the development of these diseases. The problem with such programs is that they are computationally and memory intensive: O(N⁴) space and O(N⁶) time complexity. Moreover, the entire application is complicated and involves many mutually recursive data variables. We address the problem of speeding up a surrogate kernel (named OSPSQ) that captures the main dependence pattern found in two widely used RNA-RNA interaction applications, IRIS and piRNA. The structure of the OSPSQ kernel perfectly fits the constraints of the polyhedral model, a well-developed technology for optimizing codes that belong to many specialized domains. However, current state-of-the-art automatic polyhedral tools do not significantly improve the performance of the baseline OSPSQ implementation. With simple techniques like loop permutation and skewing, we achieve an average of 17× sequential and 31× parallel speedup on a standard modern multi-core platform (Intel Broadwell, E5-1650v4). This performance represents 75% and 88% of attainable single-core and multi-core L1 bandwidth. For further performance improvement, we describe how to tile all six dimensions and formulate the associated memory trade-off. In the future, we plan to implement these tiling strategies, explore the performance of the code for various tile sizes, and optimize the whole piRNA application.
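To illustrate the kind of transformation involved (a generic skewing example, not the OSPSQ kernel itself): a loop nest whose body reads both (i-1, j) and (i, j-1) carries a dependence in each loop, so neither parallelizes as written; skewing to wavefronts w = i + j makes all points on a fixed wavefront independent.

```c
#include <stdio.h>
#include <omp.h>

#define N 512
static double X[N][N];
static int max2(int a, int b) { return a > b ? a : b; }
static int min2(int a, int b) { return a < b ? a : b; }

int main(void) {
    for (int i = 0; i < N; ++i) { X[i][0] = 1.0; X[0][i] = 1.0; }

    /* Original (sequential in both loops):
       for (i = 1..N-1) for (j = 1..N-1)
           X[i][j] = 0.5 * (X[i-1][j] + X[i][j-1]);
       Skewed: for a fixed wavefront w = i + j, every (i, j = w - i)
       is independent, so the inner loop is parallel. */
    for (int w = 2; w <= 2*(N-1); ++w) {
        int ilo = max2(1, w - (N-1)), ihi = min2(N-1, w - 1);
        #pragma omp parallel for
        for (int i = ilo; i <= ihi; ++i) {
            int j = w - i;
            X[i][j] = 0.5 * (X[i-1][j] + X[i][j-1]);
        }
    }
    printf("X[N-1][N-1] = %f\n", X[N-1][N-1]);
    return 0;
}
```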
Item Open Access Reducing off-chip memory accesses of wavefront parallel programs in Graphics Processing Units (Colorado State University. Libraries, 2014) Ranasinghe, Waruna, author; Rajopadhye, Sanjay, advisor; Bohm, Wim, committee member; Oprea, Iuliana, committee member
The power wall is one of the major barriers that stands in the way of exascale computing. To break the power wall, overall system power/energy must be reduced without affecting performance. We can decrease energy consumption by designing power-efficient hardware and/or software. In this thesis, we present a software approach to lowering the energy consumption of programs targeted at Graphics Processing Units (GPUs). The main idea is to reduce energy consumption by minimizing the number of off-chip (global) memory accesses, which can be done by improving the last-level (L2) cache hit rate. A wavefront is a set of data/tiles that can be processed concurrently; a kernel is a function that is executed on the GPU. We propose a novel approach to implementing wavefront-parallel programs on GPUs. Instead of using one kernel call per wavefront, as in the traditional implementation, we use one kernel call for the whole program and organize the order of computations in such a way that L2 cache reuse is achieved. A strip of wavefronts (or a pass) is a collection of partial wavefronts. We exploit the non-preemptive behavior of the thread block scheduler to process a strip of wavefronts (i.e., a pass) instead of processing a complete wavefront at a time. The data transferred by a partial wavefront in a pass is small enough to fit in the L2 cache, so that successive partial wavefronts in the pass reuse the data in the L2 cache. Hence, the number of off-chip memory accesses is significantly reduced. We also introduce a technique to communicate and synchronize between two thread blocks without limiting the number of thread blocks per kernel or SM. This technique is used to maintain the order of wavefronts. We have analytically shown and experimentally validated the reduction in off-chip memory accesses achieved by our approach: off-chip memory reads and writes decrease by factors of 45 and 3, respectively. We have shown that if GPUs incorporated an L2 cache with a write-back policy, off-chip memory writes would also be reduced by a factor of 45. Our approach achieves 98% L2 cache read hits and 74% total cache hits, whereas the traditional approach achieves only 2% and 1%, respectively.

Item Open Access Revisiting sparse dynamic programming for the 0/1 Knapsack Problem (Colorado State University. Libraries, 2019) Sifat, Tarequl Islam, author; Rajopadhye, Sanjay, advisor; Pouchet, Louis Noel, committee member; Betten, Anton, committee member
The 0/1 Knapsack Problem is a classic NP-hard problem. There are two common approaches to obtaining the exact solution: branch-and-bound (BB) and dynamic programming (DP). A so-called "sparse" DP algorithm (SKPDP), which performs fewer operations than the standard algorithm (KPDP), is well known. To the best of our knowledge, there has been no quantitative analysis of the benefits of sparsity. We provide a careful empirical evaluation of SKPDP and observe that, for a "large enough" capacity C, the number of operations performed by SKPDP is invariant with respect to C for many problem instances. This leads to the possibility of an exponential improvement over the conventional KPDP. We experimentally explore SKPDP over a large range of knapsack problem instances and provide a detailed study of the attributes that impact performance. DP algorithms have a nice regular structure and are amenable to highly parallel implementations. However, due to the dependence structure, parallelizing SKPDP is challenging. We propose two parallelization strategies (fine-grain and coarse-grain) for SKPDP on modern multi-core processors and demonstrate a scalable improvement in performance.
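The idea behind the sparse algorithm, in a minimal form (my sketch, not the authors' implementation): instead of a dense table over all capacities 0..C, keep only the Pareto-undominated (weight, profit) states and merge each item in, pruning anything over capacity or dominated by a lighter, at-least-as-profitable state.

```c
#include <stdio.h>
#include <stdlib.h>

/* Sparse 0/1 knapsack: the state list is sorted by weight with
   strictly increasing profit (i.e., no state is dominated). */
typedef struct { int w, p; } State;

int main(void) {
    int n = 4, C = 10;
    int wt[] = {5, 4, 6, 3}, pr[] = {10, 40, 30, 50};

    State *cur = malloc((1 << n) * sizeof *cur);
    State *nxt = malloc((1 << n) * sizeof *nxt);
    int len = 1;
    cur[0] = (State){0, 0};                       /* empty knapsack */

    for (int it = 0; it < n; ++it) {
        /* Merge cur with cur shifted by (wt[it], pr[it]); the merge
           emits states in nondecreasing weight order. */
        int a = 0, b = 0, m = 0, bestp = -1;
        while (a < len || b < len) {
            State s;
            State shifted = (b < len)
                ? (State){cur[b].w + wt[it], cur[b].p + pr[it]}
                : (State){C + 1, 0};              /* sentinel */
            if (a < len && (cur[a].w < shifted.w ||
                           (cur[a].w == shifted.w && cur[a].p >= shifted.p)))
                s = cur[a++];
            else { s = shifted; b++; }
            if (s.w > C) break;                   /* all later ones heavier */
            if (s.p > bestp) { nxt[m++] = s; bestp = s.p; }  /* dominance */
        }
        State *tmp = cur; cur = nxt; nxt = tmp;
        len = m;
    }
    printf("optimal profit: %d\n", cur[len-1].p);  /* 90 for this instance */
    free(cur); free(nxt);
    return 0;
}
```

The list length is bounded by the number of undominated states, which for "large enough" C stops growing with C; that is exactly the sparsity effect the thesis quantifies.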
Item Open Access RNA secondary structure prediction using AlphaZ (Colorado State University. Libraries, 2010) Pathan, Tanveer, author; Rajopadhye, Sanjay, advisor; Pasricha, Sudeep, committee member; Bohm, A. P. Willem, committee member
Optimizing complex algorithms using conventional programming languages can be a difficult task. The performance of such implementations depends on the programmer and the target architecture, and minor alterations to the algorithm can result in a considerable amount of reprogramming effort. In our work, we experiment with equational programming using the AlphaZ tool. We illustrate our work using a fairly complex algorithm for RNA secondary structure prediction. The algorithm is extracted from the UNAFold software package, a huge C library, by identifying the time-consuming parts. It is then represented as equations, which are transformed for optimization; code is then generated, and finally the generated code is validated by plugging it back into the original C program. The algorithm, in its basic form, has complexity O(N⁴). We show that an optimized O(N³) algorithm can be derived systematically using AlphaZ, and we used the AlphaZ system to automatically generate a C program that implements this improved algorithm. Our work forms the basis for future optimizations and also acts as a case study for polyhedral equational programming in the real world.

Item Open Access Scalable and efficient tools for multi-level tiling (Colorado State University. Libraries, 2008) Renganarayana, Lakshminarayanan, author; Rajopadhye, Sanjay, advisor
In the era of many-core systems, application performance will come from parallelism and data locality. Effective exploitation of these requires explicit (re)structuring of applications. Multi-level (or hierarchical) tiling is one such structuring technique, used in almost all high-performance implementations, but lack of tool support has limited its use to program optimization experts. We present solutions to two fundamental problems in multi-level tiling, viz., optimal tile size selection and parameterized tiled loop generation. Our solutions provide scalable and efficient tools for multi-level tiling. Parameterized tiled code refers to tiled loops where the tile sizes are not (fixed) compile-time constants but are left as symbolic parameters. It enables the selection and adaptation of tile sizes across a spectrum of stages, from compilation through run time. We introduce two polyhedral sets, viz., the inset and the outset, and use them to develop a variety of scalable and efficient multi-level tiled loop generation algorithms. The generation efficiency and code quality are demonstrated on a variety of benchmarks, such as stencil computations and matrix subroutines from BLAS. Our technique can generate tiled loop nests with parameterized, fixed, or mixed tile sizes, thereby providing a one-size-fits-all solution ideal for inclusion in production compilers. Optimal tile size selection (TSS) refers to the selection of tile sizes that optimize some cost model (e.g., execution time). We show that these cost models share a fundamental mathematical property, viz., positivity, that allows us to reduce optimal TSS to convex optimization problems. Almost all TSS models proposed in the literature for parallelism, caches, and registers lend themselves to this reduction. We present the reduction of five different TSS models proposed in the literature by different authors in a variety of tiling contexts. Our convex-optimization-based TSS framework is the first to provide a solution that is both efficient and scalable to multiple levels of tiling.
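The flavor of parameterized multi-level tiled code, for the simplest possible iteration space (a hand-written sketch; the generated code for general polyhedra is where the inset/outset machinery earns its keep): both tile sizes are ordinary runtime variables, so one compiled binary serves every tile size choice.

```c
#include <stdio.h>

#define N 1000
static double x[N], y[N];
static int min2(int a, int b) { return a < b ? a : b; }

/* Two-level parameterized tiling of a 1-D loop: outer tiles of size
   t1, subdivided into inner tiles of size t2. Neither size is a
   compile-time constant; min() guards handle partial boundary tiles. */
static void saxpy_2level(int t1, int t2) {
    for (int ii = 0; ii < N; ii += t1) {
        int iend = min2(ii + t1, N);
        for (int jj = ii; jj < iend; jj += t2)
            for (int i = jj; i < min2(jj + t2, iend); ++i)
                y[i] += 2.0 * x[i];
    }
}

int main(void) {
    for (int i = 0; i < N; ++i) { x[i] = 1.0; y[i] = 0.0; }
    int t1 = 128, t2 = 16;   /* could come from argv, a model, a search */
    saxpy_2level(t1, t2);
    printf("y[N-1] = %f\n", y[N-1]);
    return 0;
}
```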
Item Open Access Some aspects of the computational complexity in the design of islanded microgrids, design and analysis of blackstart sequences for a notional microgrid (Colorado State University. Libraries, 2012) Natarajan, Sudarshan Ananda, author; Suryanarayanan, Siddharth, advisor; Rajopadhye, Sanjay, advisor; Zimmerle, Dan, committee member; Putkaradze, Vakhtang, committee member
The US grid represents more than $1 trillion in assets and serves over 100 million customers. But the grid is an aging system, built with a centralized architecture. Title XIII of the Energy Independence and Security Act of 2007 outlined the 'Smart Grid Initiative' (SGI) as official policy for modernizing the United States electricity transmission and distribution system, to improve reliability and upgrade infrastructure to meet the ever-increasing demand for electricity. The distribution feeder is the final link between the generation units and the end user. Distribution networks were traditionally designed in a radial topology, since such configurations resulted in simpler protective schemes. More recently, there has been a renewed focus on distribution feeder reconfiguration. Reconfiguration at the distribution level can be achieved by the use of switches and sectionalizers. This enables portions of the distribution network to be reconfigured dynamically to improve reliability, hence enabling control over the topological structure of the distribution feeder. A related feature is the microgrid, or islanded distributed resource (DR). The IEEE 1547.4 Standard, 'The guide for design, operation and integration of DR island systems with electric power supply', defines a microgrid as an electric power system that has the following properties:
• a local distributed resource and load;
• the ability to disconnect from and parallel with the area electric power supply;
• inclusion of the local electric power supply and portions of the area electric power supply.
A microgrid is capable of disconnecting from the main grid on sensing a disturbance, to maintain a reliable supply of electricity to its constituent end-user loads. In this work, the computational complexity of designing islanded microgrids by the optimal addition of network feeders to a legacy radial electric distribution system is identified, and a technique to accelerate the process of finding Pareto-optimal solutions to the problem is provided. The next part of this thesis models a notional microgrid with blackstart capabilities. Microgrids that cannot continue uninterrupted supply to all local loads on disconnection from the main grid are required to follow a startup sequence known as the blackstart sequence. In this work, the various generation resources in the notional microgrid are studied and a blackstart sequence is engineered.
Item Open Access The AlphaZ Verifier (Colorado State University. Libraries, 2011) Basupalli, Vamshi, author; Rajopadhye, Sanjay, advisor; Strout, Michelle, committee member; Pasricha, Sudeep, committee member
In the context of a compiler research framework, it is useful to have a tool that can certify that a proposed transformation is legal according to the semantics of the source language and a model of the target architecture (sequential, SIMD, shared memory, etc.). This thesis describes such a tool, the AlphaZ Verifier, developed for a system for parallelizing and transforming programs in Alphabets, a high-level polyhedral equational language. Our Verifier takes an Alphabets program and a proposed target mapping as input. The mapping is very general and includes a proposed space-time mapping, a memory mapping, and a tiling specification (specifically, which of the dimensions after the space-time mapping are to be tiled). The Verifier first checks whether the space-time mapping is legal, i.e., that no dependences are violated. Next, it validates the memory mapping by ensuring that no live values are overwritten before their final use. Finally, it checks tiling legality by ensuring that the proposed tiling dimensions are fully permutable. We show how all the necessary checks reduce to polyhedron emptiness checking, so that a single back-end engine can be used for each of these steps.
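As an illustration of reducing legality to emptiness, here is a sketch using the widely used isl library, which provides exactly this primitive (this is my example, not the Verifier's code): for the dependence "iteration i produces a value consumed at i+1", a schedule is legal iff the set of iterations where the consumer runs no later than the producer is empty.

```c
/* Compile with: cc verify.c $(pkg-config --cflags --libs isl) */
#include <stdio.h>
#include <isl/ctx.h>
#include <isl/set.h>

int main(void) {
    isl_ctx *ctx = isl_ctx_alloc();

    /* Proposed schedule time(i) = -i (reversed execution).
       Violations: time(i+1) <= time(i), i.e., -i - 1 <= -i,
       which holds everywhere, so the set is non-empty: illegal. */
    isl_set *violated = isl_set_read_from_str(ctx,
        "{ [i] : 0 <= i < 100 and -i - 1 <= -i }");
    printf("reversed schedule legal: %s\n",
           isl_set_is_empty(violated) ? "yes" : "no");
    isl_set_free(violated);

    /* Identity schedule time(i) = i.
       Violations would need i + 1 <= i: the set is empty, so legal. */
    isl_set *ok = isl_set_read_from_str(ctx,
        "{ [i] : 0 <= i < 100 and i + 1 <= i }");
    printf("identity schedule legal: %s\n",
           isl_set_is_empty(ok) ? "yes" : "no");
    isl_set_free(ok);

    isl_ctx_free(ctx);
    return 0;
}
```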