
Theses and Dissertations



Recent Submissions

Now showing 1 - 20 of 291
  • Item (Open Access)
    Transformer, diffusion, and GAN-based augmentations for contrastive learning of visual representations
    (Colorado State University. Libraries, 2024) Armstrong, Samuel, author; Pallickara, Sangmi, advisor; Pallickara, Shrideep, advisor; Ghosh, Sudipto, committee member; Breidt, F. Jay, committee member
    Generative modeling and self-supervised learning have emerged as two of the most prominent fields of study in machine learning in recent years. Generative models are able to learn detailed visual representations that can then be used to generate synthetic data. Modern self-supervised learning methods are able to extract high-level visual information from images in an unsupervised manner and then apply this information to downstream tasks such as object detection and segmentation. As generative models become increasingly advanced, we want to be able to extract their learned knowledge and then apply it to downstream tasks. In this work, we develop Generative Contrastive Learning (GCL), a methodology that uses contrastive learning to extract information from modern generative models. We define GCL's high-level components: an encoder, feature map augmenter, decoder, handcrafted augmenter, and contrastive learning model, and demonstrate how to apply GCL to the three major types of large generative models: GANs, Diffusion Models, and Image Transformers. Due to the complex nature of generative models and the near-infinite number of unique images they can produce, we have developed several methodologies to synthesize images in a manner that complements the augmentation-based learning that is used in contrastive learning frameworks. Our work shows that applying these large generative models to self-supervised learning can be done in a computationally viable manner without the use of large clusters of high-performance GPUs. Finally, we show the clear benefit of leveraging generative models in a contrastive learning setting using standard self-supervised learning benchmarks.
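The contrastive objective at the heart of frameworks like this can be illustrated with a small NumPy sketch of the NT-Xent loss used by SimCLR-style methods; in a GCL-style setup, the second view would come from a generative augmenter rather than a handcrafted one. The function below is illustrative, not GCL's actual implementation:

```python
import numpy as np

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent contrastive loss over two views (n, d) of the same batch.

    In a GCL-style pipeline, z2 could be the embedding of a
    generatively augmented image (hypothetical usage).
    """
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity
    sim = z @ z.T / tau
    n = z1.shape[0]
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs
    pos = np.roll(np.arange(2 * n), n)                 # positive of i is its other view
    logits = sim - sim.max(axis=1, keepdims=True)      # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()
```

Aligned views should score a lower loss than mismatched ones, which is the gradient signal that pulls augmented pairs together.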
  • Item (Open Access)
    Automating the derivation of memory allocations for acceleration of polyhedral programs
    (Colorado State University. Libraries, 2024) Ferry, Corentin, author; Rajopadhye, Sanjay, advisor; Derrien, Steven, advisor; Wilson, Jesse, committee member; Pasricha, Sudeep, committee member; McClurg, Jedidiah, committee member; Sadayappan, Ponnuswamy, committee member; de Dinechin, Florent, committee member; Collange, Caroline, committee member
    As processors' compute power keeps increasing, so do their demands on memory accesses: some computations require a higher bandwidth and exhibit regular memory access patterns, while others require a lower access latency and exhibit random access patterns. To cope with all demands, memory technologies are becoming diverse. It is then necessary to adapt both programs and hardware accelerators to the memory technology they use. Notably, memory access patterns and memory layouts have to be optimized. Manual optimization can be extremely tedious and does not scale to a large number of processors and memories, where automation becomes necessary. In this Ph.D. dissertation, we suggest several automated methods to derive data layouts from programs, notably for FPGA accelerators. We focus on getting the best throughput from high-latency, high-bandwidth memories and, for all kinds of memories, the lowest redundancy while preserving contiguity. To this end, we introduce mathematical analyses to partition the data flow of a program with uniform and affine dependence patterns, and propose memory layouts and automation techniques to obtain optimized FPGA accelerators.
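One layout transformation of the kind discussed above can be sketched in a few lines: re-laying a row-major matrix so that each tile becomes one contiguous run, the shape of layout that lets an accelerator fetch a tile with a single memory burst. This is a simplified illustration (even tile divisibility assumed), not the dissertation's derivation:

```python
import numpy as np

def to_block_layout(a, tb, tc):
    """Re-lay a row-major matrix into tile-contiguous order.

    Each (tb x tc) tile becomes one contiguous run, so a burst of
    tb*tc words fetches a whole tile. Assumes dimensions divide
    evenly (a simplification for the sketch).
    """
    r, c = a.shape
    assert r % tb == 0 and c % tc == 0
    return (a.reshape(r // tb, tb, c // tc, tc)
             .transpose(0, 2, 1, 3)      # order tiles row-major
             .reshape(-1))               # flatten: tiles back-to-back

def block_index(i, j, tb, tc, c):
    """Offset of element (i, j) inside the block layout."""
    tiles_per_row = c // tc
    tile = (i // tb) * tiles_per_row + (j // tc)
    return tile * (tb * tc) + (i % tb) * tc + (j % tc)
```

The address function `block_index` is what an automated flow would bake into the accelerator's datapath in place of the usual `i * c + j`.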
  • Item (Open Access)
    Deep learning for bioinformatics sequences: RNA basecalling and protein interactions
    (Colorado State University. Libraries, 2024) Neumann, Don, author; Ben-Hur, Asa, advisor; Beveridge, Ross, committee member; Blanchard, Nathaniel, committee member; Reddy, Anireddy, committee member
    In the interdisciplinary field of bioinformatics, sequence data for biological problems comes in many different forms. This ranges from proteins, to RNA, to the ionic current for a strand of nucleotides from an Oxford Nanopore Technologies sequencing device. This data can be used to elucidate the fundamentals of biological processes on many levels, which can help humanity with everything from drug design to curing disease. All of our research focuses on biological problems encoded as sequences. The main focus of our research involves Oxford Nanopore Technologies sequencing devices, which are capable of directly sequencing long-read RNA strands as is. We first concentrate on improving the basecalling accuracy for RNA, and have published a paper with a novel architecture achieving state-of-the-art performance. The basecalling architecture uses convolutional blocks, each with progressively larger kernel sizes, which improves accuracy for the noisy nature of the data. We then describe ongoing research into the detection of post-transcriptional RNA modifications in nanopore sequencing data. Building on our basecalling research, we are able to discern modifications with read-level resolution. Our work will facilitate research into the detection of N6-methyladenosine (m6A) while also furthering progress in the detection of other post-transcriptional modifications. Finally, we recount our recently accepted paper regarding protein-protein and host-pathogen interaction prediction. We performed experiments demonstrating the faulty experimental designs for interaction prediction that have plagued the field, giving the faulty impression that the problem has been solved. We then provide reasoning and recommendations for future work.
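The progressively-larger-kernel idea can be caricatured with a toy 1-D convolution stack: early blocks see fine-grained current fluctuations, while later, wider kernels integrate context over noisy stretches of the signal. The kernel sizes and random weights below are illustrative, not the published basecaller:

```python
import numpy as np

def conv1d(x, kernel):
    """'Same'-padded 1-D convolution (single channel, stride 1)."""
    k = len(kernel)
    pad = k // 2
    xp = np.pad(x, (pad, k - 1 - pad))
    return np.array([xp[i:i + k] @ kernel for i in range(len(x))])

def basecaller_stem(signal, kernel_sizes=(3, 9, 27), seed=0):
    """Stack of conv blocks with progressively larger kernels.

    Kernel sizes here are hypothetical; the point is only that the
    receptive field widens block by block over the noisy squiggle.
    """
    rng = np.random.default_rng(seed)
    y = signal
    for k in kernel_sizes:
        w = rng.standard_normal(k) / np.sqrt(k)  # random init for the sketch
        y = np.maximum(conv1d(y, w), 0.0)        # conv + ReLU
    return y
```

A real basecaller would follow such a stem with a sequence decoder (e.g. CTC), which this sketch omits.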
  • Item (Open Access)
    Towards fair and efficient distributed intelligence
    (Colorado State University. Libraries, 2024) Gorbett, Matt, author; Ray, Indrakshi, advisor; Shirazi, Hossein, committee member; Simske, Steve, committee member; Jayasumana, Anura, committee member
    Artificial Intelligence is rapidly advancing the modern technological landscape. Alongside this progress, the ubiquitous presence of computational devices has created unique opportunities to deploy intelligent systems in novel environments. For instance, resource-constrained machines such as IoT devices have the potential to enhance our world through the use of Deep Neural Networks (DNNs). However, modern DNNs suffer from high computational complexity and are often relegated to specialized hardware, a bottleneck which has severely limited their practical use. In this work, we contribute to addressing these issues through the use of neural network compression. We present new findings for both model quantization and pruning, two standard techniques for creating compressed and efficient DNNs. To begin, we examine the efficacy of neural network compression for time series learning, an unstudied modality in the model compression literature. We construct a generalized Transformer architecture for multivariate time series which applies both binarization and pruning to model parameters. Our results show that the lightweight models achieve comparable accuracy to dense Transformers of the same structure on time series forecasting, classification, and anomaly detection tasks while significantly reducing the computational burden. Next, we propose two novel algorithms for neural network compression: 1) Tiled Bit Networks (TBNs) and 2) Iterative Weight Recycling (IWR). TBNs present a new form of quantization to tile neural network layers with sequences of bits to achieve sub-bit compression of binary-weighted models. The method learns binary vectors (i.e., tiles) to populate each layer of a model via tensor aggregation and reshaping operations; during inference, TBNs use just a single tile per model layer.
TBNs perform well across a diverse range of architectures (CNNs, MLPs, Transformers) and tasks (classification, segmentation) while achieving up to 8x reduction in size compared to binary-weighted models. The second algorithm, IWR, generates sparse neural networks from randomly initialized models by identifying important parameters within neural networks for reuse. The approach enables us to prune 80% of ResNet50's parameters while still achieving 70.8% accuracy on ImageNet. Finally, we examine the feasibility of deploying compressed DNNs in practical applications. Specifically, we deploy Sparse Binary Neural Networks (SBNNs), TBNs, and other common compression algorithms on an embedded device for performance assessment, finding a reduction in both peak memory and storage size. By integrating algorithmic and theoretical advancements into a comprehensive end-to-end methodology, this dissertation contributes a new framework for crafting powerful and efficient deep learning models applicable in real-world settings.
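The tile idea behind TBNs can be caricatured in a few lines: approximate a layer's weights with a single repeated binary tile plus a scale, so per-layer storage drops from one bit per weight to one small tile (sub-bit on average). The real method learns the tile end-to-end; this sketch merely derives one heuristically from a given weight tensor:

```python
import numpy as np

def tile_quantize(w, tile_len=16):
    """Approximate a weight tensor with one repeated binary tile.

    Heuristic stand-in for TBN training: the 'tile' is the sign of
    the average chunk, with a single per-layer scale. Storage falls
    from w.size bits to tile_len bits (plus one float).
    """
    flat = w.reshape(-1)
    n = (len(flat) // tile_len) * tile_len
    chunks = flat[:n].reshape(-1, tile_len)
    tile = np.sign(chunks.mean(axis=0))
    tile[tile == 0] = 1.0                        # break ties toward +1
    alpha = np.abs(flat).mean()                  # per-layer scale
    approx = np.tile(alpha * tile, len(chunks))  # one tile reused per chunk
    return tile, alpha, approx.reshape(-1, tile_len)
```

With `w.size = 256` and `tile_len = 16`, the binary payload shrinks 16x relative to a fully binary-weighted layer, which is the flavor of sub-bit compression the abstract describes.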
  • Item (Open Access)
    Throughput optimization techniques for heterogeneous architectures
    (Colorado State University. Libraries, 2024) Derumigny, Nicolas, author; Pouchet, Louis-Noël, advisor; Rastello, Fabrice, advisor; Hack, Sebastian, committee member; Rohou, Erven, committee member; Malaiya, Yashwant, committee member; Ortega, Francisco, committee member; Pétrot, Frédéric, committee member; Wilson, James, committee member; Zaks, Ayal, committee member
    Over the past 40 years, Moore's Law has allowed the transistor density of integrated circuits to increase exponentially. As a result, computing devices ranging from general-purpose processors to dedicated accelerators have become increasingly complex due to the specialization and the multiplication of their compute units. Therefore, both low-level program optimization (e.g. assembly-level programming and generation) and accelerator design must solve the issue of efficiently mapping the input program computations to the various chip capabilities. However, real-world chip blueprints are not openly accessible in practice, and their documentation is often incomplete. Given the diversity of CPUs available (Intel's / AMD's / Arm's microarchitectures), we tackle in this manuscript the problem of automatically inferring a performance model applicable to fine-grain throughput optimization of regular programs. Furthermore, when order-of-magnitude performance gains over generic accelerators are needed, domain-specific accelerators must be considered, which raises the same questions about the number of dedicated units as well as their functionality. To remedy this issue, we present two complementary approaches: on one hand, the study of single-application specialized accelerators with an emphasis on hardware reuse, and, on the other hand, the generation of semi-specialized designs suited for a user-defined set of applications.
  • Item (Open Access)
    Typed synthesis of fast multiplication algorithms for post-quantum cryptography
    (Colorado State University. Libraries, 2024) Scarbro, William, author; Rajopadhye, Sanjay, advisor; McClurg, Jedidiah, committee member; Achter, Jeffrey, committee member
    Multiplication over polynomial rings is a time-consuming operation in many post-quantum cryptosystems. State-of-the-art implementations of multiplication for these cryptosystems have been developed by hand using an algebraic framework. A similar class of algorithms, based on the Discrete Fourier Transform, has been optimized across a variety of platforms using program synthesis. We demonstrate how the algebraic framework used to describe fast multiplication algorithms can be used in program synthesis. Specifically, we extend and then abstract this framework for use in program synthesis, allowing AI search techniques to find novel, high-performance implementations of polynomial ring multiplication across platforms.
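The fast multiplication algorithms in question are typically built on the number-theoretic transform (NTT), a Discrete Fourier Transform over a finite field: transform, multiply pointwise, transform back. Below is a toy instance with q = 17, ω = 2, n = 8 (real post-quantum schemes use much larger parameters and the negacyclic ring x^n + 1):

```python
def ntt(a, omega, q):
    """Number-theoretic transform, O(n^2) schoolbook version."""
    n = len(a)
    return [sum(a[j] * pow(omega, i * j, q) for j in range(n)) % q
            for i in range(n)]

def cyclic_mul(a, b, omega=2, q=17):
    """Multiply in Z_q[x]/(x^n - 1) via pointwise NTT products.

    omega must be a primitive n-th root of unity mod q; for the toy
    defaults, 2 has order 8 mod 17.
    """
    n = len(a)
    fa, fb = ntt(a, omega, q), ntt(b, omega, q)
    fc = [(x * y) % q for x, y in zip(fa, fb)]
    omega_inv = pow(omega, -1, q)   # modular inverse (Python 3.8+)
    n_inv = pow(n, -1, q)
    return [(n_inv * c) % q for c in ntt(fc, omega_inv, q)]
```

Replacing the O(n^2) loops with a butterfly recursion gives the O(n log n) kernels that the hand-tuned implementations and synthesized variants compete over.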
  • Item (Open Access)
    Pruning visual transformers to increase model compression and decrease inference time
    (Colorado State University. Libraries, 2024) Yost, James E., author; Whitley, Darrell, advisor; Ghosh, Sudipto, committee member; Betten, Anton, committee member
    We investigate the efficacy of pruning a visual transformer during training to reduce inference time while maintaining accuracy. Various training techniques were explored, including epoch-based training, fixed-time training, and training to achieve a specific accuracy threshold. Results indicate that pruning from the inception of training offers significant reductions in inference time without sacrificing model accuracy. Different pruning rates were evaluated, demonstrating a trade-off between training speed and model compression. Slower pruning rates allowed for better convergence to higher accuracy levels and more efficient model recovery. Furthermore, we examine the cost of pruning and the recovery time of pruned models. Overall, the findings suggest that early-stage pruning strategies can effectively produce smaller, more efficient models with comparable or improved performance compared to non-pruned counterparts, offering insights into optimizing model efficiency and resource utilization in AI applications.
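Pruning from the inception of training is commonly realized as a sparsity schedule applied to a magnitude mask that tightens as training proceeds; slower ramps, as the abstract notes, leave more room to converge. The polynomial schedule below is one standard choice, not necessarily the one used in this work:

```python
import numpy as np

def prune_mask(w, sparsity):
    """Keep the largest-magnitude fraction (1 - sparsity) of weights."""
    k = int(round(sparsity * w.size))
    if k == 0:
        return np.ones_like(w, dtype=bool)
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.abs(w) > thresh

def sparsity_schedule(step, total_steps, final_sparsity, rate=3.0):
    """Prune-from-step-0 schedule: sparsity ramps up from the start.

    Smaller `rate` prunes more slowly; this polynomial form is a
    common convention, offered here only as an illustration.
    """
    frac = min(step / total_steps, 1.0)
    return final_sparsity * (1.0 - (1.0 - frac) ** rate)
```

At each training step one would recompute the mask at the scheduled sparsity and zero the masked weights before the forward pass.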
  • Item (Open Access)
    Quality assessment of protein structures using graph convolutional networks
    (Colorado State University. Libraries, 2024) Roy, Soumyadip, author; Ben-Hur, Asa, advisor; Blanchard, Nathaniel, committee member; Zhou, Wen, committee member
    The prediction of protein 3D structure is essential for understanding protein function, drug discovery, and disease mechanisms; with the advent of methods like AlphaFold that are capable of producing very high quality decoys, ensuring the quality of those decoys can provide further confidence in the accuracy of their predictions. In this work we describe Qε, a graph convolutional network that utilizes a minimal set of atom and residue features as input to predict the global distance test total score (GDTTS) and local distance difference test score (lDDT) of a decoy. To improve the model's performance, we introduce a novel loss function based on the ε-insensitive loss function used for SVM-regression. This loss function is specifically designed for the characteristics of the quality assessment problem, and provides predictions with improved accuracy over standard loss functions used for this task. Despite using only a minimal set of features, it matches the performance of recent state-of-the-art methods like DeepUMQA. The code for Qε is available at
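The ε-insensitive loss borrowed from SVM regression treats errors smaller than ε as zero, so the model is penalized only for decoys whose predicted quality deviates meaningfully from the true score. A minimal sketch (the exact variant used by Qε may differ, e.g. a smoothed form):

```python
import numpy as np

def eps_insensitive_loss(pred, target, eps=0.01):
    """SVM-regression style ε-insensitive loss.

    Deviations below eps cost nothing; beyond that, the penalty
    grows linearly. eps=0.01 here is an arbitrary illustration.
    """
    return np.maximum(np.abs(pred - target) - eps, 0.0).mean()
```

With GDTTS/lDDT targets in [0, 1], ε sets the band of prediction error considered negligible for quality assessment.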
  • Item (Open Access)
    Scalable and efficient tools for multi-level tiling
    (Colorado State University. Libraries, 2008) Renganarayana, Lakshminarayanan, author; Rajopadhye, Sanjay, advisor
    In the era of many-core systems, application performance will come from parallelism and data locality. Effective exploitation of these requires explicit (re)structuring of the applications. Multi-level (or hierarchical) tiling is one such structuring technique used in almost all high-performance implementations. Lack of tool support has limited the use of multi-level tiling to program optimization experts. We present solutions to two fundamental problems in multi-level tiling, viz., optimal tile size selection and parameterized tiled loop generation. Our solutions provide scalable and efficient tools for multi-level tiling. Parameterized tiled code refers to tiled loops where the tile sizes are not (fixed) compile-time constants but are left as symbolic parameters. It can enable selection and adaptation of tile sizes across a spectrum of stages from compilation to run-time. We introduce two polyhedral sets, viz., the inset and the outset, and use them to develop a variety of scalable and efficient multi-level tiled loop generation algorithms. The generation efficiency and code quality are demonstrated on a variety of benchmarks such as stencil computations and matrix subroutines from BLAS. Our technique can generate tiled loop nests with parameterized, fixed or mixed tile sizes, thereby providing a one-size-fits-all solution ideal for inclusion in production compilers. Optimal tile size selection (TSS) refers to the selection of tile sizes that optimize some cost (e.g., execution time) model. We show that these cost models share a fundamental mathematical property, viz., positivity, that allows us to reduce optimal TSS to convex optimization problems. Almost all TSS models proposed in the literature for parallelism, caches, and registers lend themselves to this reduction. We present the reduction of five different TSS models proposed in the literature by different authors in a variety of tiling contexts.
Our convex-optimization-based TSS framework is the first to provide a solution that is both efficient and scalable to multiple levels of tiling.
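Parameterized tiled code has a characteristic shape: outer loops step by symbolic tile sizes and inner loops clamp at the iteration-space boundary, so the same nest runs for any tile sizes without recompilation. A schematic rendering of that shape (not the inset/outset-based generator itself):

```python
def tiled_iteration(n, m, tb, tc):
    """Parameterized tiled loop nest over an n x m iteration space.

    tb and tc stay symbolic until run time; min() clamps partial
    tiles at the boundary, so no code is specialized per tile size.
    """
    visited = []
    for ti in range(0, n, tb):                      # tile origins
        for tj in range(0, m, tc):
            for i in range(ti, min(ti + tb, n)):    # intra-tile points
                for j in range(tj, min(tj + tc, m)):
                    visited.append((i, j))
    return visited
```

Every point is visited exactly once for any (tb, tc), which is precisely the correctness property a parameterized loop generator must guarantee.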
  • Item (Open Access)
    Improving software maintainability through aspectualization
    (Colorado State University. Libraries, 2009) Mortensen, Michael, author; Ghosh, Sudipto, advisor; Bieman, James M., advisor
    The primary claimed benefits of aspect-oriented programming (AOP) are that it improves the understandability and maintainability of software applications by modularizing cross-cutting concerns. Before there is widespread adoption of AOP, developers need further evidence of the actual benefits as well as costs. Applying AOP techniques to refactor legacy applications is one way to evaluate costs and benefits. Aspect-based refactoring, called aspectualization, involves moving program code that implements cross-cutting concerns into aspects. Such refactoring can potentially improve the maintainability of legacy systems. Long compilation and weave times, and the lack of an appropriate testing methodology, are two challenges to the aspectualization of large legacy systems. We propose an iterative test-driven approach for creating and introducing aspects. The approach uses mock systems that enable aspect developers to quickly experiment with different pointcuts and advice, and reduce the compile and weave times. The approach also uses weave analysis, regression testing, and code coverage analysis to test the aspects. We developed several tools for unit and integration testing. We demonstrate the test-driven approach in the context of large industrial C++ systems, and we provide guidelines for mock system creation. This research examines the effects on maintainability of replacing cross-cutting concerns with aspects in three industrial applications. We study several revisions of each application, identifying cross-cutting concerns in the initial revision, and also cross-cutting concerns that are added in later revisions. Aspectualization improved maintainability by reducing code size and improving both change locality and concern diffusion. Costs include the effort required for application refactoring and aspect creation, as well as a small decrease in performance.
  • Item (Open Access)
    Exploring the bias of direct search and evolutionary optimization
    (Colorado State University. Libraries, 2008) Lunacek, Monte, author; Whitley, Darrell, advisor
    There are many applications in science that yield the following optimization problem: given an objective function, which set of input decision variables produces the largest or smallest result? Optimization algorithms attempt to answer this question by searching for competitive solutions within an application's domain. But every search algorithm has some particular bias. Our results show that search algorithms are more effective when they cope with the features that make a particular application difficult. Evolutionary algorithms are stochastic population-based search methods that are often designed to perform well on problems containing many local optima. Although this is a critical feature, the number of local optima in the search space is not necessarily indicative of problem difficulty. The objective of this dissertation is to investigate how two relatively unexplored problem features, ridges and global structure, impact the performance of evolutionary parameter optimization. We show that problems containing these features can cause evolutionary algorithms to fail in unexpected ways. For example, the condition number of a problem is one way to quantify a ridge feature. When a simple unimodal surface has a high condition number, we show that the resulting narrow ridge can make many evolutionary algorithms extremely inefficient. Some even fail. Similarly, funnels are one way of categorizing a problem's global structure. A single-funnel problem is one where the local optima are clustered together such that there exists a global trend toward the best solution. This trend is less predictable on problems that contain multiple funnels. We describe a metric that distinguishes problems based on this characteristic. Then we show that the global structure of the problem can render successful global search strategies ineffective on relatively simple multi-modal surfaces. Our proposed strategy that performs well on problems with multiple funnels is counter-intuitive.
These issues impact two real-world applications: an atmospheric science inversion model and a configurational chemistry problem. We find that exploiting ridges and global structure results in more effective solutions on these difficult real-world problems. This adds integrity to our perspective on how problem features interact with search algorithms, and more clearly exposes the bias of direct search and evolutionary algorithms.
  • Item (Open Access)
    Stability analysis of recurrent neural networks with applications
    (Colorado State University. Libraries, 2008) Knight, James N., author; Anderson, Charles W., advisor
    Recurrent neural networks are an important tool in the analysis of data with temporal structure. The ability of recurrent networks to model temporal data and act as dynamic mappings makes them ideal for application to complex control problems. Because such networks are dynamic, however, application in control systems, where stability and safety are important, requires certain guarantees about the behavior of the network and its interaction with the controlled system. Both the performance of the system and its stability must be assured. Since the dynamics of controlled systems are never perfectly known, robust control requires that uncertainty in the knowledge of systems be explicitly addressed. Robust control synthesis approaches produce controllers that are stable in the presence of uncertainty. To guarantee robust stability, these controllers must often sacrifice performance on the actual physical system. The addition of adaptive recurrent neural network components to the controller can alleviate, to some extent, the loss of performance associated with robust design by allowing adaptation to observed system dynamics. The assurance of stability of the adaptive neural control system is prerequisite to the application of such techniques. Work in [49, 2] points toward the use of modern stability analysis and robust control techniques in combination with reinforcement learning algorithms to provide adaptive neural controllers with the necessary guarantees of performance and stability. The algorithms developed in these works have a high computational burden due to the cost of the online stability analysis. Conservatism in the stability analysis of the adaptive neural components has a direct impact on the cost of the proposed system. This is due to an increase in the number of stability analysis computations that must be made. 
The work in [79, 82] provides more efficient tools for the analysis of time-varying recurrent neural network stability than those applied in [49, 2]. Recent results in the analysis of systems with repeated nonlinearities [19, 52, 17] can reduce the conservatism of the analysis developed in [79] and give an overall improvement in the performance of the on-line stability analysis. In this document, steps toward making the application of robust adaptive neural controllers practical are described. The analysis of recurrent neural network stability in [79] is not exact and reductions in the conservatism and computational cost of the analysis are presented. An algorithm is developed allowing the application of the stability analysis results to online adaptive control systems. The algorithm modifies the recurrent neural network updates with a bias away from the boundary between provably stable parameter settings and possibly unstable settings. This bias is derived from the results of the stability analysis, and its method of computation is applicable to a broad class of adaptive control systems not restricted to recurrent neural networks. The use of this bias term reduces the number of expensive stability analysis computations that must be made and thus reduces the computational complexity of the stable adaptive system. An application of the proposed algorithm to an uncertain, nonlinear, control system is provided and points toward future work on this problem that could further the practical application of robust adaptive neural control.
  • Item (Open Access)
    Decay and grime buildup in evolving object oriented design patterns
    (Colorado State University. Libraries, 2009) Izurieta, Clemente, author; Bieman, James M., advisor
    Software designs decay as systems, uses, and operational environments evolve. As software ages, the original realizations of design patterns may remain in place, while participants in design pattern realizations accumulate grime: non-pattern-related code. This research examines the extent to which software designs actually decay, rot, and accumulate grime by studying the aging of design patterns in successful object oriented systems. By focusing on design patterns we can identify code constructs that conflict with well-formed pattern structures. Design pattern rot is the deterioration of the structural integrity of a design pattern realization. Grime buildup in design patterns is a form of decay that does not break the structural integrity of a pattern but can reduce system testability and adaptability. Grime is measured using various types of indices developed and adapted for this research. Grime indices track the internal structural changes in a design pattern realization and the code that surrounds the realization. In general we find that the original pattern functionality remains, and pattern decay is primarily due to grime and not rot. We characterize the nature of grime buildup in design patterns, provide quantifiable evidence of such grime buildup, and find that grime can be classified at organizational, modular and class levels. Organizational level grime refers to namespace and physical file constitution and structure. Metrics at this level help us understand if rot and grime buildup play a role in fomenting disorganization of design patterns. Measures of modular level grime can help us to understand how the coupling of classes belonging to a design pattern develops. As dependencies between design pattern components increase without regard for pattern intent, the modularity of a pattern deteriorates. Class level grime is focused on understanding how classes that participate in design patterns are modified as systems evolve.
For each level we use different measurements and surrogate indicators to help analyze the consequences that grime buildup has on testability and adaptability of design patterns. Test cases put in place during the design phase and initial implementation of a project can become ineffective as the system matures. The evolution of a design due to added functionality or defect fixing increases the coupling and dependencies between classes that must be tested. We show that as systems age, the growth of grime and the appearance of anti-patterns (a form of decay) increase testing requirements. Additionally, evidence suggests that, as pattern realizations evolve, the levels of efferent and afferent coupling of the classifiers that participate in patterns increase. Increases in coupling measurements suggest dependencies to and from other software artifacts thus reducing the adaptability and comprehensibility of the pattern. In general we find that grime buildup is most serious at a modular level. We find little evidence of class and organizational grime. Furthermore, we find that modular grime appears to have higher impacts on testability than adaptability of design patterns. Identifying grime helps developers direct refactoring efforts early in the evolution of software, thus keeping costs in check by minimizing the effects of software aging. Long term goals of this research are to curtail the effects of decay by providing the understanding and means necessary to diminish grime buildup.
  • Item (Open Access)
    A systematic approach to testing UML designs
    (Colorado State University. Libraries, 2007) Dinh-Trong, Trung T., author; France, Robert B., advisor; Ghosh, Sudipto, advisor
    In Model Driven Engineering (MDE) approaches, developers create and refine design models from which substantial portions of implementations are generated. During refinement, undetected faults in an abstract model can traverse into the refined models, and eventually into code. Hence, finding and removing faults in design models is essential for MDE approaches to succeed. This dissertation describes a testing approach to finding faults in design models created using the Unified Modeling Language (UML). Executable forms of UML design models are exercised using generated test inputs that provide coverage with respect to UML-based coverage criteria. The UML designs that are tested consist of class diagrams, sequence diagrams and activity diagrams. The contribution of the dissertation includes (1) a test input generation technique, (2) an approach to execute design models describing sequential behavior with test inputs in order to detect faults, and (3) a set of pilot studies that are carried out to explore the fault detection capability of our testing approach. The test input generation technique involves analyzing design models under test to produce test inputs that satisfy UML sequence diagram coverage criteria. We defined a directed graph structure, named Variable Assignment Graph (VAG), to generate test inputs. The VAG combines information from class and sequence diagrams. Paths are selected from the VAG and constraints are identified to traverse the paths. The constraints are then solved with a constraint solver. The model execution technique involves transforming each design under test into an executable form, which is exercised with the generated inputs. Failures are reported if the observed behavior differs from the expected behavior. We proposed an action language, named Java-like Action Language (JAL), that supports the UML action semantics. We developed a prototype tool, named UMLAnT, that performs test execution and animation of design models. 
We performed pilot studies to evaluate the fault detection effectiveness of our approach. Mutation faults and commonly occurring faults in UML models created by students in our software engineering courses were seeded in three design models. Ninety percent of the seeded faults were detected using our approach.
  • Item (Open Access)
    A vector model of trust to reason about trustworthiness of entities for developing secure systems
    (Colorado State University. Libraries, 2008) Chakraborty, Sudip, author; Ray, Indrajit, advisor; Ray, Indrakshi, advisor
    Security services rely to a great extent on some notion of trust. In all security mechanisms there is an implicit notion of trustworthiness of the involved entities. Security technologies like cryptographic algorithms, digital signatures, and access control mechanisms provide confidentiality, integrity, authentication, and authorization, thereby allowing some level of 'trust' in other entities. However, these techniques provide only a restrictive (binary) notion of trust and do not suffice to express the more general concept of 'trustworthiness'. For example, a digitally signed certificate does not tell whether there is any collusion between the issuer and the bearer. In fact, without a proper model and mechanism to evaluate and manage trust, it is hard to enforce trust-based security decisions. Therefore there is a need for a more generic model of trust. However, even today, there is no accepted formalism for specifying and reasoning with trust. Secure systems are built under the premise that concepts like "trustworthiness" or "trusted" are well understood, without agreeing on what "trust" means, what constitutes trust, how to measure it, how to compare or compose two trusts, and how a computed trust can help to make a security decision.
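A vector model replaces the binary trusted/untrusted flag with multiple components that can be compared and composed; the component names, weights, and weighted-sum collapse below are purely illustrative (the thesis defines its own operators), but they show the basic shape of a non-binary trust decision:

```python
def combine_trust(experience, knowledge, recommendation,
                  weights=(0.5, 0.3, 0.2)):
    """Collapse a hypothetical trust vector into a scalar for a policy check.

    Components range over [-1, 1] (-1 full distrust, +1 full trust).
    Both the component set and the fixed weighted sum are sketch
    assumptions, not the model developed in the dissertation.
    """
    parts = (experience, knowledge, recommendation)
    assert all(-1.0 <= p <= 1.0 for p in parts)
    return sum(w * p for w, p in zip(weights, parts))
```

A policy could then grant access only when the combined value exceeds a context-dependent threshold, rather than on a signed certificate alone.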
  • ItemOpen Access
    Assessing vulnerabilities in software systems: a quantitative approach
    (Colorado State University. Libraries, 2007) Alhazmi, Omar, author; Malaiya, Yashwant K., advisor; Ray, Indrajit, advisor
    Security and reliability are two of the most important attributes of complex software systems. It is now common to use quantitative methods for evaluating and managing reliability. Software assurance requires a similar quantitative assessment of software security; however, only limited work has been done on the quantitative aspects of security. The analogy with software reliability can help in developing similar measures for software security. However, there are significant differences that need to be identified and appropriately acknowledged. This work examines the feasibility of quantitatively characterizing major attributes of security using its analogy with reliability. In particular, we investigate whether it is possible to predict the number of vulnerabilities that can potentially be identified in a current or future release of a software system using analytical modeling techniques.
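    One well-known analytical model from this line of work is the logistic Alhazmi-Malaiya (AML) vulnerability discovery model, in which the cumulative number of discovered vulnerabilities follows an S-shaped curve that saturates at the total number eventually found. The sketch below implements that logistic form; the parameter values are illustrative, not fitted to any real dataset.

    ```python
    import math

    def aml_cumulative_vulns(t, A, B, C):
        """Logistic (AML-style) vulnerability discovery model: the
        cumulative vulnerabilities Omega(t) found by time t solve
        dOmega/dt = A * Omega * (B - Omega), giving
        Omega(t) = B / (B * C * exp(-A * B * t) + 1),
        where B is the total number eventually found and A, C are
        rate/initial-condition parameters."""
        return B / (B * C * math.exp(-A * B * t) + 1)

    # Illustrative (not fitted) parameters: discovery is slow at first,
    # accelerates, then saturates near B as the code base is picked over.
    A, B, C = 0.002, 100.0, 0.5
    early = aml_cumulative_vulns(1, A, B, C)
    late = aml_cumulative_vulns(60, A, B, C)
    ```

    Fitting such a curve to a system's vulnerability discovery history is one way to estimate how many vulnerabilities remain undiscovered in a release.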
  • ItemOpen Access
    Towards interactive analytics over voluminous spatiotemporal data using a distributed, in-memory framework
    (Colorado State University. Libraries, 2023) Mitra, Saptashwa, author; Pallickara, Sangmi Lee, advisor; Pallickara, Shrideep, committee member; Ortega, Francisco, committee member; Li, Kaigang, committee member
    The proliferation of heterogeneous data sources, driven by advancements in sensor networks, simulations, and observational devices, has reached unprecedented levels. This surge in data generation, and the demand for proper storage, has been met with extensive research and development in distributed storage systems, facilitating the scalable housing of these voluminous datasets while enabling analytical processes. Nonetheless, extracting meaningful insights from these datasets, especially in the context of low-latency/interactive analytics, poses a formidable challenge. This arises from the persistent gap between the processing capacity of distributed systems and their ever-expanding storage capabilities. Moreover, interactive querying of these datasets is hindered by disk I/O, redundant network communications, recurrent hotspots, and transient surges of user interest over limited geospatial regions, particularly in systems that concurrently serve multiple users. In environments where interactive querying is paramount, such as visualization systems, addressing these challenges becomes imperative. This dissertation delves into the intricacies of enabling interactive analytics over large-scale spatiotemporal datasets. My research efforts are centered around the conceptualization and implementation of a scalable storage, indexing, and caching framework tailored specifically for spatiotemporal data access. The research aims to create frameworks that facilitate fast query analytics over diverse data types, ranging from point and vector to raster datasets. The frameworks implemented are characterized by their lightweight nature, their residence primarily in memory, and their capacity to support model-driven extraction of insights from raw data or dynamic reconstruction of compressed/partial in-memory data fragments with an acceptable level of accuracy. 
This approach effectively reduces the memory footprint of cached data objects and also mitigates the need for frequent client-server communications. Furthermore, we investigate the potential of leveraging various transfer learning techniques to improve the turnaround times of our memory-resident deep learning models, given the voluminous nature of our datasets, while maintaining good overall accuracy over their entire spatiotemporal domain. Additionally, our research explores the extraction of insights from high-dimensional datasets, such as satellite imagery, within this framework. The dissertation also includes empirical evaluations of our frameworks, as well as future directions and anticipated contributions in the domain of interactive analytics over large-scale spatiotemporal datasets, acknowledging the evolving landscape of data analytics, where analytics frameworks increasingly rely on compute-intensive machine learning models.
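    The in-memory spatiotemporal caching idea can be sketched in a few lines. This is a hypothetical simplification, not the dissertation's framework: records are binned by a coarse spatial grid cell plus a time bucket, and the least-recently-used cells are evicted first, so hot geospatial regions stay memory-resident while cold ones fall back to the distributed store.

    ```python
    from collections import OrderedDict

    def cell_key(lat, lon, t, cell_deg=1.0, t_bucket=3600):
        """Bin a (lat, lon, time) point into a coarse grid cell + time bucket."""
        return (int(lat // cell_deg), int(lon // cell_deg), int(t // t_bucket))

    class SpatiotemporalCache:
        """Tiny LRU cache keyed by spatiotemporal grid cell."""

        def __init__(self, capacity=4):
            self.capacity = capacity
            self.cells = OrderedDict()  # cell key -> list of records

        def put(self, lat, lon, t, record):
            key = cell_key(lat, lon, t)
            self.cells.setdefault(key, []).append(record)
            self.cells.move_to_end(key)          # mark cell as recently used
            while len(self.cells) > self.capacity:
                self.cells.popitem(last=False)   # evict the LRU cell

        def query(self, lat, lon, t):
            key = cell_key(lat, lon, t)
            records = self.cells.get(key)
            if records is not None:
                self.cells.move_to_end(key)      # a hit refreshes recency
            return records  # None signals a miss (fetch from backing store)
    ```

    Evicting whole cells rather than individual records matches the access pattern described above: user interest surges over limited geospatial regions, so entire regions heat up and cool down together.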
  • ItemOpen Access
    Subnetwork ensembles
    (Colorado State University. Libraries, 2023) Whitaker, Timothy J., author; Whitley, Darrell, advisor; Anderson, Charles, committee member; Krishnaswamy, Nikhil, committee member; Kirby, Michael, committee member
    Neural network ensembles have been effectively used to improve generalization by combining the predictions of multiple independently trained models. However, the growing scale and complexity of deep neural networks have led to these methods becoming prohibitively expensive and time-consuming to implement. Low-cost ensemble methods have become increasingly important as they can alleviate the need to train multiple models from scratch while retaining the generalization benefits that traditional ensemble learning methods afford. This dissertation introduces and formalizes a low-cost framework for constructing Subnetwork Ensembles, where a collection of child networks is formed by sampling, perturbing, and optimizing subnetworks from a trained parent model. We explore several distinct methodologies for generating child networks and we evaluate their efficacy through a variety of ablation studies and established benchmarks. Our findings reveal that this approach can greatly improve training efficiency, parametric utilization, and generalization performance while minimizing computational cost. Subnetwork Ensembles offer a compelling framework for exploring how we can build better systems by leveraging the unrealized potential of deep neural networks.
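    The sample-perturb-ensemble loop can be illustrated on a toy model. This is a deliberately minimal sketch, not the dissertation's actual method: the "parent" is a single weight vector, each child is formed by randomly masking (pruning) some parent weights and lightly perturbing the survivors, and the ensemble averages the children's outputs.

    ```python
    import random

    random.seed(0)
    parent = [0.5, -1.2, 0.8, 2.0]  # weights of a toy single-layer "parent"

    def sample_child(parent, keep_prob=0.75, noise=0.05):
        """Sample a child subnetwork: keep each parent weight with
        probability keep_prob (lightly perturbed), otherwise prune it."""
        child = []
        for w in parent:
            if random.random() < keep_prob:
                child.append(w + random.uniform(-noise, noise))  # perturbed
            else:
                child.append(0.0)                                # pruned
        return child

    def predict(weights, x):
        """Toy linear 'forward pass'."""
        return sum(w * xi for w, xi in zip(weights, x))

    # The ensemble prediction averages the children's outputs.
    children = [sample_child(parent) for _ in range(8)]
    x = [1.0, 0.5, -1.0, 2.0]
    ensemble_out = sum(predict(c, x) for c in children) / len(children)
    ```

    In the real setting each child would additionally be fine-tuned (the "optimizing" step) before ensembling, which is where the framework recovers accuracy lost to pruning while still avoiding full from-scratch training.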
  • ItemEmbargo
    Automated extraction of access control policy from natural language documents
    (Colorado State University. Libraries, 2023) Alqurashi, Saja, author; Ray, Indrakshi, advisor; Ray, Indrajit, committee member; Malaiya, Yashwant, committee member; Simske, Steve, committee member
    Data security and privacy are fundamental requirements in information systems. The first step in providing data security and privacy for organizations is defining access control policies (ACPs). Security requirements are often expressed in natural language, and ACPs are embedded in these security requirements. However, ACPs in natural language are unstructured and ambiguous, so manually extracting ACPs from security requirements and translating them into enforceable policies is tedious, complex, expensive, labor-intensive, and error-prone. Thus, automating the ACP specification process is crucial. In this thesis, we consider the Next Generation Access Control (NGAC) model as our reference formal access control model to study the automation process. This thesis addresses the research question: How do we automatically translate access control policies (ACPs) from natural language expression to the NGAC formal specification? Answering this research question entails building an automated extraction framework. The proposed framework aims to translate natural language ACPs into NGAC specifications automatically. The primary contributions of this research are developing models that automatically construct ACPs in the NGAC specification from natural language, and generating a realistic synthetic dataset of access control policy sentences to evaluate the proposed framework. Our experimental results are promising: on average, we achieved an F1-score of 93% when identifying ACP sentences, an F1-score of 96% when extracting NGAC relations between attributes, and F1-scores of 96% when extracting user attributes and 89% when extracting object attributes from natural language access control policies.
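    The thesis trains models for this extraction task; as a toy stand-in only, the rule below pulls (subject, action, object) triples out of simple policy sentences of the form "<subject> can/may <verb> <object>". Such triples are the raw material that would then be mapped onto NGAC user-attribute/object-attribute relations. The pattern and sentence are illustrative assumptions.

    ```python
    import re

    # Recognizes sentences like "A nurse may read the patient record."
    # Real ACP sentences are far more varied, which is why the thesis
    # uses learned models rather than a single regular expression.
    PATTERN = re.compile(
        r"^(?P<subject>[\w\s]+?)\s+(?:can|may|is allowed to)\s+"
        r"(?P<action>\w+)\s+(?P<object>[\w\s]+?)\.?$",
        re.IGNORECASE,
    )

    def extract_acp(sentence):
        """Return a (subject, action, object) triple, or None if the
        sentence is not recognized as an access control policy."""
        m = PATTERN.match(sentence.strip())
        if not m:
            return None
        return (m.group("subject").strip().lower(),
                m.group("action").lower(),
                m.group("object").strip().lower())

    triple = extract_acp("A nurse may read the patient record.")
    ```

    The two-stage structure here (first decide whether a sentence is an ACP at all, then extract its parts) mirrors the pipeline whose per-stage F1-scores are reported above.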
  • ItemOpen Access
    Machine learning-based phishing detection using URL features: a comprehensive review
    (Colorado State University. Libraries, 2023) Asif, Asif Uz Zaman, author; Ray, Indrakshi, advisor; Shirazi, Hossein, advisor; Ray, Indrajit, committee member; Wang, Haonan, committee member
    In a social engineering attack known as phishing, a perpetrator sends a false message to a victim while posing as a trusted representative in an effort to collect private information, such as login credentials and financial information, for personal gain. To carry out a phishing attack successfully, counterfeit websites, emails, and messages are used to trick the victim. Machine learning appears to be a promising technique for phishing detection. Typically, website content and Uniform Resource Locator (URL) based features are used. However, gathering website content features requires visiting malicious sites, and preparing the data is labor-intensive. Towards this end, researchers are investigating whether URL-only information can be used for phishing detection. This approach is lightweight and can be deployed at the client's end; it does not require data collection from malicious sites and can identify zero-day attacks. We conduct a systematic literature review on URL-based phishing detection. We selected papers that appeared in top conferences and journals in cybersecurity and that were either recent (published in 2018 or later) or highly cited (50+ citations in Google Scholar). This survey provides researchers and practitioners with information on the current state of research on URL-based website phishing attack detection methodologies. The results of this study show that the lack of a centralized dataset is, in one respect, beneficial, because it prevents attackers from seeing the features that classifiers employ; however, it makes the work time-consuming for researchers. Furthermore, both machine learning and deep learning algorithms can be utilized since they achieve very good classification accuracy, and in this work we found that Random Forest and Long Short-Term Memory are good choices of algorithms. 
Using task-specific lexical characteristics, rather than concentrating on the number of features, is essential for this work, because feature selection impacts how accurately algorithms detect phishing URLs.
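    To make "lexical, URL-only features" concrete, here is a hedged sketch of the kind of feature extraction the surveyed work relies on; the exact feature set varies from paper to paper, and the feature names below are illustrative. Vectors like these would then be fed to a classifier such as Random Forest or an LSTM.

    ```python
    from urllib.parse import urlparse

    def url_features(url: str) -> dict:
        """Extract simple lexical features from a URL string alone,
        without ever visiting the site."""
        parsed = urlparse(url)
        host = parsed.netloc
        return {
            "url_length": len(url),
            "host_length": len(host),
            "num_digits": sum(ch.isdigit() for ch in url),
            "num_subdomains": max(host.count(".") - 1, 0),
            "has_at_symbol": "@" in url,   # often used to disguise the real host
            "has_ip_host": host.replace(".", "").isdigit(),
            "uses_https": parsed.scheme == "https",
        }

    # A typical phishing pattern: a trusted brand buried in the subdomains
    # of an unrelated registered domain.
    feats = url_features("http://paypal.com.secure-login.example.net/update")
    ```

    Because extraction needs only the string itself, this is what makes the approach lightweight, client-side deployable, and applicable to zero-day phishing URLs, as noted above.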