Department of Computer Science
This digital collection contains faculty/student publications, theses, and dissertations from the Department of Computer Science.
Browsing Department of Computer Science by Title
Now showing 1 - 20 of 323
Item Open Access
A comparative analysis of alternative splicing in A. thaliana and C. reinhardtii (Colorado State University. Libraries, 2010) Labadorf, Adam, author; Ben-Hur, Asa, advisor; Rajopadhye, Sanjay Vishnu, committee member; Reddy, Anireddy S. N., committee member
The extent of and mechanisms causing alternative splicing in plants are not currently well understood. A recent study in the model organism Arabidopsis thaliana estimates that approximately 42% of intron-containing genes are alternatively spliced, and it is speculated that this number may be much higher. Results from our previous studies showed that the single-celled alga Chlamydomonas reinhardtii also exhibits alternative splicing characteristic of plants. In this work we present the results of a comprehensive alternative splicing analysis using the largest Expressed Sequence Tag (EST) datasets available for both of these organisms, describe an analysis pipeline tailored to these large datasets, and conduct a cross-organism comparative analysis of aspects related to alternative splicing.

Item Open Access
A compiler for hierarchical task-based programming on distributed-memory (Colorado State University. Libraries, 2022) Dubois, Alexandre, author; Pouchet, Louis-Noël, advisor; Rajopadhye, Sanjay, committee member; Wilson, James, committee member
Today, computation-intensive applications are run on heterogeneous clusters of machines and use the Message Passing Interface (MPI), which provides a library interface for message passing between non-shared-memory computational resources but comes at a high application development cost. Task-based programming, such as the Concurrent Collections (CnC) model, makes parallelism implicit by describing only task dependencies. This model has recently been extended with a hierarchical task-based representation, which allows tasks to be viewed at different levels of decomposition and dispatched, level by level, to the most appropriate of the available architectures. The main work of this thesis was to design a graph-based algorithm that transforms a restricted class of single-level regular kernels into a two-level representation in the CnC model extended with hierarchical concepts. This transformation enables performance gains by reducing communication cost and by allowing the use of optimized task implementations at the coarse level. After describing the concepts of the CnC programming model, we develop the structure of the proposed graph-based tiling algorithm. We then present the compiler that implements this algorithm on an intermediate representation of a CnC program and generates a C++ version of the kernel using a new hierarchical CnC runtime. Finally, we characterize the overhead of this runtime on a shared-memory system for a 3D synthetic benchmark representing a classic linear-algebra dependency pattern. This characterization helps the user choose the target task-tile volume that must be given as input to the tiling algorithm. The main recommendation is to minimize the number of super-tasks in the runtime while keeping the number of sub-tasks in every super-task on the order of 10,000.

Item Open Access
A comprehensive compendium of Arabidopsis RNA-seq data (Colorado State University. Libraries, 2020) Halladay, Gareth A., author; Ben-Hur, Asa, advisor; Chitsaz, Hamidreza, committee member; Reddy, Anireddy, committee member
In the last fifteen years, the amount of publicly available genomic sequencing data has doubled every few months. Analyzing large collections of RNA-seq datasets can provide insights that are not available when analyzing data from single experiments. There are barriers to such analyses: combining processed data is challenging because varying processing methods make it difficult to compare data across studies, and combining data in raw form is challenging because of the resources needed to process the data. Multiple RNA-seq compendia, which are curated sets of RNA-seq data that have been pre-processed in a uniform fashion, exist; however, there is no such resource for plants. We created a comprehensive compendium for Arabidopsis thaliana using a pipeline based on Snakemake. We downloaded over 80 Arabidopsis studies from the Sequence Read Archive. Through a strict set of criteria, we chose 35 studies containing a total of 700 biological replicates, with a focus on the response of different Arabidopsis tissues to a variety of stresses. To make the studies comparable, we hand-curated the metadata and pre-processed and analyzed each sample using our pipeline. We performed exploratory analysis on the samples in our compendium for quality control and to identify biologically distinct subgroups, using PCA and t-SNE. We discuss the differences between these two methods and show that the data separates primarily by tissue type and, to a lesser extent, by the type of stress. We identified treatment conditions for each study and generated three lists: differentially expressed genes, differentially expressed introns, and genes that were differentially expressed under multiple conditions. We then visually analyzed these groups, looking for overarching patterns within the data, and found around a thousand genes that participate in stress response across tissues and stresses.
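The exploratory analysis described above maps naturally to a short script. The sketch below shows the general pattern, assuming a hypothetical samples-by-genes expression matrix and scikit-learn's PCA and t-SNE implementations; it illustrates the kind of dimensionality reduction used, not the compendium's actual pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Hypothetical expression matrix: rows are samples, columns are genes (e.g., TPM values).
rng = np.random.default_rng(0)
expression = rng.gamma(shape=2.0, scale=3.0, size=(700, 5000))

# Log-transform to tame the dynamic range before dimensionality reduction.
log_expr = np.log2(expression + 1.0)

# Linear view of the data: the first two principal components.
pca_coords = PCA(n_components=2).fit_transform(log_expr)

# Non-linear view: t-SNE, typically run on a PCA-reduced matrix for speed.
reduced = PCA(n_components=50).fit_transform(log_expr)
tsne_coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(reduced)

# pca_coords and tsne_coords can then be plotted and colored by tissue or stress labels.
print(pca_coords.shape, tsne_coords.shape)
```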
Item Open Access
A domain-protocol mapping based middleware for distributed application development (Colorado State University. Libraries, 2014) Mandalaparty, Sai Pradeep, author; France, Robert, advisor; Rajopadhye, Sanjay, committee member; Young, Peter, committee member
Distributed systems such as the Internet of Things, sensor networks, and networked control systems are used in various application domains, including industrial, environmental, medical, and energy-management domains. A distributed application in these domains may need to access data from different devices, which may all be of the same type or a combination of different types. In addition, these devices may communicate through standardized protocols or proprietary interfaces. The development of such a distributed application may also require a team of developers with expertise in different disciplines. Therefore, application development that involves heterogeneous devices and multidisciplinary teams can be made more effective by introducing an interface layer that shields developers from aspects of software and hardware heterogeneity. This work proposes a 'domain-protocol mapping' technique that is implemented as a middleware framework. The proposed mapping method maps the application data schema, represented as an object-oriented domain object, to the appropriate communication-protocol packet data and also updates the domain object from the response packet data. The middleware provides APIs for domain experts to read data from a device or to write data to it. The marshalling and unmarshalling of domain objects are hidden from the domain expert, who may or may not be a software engineer. The use of the developed middleware is illustrated in two case studies, one involving a simulation of distributed network controls for a power system and the other involving the integration of different types of power meters in a power-monitoring application.
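The core of the domain-protocol mapping idea, an object schema driving marshalling to and from packet bytes, can be illustrated in a few lines. This is a minimal sketch under assumed names and an assumed fixed packet layout (a hypothetical meter reading), not the middleware's actual API or wire format.

```python
import struct
from dataclasses import dataclass

# Hypothetical fixed-layout packet: device id (uint16), voltage and current (float32),
# in network byte order. A real mapping would be derived from the protocol specification.
PACKET_FORMAT = "!Hff"

@dataclass
class MeterReading:
    device_id: int
    voltage: float
    current: float

def to_packet(reading: MeterReading) -> bytes:
    """Marshal the domain object into protocol packet bytes."""
    return struct.pack(PACKET_FORMAT, reading.device_id, reading.voltage, reading.current)

def from_packet(payload: bytes) -> MeterReading:
    """Unmarshal a response packet back into the domain object."""
    device_id, voltage, current = struct.unpack(PACKET_FORMAT, payload)
    return MeterReading(device_id, voltage, current)

packet = to_packet(MeterReading(device_id=7, voltage=229.8, current=1.5))
print(from_packet(packet))
```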
Item Open Access
A framework for profiling spatial variability in the performance of classification models (Colorado State University. Libraries, 2024-04-03) Warushavithana, Menuka, author; Barram, Kassidy, author; Carlson, Caleb, author; Mitra, Saptashwa, author; Ghosh, Sudipto, author; Breidt, Jay, author; Pallickara, Sangmi Lee, author; Pallickara, Shrideep, author; ACM, publisher
Scientists use models to further their understanding of phenomena and inform decision-making. A confluence of factors has contributed to an exponential increase in spatial data volumes. In this study, we describe our methodology for identifying spatial variation in the performance of classification models. Our methodology allows tracking a host of performance measures across different thresholds for the larger, encapsulating spatial area under consideration. It ensures frugal utilization of resources via a novel validation budgeting scheme that preferentially allocates observations for validation. We complement these efforts with a browser-based, GPU-accelerated visualization scheme that also incorporates support for streaming to assimilate validation results as they become available.

Item Open Access
A framework for real-time, autonomous anomaly detection over voluminous time-series geospatial data streams (Colorado State University. Libraries, 2014) Budgaga, Walid, author; Pallickara, Shrideep, advisor; Pallickara, Sangmi Lee, advisor; Ben-Hur, Asa, committee member; Schumacher, Russ, committee member
In this research work we present an approach encompassing both algorithm and system design to detect anomalies in data streams. Individual observations within these streams are multidimensional, with each dimension corresponding to a feature of interest. We consider time-series geospatial datasets generated by remote and in situ observational devices. Three aspects make this problem particularly challenging: (1) the cumulative volume and rates of data arrivals, (2) anomalies evolve over time, and (3) there are spatio-temporal correlations associated with the data. Therefore, anomaly detection must be accurate and performed in real time. Given the data volumes involved, solutions must minimize user intervention and be amenable to distributed processing to ensure scalability. Our approach achieves accurate, high-throughput classifications in real time. We rely on Expectation Maximization (EM) to build Gaussian Mixture Models (GMMs) that model the densities of the training data. Rather than one all-encompassing model, our approach involves multiple model instances, each of which is responsible for a particular geographical extent and can also adapt as data evolves. We have incorporated these algorithms into our distributed storage platform, Galileo, and profiled their suitability through empirical analysis that demonstrates high throughput (10,000 observations per second, per node) and low latency on real-world datasets.
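The density-based detection scheme above can be sketched compactly: fit a GMM per geographical extent with EM and flag observations whose likelihood under that region's model is unusually low. The sketch below uses scikit-learn's GaussianMixture; the feature layout, component count, and percentile threshold are illustrative assumptions rather than the settings used in Galileo.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Hypothetical training data: per-region arrays of multidimensional observations.
training = {
    "region-A": rng.normal(loc=[20.0, 60.0], scale=[3.0, 5.0], size=(5000, 2)),
    "region-B": rng.normal(loc=[5.0, 80.0], scale=[2.0, 4.0], size=(5000, 2)),
}

# One model instance per geographical extent, fit with EM.
models = {region: GaussianMixture(n_components=3, random_state=0).fit(X)
          for region, X in training.items()}

# Threshold on log-likelihood: anything below the 1st percentile of training scores is flagged.
thresholds = {region: np.percentile(models[region].score_samples(X), 1.0)
              for region, X in training.items()}

def is_anomaly(region, observation):
    score = models[region].score_samples(np.atleast_2d(observation))[0]
    return score < thresholds[region]

print(is_anomaly("region-A", [21.0, 62.0]))   # typical observation -> False
print(is_anomaly("region-A", [90.0, -10.0]))  # far from the learned density -> True
```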
Item Open Access
A framework for resource efficient profiling of spatial model performance (Colorado State University. Libraries, 2022) Carlson, Caleb, author; Pallickara, Shrideep, advisor; Pallickara, Sangmi Lee, advisor; Adams, Henry, committee member
We design models to understand phenomena, make predictions, and/or inform decision-making. This study targets models that encapsulate spatially evolving phenomena. Given a model M, our objective is to identify how well the model predicts across all geospatial extents. A modeler may expect these validations to occur at varying spatial resolutions (e.g., states, counties, towns, census tracts). Assessing a model with all available ground-truth data is infeasible due to the data volumes involved. We propose a framework to assess the performance of models at scale over diverse spatial data collections. Our methodology ensures orchestration of validation workloads while reducing memory strain, alleviating contention, enabling concurrency, and ensuring high throughput. We introduce the notion of a validation budget that represents an upper bound on the total number of observations used to assess the performance of models across spatial extents. The validation budget attempts to capture the distribution characteristics of observations and is informed by multiple sampling strategies. Our design allows us to decouple the validation from the underlying model-fitting libraries to interoperate with models designed using different libraries and analytical engines; our advanced research prototype currently supports Scikit-learn, PyTorch, and TensorFlow. We have conducted extensive benchmarks that demonstrate the suitability of our methodology.
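One plausible way to read the validation-budget idea is as a capped, proportional allocation of observations across spatial extents. The sketch below implements that reading with a simple proportional strategy; the function name, the proportional rule, and the example counts are assumptions for illustration and not the framework's actual budgeting scheme.

```python
def allocate_validation_budget(available, budget):
    """Split a total observation budget across spatial extents, proportionally to how
    many ground-truth observations each extent has, never exceeding what is available."""
    total = sum(available.values())
    allocation = {extent: min(count, int(budget * count / total))
                  for extent, count in available.items()}
    # Hand out any remainder (from rounding) to the extents with spare observations.
    leftover = budget - sum(allocation.values())
    for extent, count in sorted(available.items(), key=lambda kv: -kv[1]):
        if leftover <= 0:
            break
        extra = min(leftover, count - allocation[extent])
        allocation[extent] += extra
        leftover -= extra
    return allocation

# Hypothetical counts of available ground-truth observations per county.
counts = {"larimer": 120_000, "weld": 45_000, "denver": 300_000, "gilpin": 900}
print(allocate_validation_budget(counts, budget=50_000))
```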
Item Open Access
A GPU accelerated RNA-RNA interaction program (Colorado State University. Libraries, 2021) Gildemaster, Brandon, author; Rajopadhye, Sanjay, advisor; Chitsaz, Hamidreza, committee member; Abdo, Zaid, committee member
RNA-RNA interaction (RRI) is important in processes like gene regulation and is known to play roles in diseases including cancer and Alzheimer's. Large RRI computations run for days, weeks, or even months, because the algorithms have time and space complexity of, respectively, O(N³M³) and O(N²M²) for sequences of lengths N and M, and there is a need for high-throughput RRI tools. GPU parallelization of such algorithms is a challenge. We first show that the most computationally expensive part of base-pair maximization (BPM) algorithms comprises O(N³) instances of upper-banded tropical matrix products. We develop the first GPU library for this, attaining close to the theoretical machine peak (TMP). We next optimize other (fifth-degree polynomial) terms in the computation and develop the first GPU implementation of the complete BPMax algorithm. We attain 12% of GPU TMP, a significant speedup over the original parallel CPU implementation, which attains less than 1% of CPU TMP. We also perform a large-scale study of three small viral RNAs hypothesized to be relevant to COVID-19.
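The kernel singled out above, a tropical (max, +) matrix product, is easy to state in a few lines of numpy. The dense, unbanded version below is only meant to make the operation concrete; it is not the upper-banded GPU library described in the thesis.

```python
import numpy as np

def tropical_matmul(A, B):
    """(max, +) matrix product: C[i, j] = max_k (A[i, k] + B[k, j]).

    This replaces the (+, *) of ordinary matrix multiplication with (max, +),
    which is the recurrence shape at the heart of base-pair maximization."""
    assert A.shape[1] == B.shape[0]
    # Broadcast all pairwise sums into an (n, k, m) tensor, then reduce over k.
    return (A[:, :, None] + B[None, :, :]).max(axis=1)

A = np.array([[0.0, 2.0],
              [1.0, 0.0]])
B = np.array([[0.0, 1.0],
              [3.0, 0.0]])
print(tropical_matmul(A, B))
# [[5. 2.]
#  [3. 2.]]
```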
Item Open Access
A heuristic-based approach to automatically extract personalized attack graph related concepts from vulnerability descriptions (Colorado State University. Libraries, 2017) Mukherjee, Subhojeet, author; Ray, Indrajit, advisor; Ray, Indrakshi, committee member; Byrne, Zinta, committee member
Computer users are not safe, be it at home or in public places. Public networks are more often administered by trained individuals who attempt to fortify those networks using strong administrative skills, state-of-the-art security tools, and meticulous vigilance. This is, however, not true for home computer users. Being largely untrained, they are often the most likely targets of cyber attacks. These attacks are often executed in cleverly interleaved sequences leading to the eventual goal of the attacker. The Personalized Attack Graph (PAG) introduced by Ubranska et al. [24, 25, 32] can leverage the interplay of system configurations, attacker actions, and user actions to represent a cleverly interleaved sequence of attacks on a single system. An instance of the PAG can be generated manually by observing the system configurations of a computer and collating them with possible security threats that can exploit existing system vulnerabilities and/or misconfigurations. However, the amount of manual labor involved in creating and periodically updating the PAG can be very high. As a result, attempts should be made to automate the process of generating the PAG. The information required to generate these graphs is available on the Internet in the form of vulnerability descriptions. This information is, however, almost always written in natural language and lacks any form of structure. In this thesis, we propose an unsupervised heuristic-based approach that parses vulnerability descriptions and extracts instances of PAG-related concepts like system configurations, attacker actions, and user actions. Extracted concepts can then be interleaved to generate the Personalized Attack Graph.

Item Open Access
A hybrid model checking approach to analysing rule conformance applied to HIPAA privacy rules (Colorado State University. Libraries, 2017) Bennett, Phillipa, author; Bieman, James, advisor; Georg, Geri, advisor; Turk, Daniel, committee member; Ghosh, Sudipto, committee member
Many of today's computing systems must show evidence of conformance to rules. The rules may come from business protocol choices or from multi-jurisdictional sources. Some examples are the rules that come from the regulations in the Health Insurance Portability and Accountability Act (HIPAA), protecting the privacy of patient information, and the Family Educational Rights and Privacy Act (FERPA), protecting the privacy of student education records. The rules impose additional requirements on already complex systems, and rigorous analysis is needed to show that any system implementing the rules exhibits conformance. If the analysis finds that a rule is not satisfied, we adjudge that the system fails conformance analysis and that it contains a fault, and this fault must be located in the system and fixed. The exhaustive analysis performed by model checking makes it suitable for showing that systems satisfy conformance rules. Conformance rules may be viewed in two, sometimes overlapping, categories: process-aware conformance rules that dictate process sequencing, and data-aware conformance rules that dictate acceptable system states. Where conformance rules relate to privacy, the analysis performed in model checking requires the examination of fine-grained structural details in the system state to show conformance to data-aware conformance rules. The analysis of these rules may make model checking intractable due to a state-space explosion when there are too many system states or too many details in a system state. To overcome this intractable complexity, various abstraction techniques have been proposed that achieve a smaller abstracted system-state model that is more amenable to model checking. These abstraction techniques are not useful when the abstractions hide the details necessary to verify conformance, and if non-conformance occurs, the abstraction may not allow isolation of the fault. In this dissertation, we introduce a Hybrid Model Checking Approach (HMCA) to analyse a system for both process- and data-aware conformance rules without abstracting the details from a system's detailed process and data models. Model checking requires an analysable model of the system under analysis, called a program graph, and a representation of the rules that can be checked on the program graph. In our approach, we use connections between a process-oriented model (e.g., a Unified Modelling Language (UML) activity model) and a data-oriented model (e.g., a UML class model) to create a unified paths-and-state system model. We represent this unified model as a UML state machine. The rule-relevant part of the state machine, along with a graph-oriented formalism of the rules, are the inputs to HMCA. The model checker uses an exhaustive unfolding of the program graph to produce a transition system showing all the program graph's reachable paths and states. Intractable complexity during model checking is encountered when trying to create the transition system. In HMCA, we use a divide-and-conquer approach that applies a slicing technique to the program graph to semi-automatically produce the transition system by analysing each slice individually and composing its result with the results from other slices. Our ability to construct the transition system from the slices relieves a traditional model checker of that step. We then return to model checking techniques to verify whether the transition system satisfies the rules. Since the analysis involves examining system states, if any of the rules are not satisfied, we can isolate the specific location of the fault from the details contained in the slices. We demonstrate our technique on an instance of a medical research system whose requirements include the privacy rules mandated by HIPAA. Our technique found seeded faults for common mistakes in logic that led to non-conformance and for underspecification leading to conflicts of interest in personnel relationships.

Item Open Access
A locality-aware scientific workflow engine for fast-evolving spatiotemporal sensor data (Colorado State University. Libraries, 2017) Kachikaran Arulswamy, Johnson Charles, author; Pallickara, Sangmi Lee, advisor; Pallickara, Shrideep, committee member; von Fischer, Joseph, committee member
Discerning knowledge from voluminous data involves a series of data manipulation steps. Scientists typically compose and execute workflows for these steps using scientific workflow management systems (SWfMSs). SWfMSs have been developed for several research communities, including, but not limited to, bioinformatics, biology, astronomy, computational science, and physics. Parallel execution of workflows has been widely employed in SWfMSs by exploiting the storage and computing resources of grid and cloud services. However, none of these systems has been tailored to the needs of spatiotemporal analytics on real-time sensor data with high arrival rates. This thesis demonstrates the development and evaluation of a target-oriented workflow model that enables a user to specify dependencies among the workflow components, including data availability. The underlying spatiotemporal data dispersion and indexing scheme provides fast data search and retrieval to plan and execute the computations comprising the workflow. This work includes a scheduling algorithm that aims to minimize data movement across machines while ensuring fair and efficient resource allocation among multiple users. The study includes empirical evaluations performed on the Google cloud.
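A toy version of the locality-aware scheduling idea helps make the trade-off concrete: place each task on the machine that already holds most of its input data, unless that machine is already over a per-user fair-share cap. The data structures, the fair-share rule, and all names here are illustrative assumptions, not the engine's actual scheduler.

```python
from collections import defaultdict

def schedule(tasks, machines, fair_share):
    """tasks: list of (task_id, user, {machine: bytes_of_input_local_to_machine}).
    Greedy placement that prefers data locality but respects a per-user cap per machine."""
    load = defaultdict(int)          # (machine, user) -> tasks assigned
    placement = {}
    for task_id, user, locality in tasks:
        # Rank machines by how much of this task's input they already hold.
        ranked = sorted(machines, key=lambda m: locality.get(m, 0), reverse=True)
        chosen = next((m for m in ranked if load[(m, user)] < fair_share), ranked[-1])
        placement[task_id] = chosen
        load[(chosen, user)] += 1
    return placement

tasks = [
    ("t1", "alice", {"node-1": 900, "node-2": 50}),
    ("t2", "alice", {"node-1": 800, "node-2": 700}),
    ("t3", "bob",   {"node-2": 600}),
]
print(schedule(tasks, machines=["node-1", "node-2"], fair_share=1))
# t1 -> node-1 (most data local); t2 -> node-2 (alice is at her cap on node-1); t3 -> node-2
```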
Item Open Access
A meta-modeling approach to specifying patterns (Colorado State University. Libraries, 2004) Kim, Dae-Kyoo, author; France, Robert B., advisor; Bieman, James M., committee member; Ghosh, Sudipto, committee member; Turk, Daniel E., committee member
A major goal in software development is to produce quality products in less time and at lower cost. Systematic reuse of software artifacts that encapsulate high-quality development experience can help one achieve this goal. Design patterns are a common form of reusable design experience that can help developers reduce development time. Prevalent design patterns are, however, described informally (e.g., [35]). This prevents systematic use of patterns. The research documented in this dissertation is aimed at developing a practical pattern specification technique that supports the systematic use of patterns during design modeling. A pattern specification language called the Role-Based Metamodeling Language (RBML) was developed as part of this research. The RBML specifies a pattern as a specialization of the UML metamodel. The RBML uses the Unified Modeling Language (UML) as a syntactic base to enable the use of UML modeling tools for creating and evolving pattern specifications. We used the RBML to develop specifications for design patterns in the Design Patterns book [35], including Abstract Factory, Bridge, Decorator, Observer, State, Iterator, and Visitor. We also used the RBML to define a large application-domain pattern for checkin-checkout systems, and used the resulting specification to develop UML designs for a library system and a car rental system. In addition, we used the RBML to specify access control mechanisms as patterns, including Role-Based Access Control (RBAC), Mandatory Access Control (MAC), and a Hybrid Access Control (HAC) that is a composition of RBAC and MAC. The RBML is currently used at NASA for modeling pervasive requirements as patterns. NASA also uses the RBML in the development of the Weather CTAS System, a weather forecasting system. To determine the potential of the RBML to support the development of tools that enable systematic use of patterns, we developed a prototype tool called the RBML Pattern Instantiator (RBML-PI) that generates conforming UML models from RBML pattern specifications.

Item Open Access
A methodology for automated lookup table optimization of scientific applications (Colorado State University. Libraries, 2012) Wilcox, Chris, author; Strout, Michelle Mills, advisor; Bieman, James M., advisor; Bohm, Anton P. W., committee member; Turk, Daniel, committee member
Tuning the performance of scientific codes is challenging because of their math-intensive nature. Applications such as climate modeling and molecular biology simulate the behavior of natural systems based on scientific equations. Translating these equations into code can often yield expressions that are expensive to evaluate. Trigonometric, logarithmic, and exponential elementary functions are especially problematic because of their high cost relative to ordinary arithmetic. Lookup table (LUT) transformation can expedite function evaluation by precomputing and storing function results, thereby allowing inexpensive memory lookups to replace costly function calls. Practical concerns limit the size and accuracy of LUT data; thus the technique represents a tradeoff between error and performance. Current practice has the programmer apply each LUT transform manually, thereby impacting productivity, obfuscating code, and limiting programmer control over accuracy and performance. The goal of our research is to help scientific programmers use LUT techniques in a more effective and productive manner. Our approach substantially automates the process of applying LUT transformation via a methodology and its implementation in the Mesa tool. Mesa incorporates algorithms that make adding a LUT transform easier for the programmer, including expression enumeration, domain profiling, error analysis, performance modeling, and code generation. In addition, we introduce a novel algorithm for LUT optimization that analyzes application code and helps the programmer identify the most beneficial set of expressions to transform. We demonstrate the effectiveness of our algorithms by using Mesa to perform case studies of scientific applications. Mesa achieves performance speedups of 1.4X to 6.9X without significantly compromising application accuracy. Finally, we present evidence that our methodology allows the scientific programmer to be more productive and effective.
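The LUT transformation itself is simple to demonstrate. The sketch below precomputes sin(x) on a bounded domain and replaces calls with a nearest-entry lookup; the domain, table size, and lookup policy are illustrative choices for showing the error/performance tradeoff, not code generated by Mesa.

```python
import numpy as np

# Precompute sin(x) over [0, 2*pi) at a fixed resolution; a bigger table means less error
# but more memory traffic, which is exactly the tradeoff the methodology navigates.
TABLE_SIZE = 4096
XMIN, XMAX = 0.0, 2.0 * np.pi
STEP = (XMAX - XMIN) / TABLE_SIZE
SIN_TABLE = np.sin(XMIN + STEP * np.arange(TABLE_SIZE))

def sin_lut(x):
    """Nearest-entry lookup standing in for a call to sin; assumes x lies in [XMIN, XMAX)."""
    idx = ((np.asarray(x) - XMIN) / STEP).astype(np.int64) % TABLE_SIZE
    return SIN_TABLE[idx]

xs = np.random.default_rng(2).uniform(XMIN, XMAX, 100_000)
print("max abs error:", np.abs(sin_lut(xs) - np.sin(xs)).max())
```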
Item Open Access
A methodology for evaluating multimodal referring expression generation for embodied virtual agents (Colorado State University. Libraries, 2023-10-09) Alalyani, Nada, author; Krishnaswamy, Nikhil, author; ACM, publisher
Robust use of definite descriptions in a situated space often involves recourse to both verbal and non-verbal modalities. For IVAs, virtual agents designed to interact with humans, the ability to both recognize and generate non-verbal and verbal behavior is a critical capability. To assess how well an IVA is able to deploy multimodal behaviors, including language, gesture, and facial expressions, we propose a methodology to evaluate the agent's capacity to generate object references in a situational context, using the domain of multimodal referring expressions as a use case. Our contributions include: 1) developing an embodied platform to collect human referring expressions while communicating with the IVA; 2) comparing human and machine-generated references in terms of evaluable properties using subjective and objective metrics; and 3) reporting preliminary results from trials that aimed to check whether the agent can retrieve and disambiguate the object the human referred to, whether the human can correct misunderstandings using language, deictic gesture, or both, and how easy humans find it to interact with the agent.

Item Open Access
A model for predicting the performance of sparse matrix vector multiply (SpMV) using memory bandwidth requirements and data locality (Colorado State University. Libraries, 2012) Dinkins, Stephanie, author; Strout, Michelle Mills, advisor; Rajopadhye, Sanjay V., committee member; Mueller, Jennifer L., committee member
Sparse matrix vector multiply (SpMV) is an important computation that is used in many scientific and structural engineering applications. Sparse computations like SpMV require the use of special sparse data structures that efficiently store and access non-zeros. However, sparse data structures can tax the memory bandwidth of a machine and limit the performance of a computation, which for these computations is typically less than 10% of a processor's peak floating-point performance. The goal of this thesis was to understand the trade-off between memory bandwidth needs and data locality for SpMV performance. We construct three performance models based on memory bandwidth requirements and the data locality associated with the non-zero orderings within a sparse data structure. Our approach uses metrics that can be applied to any sparse data structure to model SpMV performance. We use a data locality metric termed Manhattan distance to predict the data locality execution time of each non-zero ordering, which produced strong per-matrix results. Based on the per-matrix results for the Manhattan distance, we constructed a model that computes a linear fit for multiple parameters, but those results were not promising. We found that the memory bandwidth requirement of a data structure is a key component in predicting performance. Our final model uses memory bandwidth pressure and an averaged predicted data locality time to accurately predict the fastest-performing data structure 73-84% of the time, depending upon the machine. Our results indicate that often the sparse data structure with the least taxation on the memory bandwidth produces the fastest execution times.
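To make the bandwidth-versus-locality discussion concrete, here is a plain CSR SpMV kernel: every non-zero costs a value load and a column-index load, and the spread of column indices within a row governs the locality of the gathers from x. This generic sketch is for illustration only and is not one of the thesis's three performance models.

```python
import numpy as np

def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x for a matrix stored in CSR form."""
    n_rows = len(row_ptr) - 1
    y = np.zeros(n_rows)
    for i in range(n_rows):
        start, end = row_ptr[i], row_ptr[i + 1]
        # Streaming reads of values/col_idx tax bandwidth; the indexed reads of x
        # are where the non-zero ordering's data locality shows up.
        y[i] = np.dot(values[start:end], x[col_idx[start:end]])
    return y

# 3x3 example matrix: [[10, 0, 2], [0, 3, 0], [0, 4, 5]]
values  = np.array([10.0, 2.0, 3.0, 4.0, 5.0])
col_idx = np.array([0, 2, 1, 1, 2])
row_ptr = np.array([0, 2, 3, 5])
print(spmv_csr(values, col_idx, row_ptr, np.array([1.0, 1.0, 1.0])))  # [12. 3. 9.]
```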
Item Open Access
A multi-level code comprehension model for large scale software (Colorado State University. Libraries, 1996) Vans, A. Marie, author; von Mayrhauser, Anneliese, advisor; Bieman, James, committee member; Olender, Kurt, committee member; Volbrecht, Vicki, committee member
For the past 20 years, researchers have studied how programmers understand code they did not write. Most of this research has concentrated on small-scale code understanding. We consider it necessary to design studies that observe programmers working on large-scale code in production environments. We describe the design and implementation of such a study, which included 11 maintenance engineers working on various maintenance tasks. The objective is to build a theory based on observations of programmers working on real tasks. Results show that programmers understand code at different levels of abstraction. Expertise in the application domain, the amount of prior experience with the code, and the task can determine the types of actions taken during maintenance, the level of abstraction at which the programmer works, and the information needed to complete a maintenance task. A better grasp of how programmers understand large-scale code, and of what is most efficient and effective, can lead to better tools, better maintenance guidelines, and better documentation.

Item Open Access
A parametric classification of directed acyclic graphs (Colorado State University. Libraries, 2017) Chaturvedi, Mmanu, author; McConnell, Ross M., advisor; Kirby, Michael J., committee member; Rajopadhye, Sanjay V., committee member; Oprea, Iuliana, committee member
We consider four NP-hard optimization problems on directed acyclic graphs (DAGs), namely max clique, min coloring, max independent set, and min clique cover. It is well known that these four problems can be solved in polynomial time on transitive DAGs. It is also known that there can be no polynomial O(n^(1-ϵ))-approximation algorithms for these problems on the general class of DAGs unless P = NP. We propose a new parameter, β, as a measure of departure from transitivity for DAGs. We define β to be the number of vertices in a longest path in the DAG such that there is no edge from the first to the last vertex of the path, and 2 if the graph is transitive. Different values of β define a hierarchy of classes of DAGs, starting with the class of transitive DAGs. We give a polynomial-time algorithm for finding a max clique when β is bounded by a fixed constant. The algorithm is exponential in β, but we also give a polynomial β-approximation algorithm. We prove that the other three decision problems are NP-hard even for β ≥ 4 and give polynomial algorithms with approximation bounds of β or better in each case. Furthermore, generalizing the definition of quasi-transitivity introduced by Ghouila-Houri, we define β-quasi-transitivity and prove a more general version of their theorem relating quasi-transitive orientation and transitive orientation.
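The parameter β defined above is easy to make concrete with a brute-force check: over all vertex pairs not joined by an edge, take the vertex count of a longest path between them, falling back to 2 when the DAG is transitive. The sketch below uses networkx and is exponential in the graph size, so it only illustrates the definition on small examples and is not one of the algorithms contributed by the thesis.

```python
import itertools
import networkx as nx

def beta(G: nx.DiGraph) -> int:
    """Departure-from-transitivity parameter: the number of vertices in a longest path
    whose first and last vertices are not joined by an edge; 2 if the DAG is transitive."""
    best = 0
    for u, v in itertools.permutations(G.nodes, 2):
        if G.has_edge(u, v):
            continue
        # Enumerate simple paths from u to v (exponential; fine for tiny examples only).
        for path in nx.all_simple_paths(G, u, v):
            best = max(best, len(path))
    return best if best >= 2 else 2  # no such path exists: the DAG is transitive

print(beta(nx.DiGraph([("a", "b"), ("b", "c")])))              # 3: no shortcut edge a -> c
print(beta(nx.DiGraph([("a", "b"), ("b", "c"), ("a", "c")])))  # 2: transitive
```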
Item Open Access
A performance evaluation of the coupling infrastructure within the Community Earth System Model™ (Colorado State University. Libraries, 2018) Mickelson, Sheri A., author; Pouchet, Louis-Noel, advisor; Rajopadhye, Sanjay, committee member; Randall, David, committee member
Earth System Models (ESMs) are complex software packages composed of millions of lines of code used to simulate many different Earth processes. ESMs simulate the dynamical and physical states of the atmosphere, land, ocean, sea ice, rivers, and glaciers and coordinate the interactions between them. Many computational challenges exist within these models, and future systems are putting more pressure on these challenges. To alleviate some of this pressure, it is important to study the performance challenges that exist within the models in order to understand the optimizations that need to be performed as we move to exascale systems. This work studies the performance of the coupling infrastructure between the modeling components within the Community Earth System Model. The coupler is responsible for the data exchanges between the different modeling components, and while it has a small computational footprint, it can have a large impact on performance if the component resources are dispersed in incorrect proportions. This work explains and addresses this issue and provides easy solutions for users to save thousands of CPU core hours.

Item Open Access
A questionnaire integration system based on question classification and short text semantic textual similarity (Colorado State University. Libraries, 2018) Qiu, Yu, author; Pallickara, Sangmi Lee, advisor; Pallickara, Shrideep, committee member; Li, Kaigang, committee member
Semantic integration from heterogeneous sources involves a series of NLP tasks. Existing research has focused mainly on measuring the similarity of paired sentences. However, when searching for possibly identical texts between two datasets, the sentences are not paired. To avoid pair-wise comparison, this thesis proposes a semantic similarity measuring system equipped with a precategorization module. It applies a hybrid question classification module, which subdivides all texts into coarse categories; the sentences are then paired from within these subcategories. The core task is to detect identical texts between two sentences, which relates to the semantic textual similarity task in the NLP field. We built a short-text semantic textual similarity measuring module that combines conventional NLP techniques, including both semantic and syntactic features, with a Recurrent Convolutional Neural Network to form an ensemble model. We also conducted a set of empirical evaluations. The results show that our system possesses a degree of generalization ability and performs well on heterogeneous sources.
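The precategorize-then-compare flow can be sketched in a few lines: bucket questions by a coarse category first, then score similarity only within matching buckets. For brevity the sketch below uses a keyword heuristic in place of the hybrid question classifier and TF-IDF cosine similarity in place of the RCNN ensemble; all names and thresholds are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def categorize(question):
    """Crude keyword stand-in for the question-classification step."""
    q = question.lower()
    if q.startswith(("how often", "how many", "how much")):
        return "quantity"
    if q.startswith(("do you", "did you", "have you")):
        return "yes/no"
    return "other"

def match_questions(set_a, set_b, threshold=0.5):
    """Pair likely-identical questions across two questionnaires, comparing only
    within a shared coarse category to avoid full pair-wise comparison."""
    pairs = []
    for category in {categorize(q) for q in set_a}:
        a = [q for q in set_a if categorize(q) == category]
        b = [q for q in set_b if categorize(q) == category]
        if not a or not b:
            continue
        vectorizer = TfidfVectorizer().fit(a + b)
        sims = cosine_similarity(vectorizer.transform(a), vectorizer.transform(b))
        for i, row in enumerate(sims):
            j = row.argmax()
            if row[j] >= threshold:
                pairs.append((a[i], b[j], round(float(row[j]), 2)))
    return pairs

print(match_questions(["How often do you exercise each week?"],
                      ["How often do you work out each week?"]))
```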
Item Open Access
A scenario-based technique to analyze UML design class models (Colorado State University. Libraries, 2014) Yu, Lijun, author; France, Robert B., advisor; Ray, Indrakshi, committee member; Ghosh, Sudipto, committee member; Malaiya, Yashwant, committee member; Turk, Dan, committee member
Identifying and resolving design problems in the early design phases can help reduce the number of design errors in implementations. In this dissertation, a tool-supported lightweight static analysis technique is proposed to rigorously analyze UML design class models that include operations specified using the Object Constraint Language (OCL). A UML design class model is analyzed against a given set of scenarios that describe desired or undesired behaviors. The technique can leverage existing class model analysis tools such as USE and OCLE. The analysis technique is lightweight in that it analyzes functionality specified in a UML design class model within the scope of a given set of scenarios. It is static because it does not require that the UML design class model be executable. The technique is used to (1) transform a UML design class model into a snapshot transition model that captures valid state transitions, (2) transform the given scenarios into snapshot transitions, and (3) determine whether the snapshot transitions conform to the snapshot transition model. A design inconsistency exists if snapshot transitions that represent desired behaviors do not conform to the snapshot transition model, or if snapshot transitions representing undesired behaviors do conform to it. A Scenario-based UML Design Analysis tool was developed using Kermeta and the Eclipse Modeling Framework. The tool can be used to transform an Ecore design class model into a snapshot transition model and to transform scenarios into snapshot transitions, and it is integrated with the USE analysis tool. We used the Scenario-based UML Design Analysis technique to analyze two design class models: a Train Management System model and a Generalized Spatio-Temporal RBAC model. The two demonstration case studies show how the technique can be used to analyze the inconsistencies between UML design class models and scenarios. We performed a pilot study to evaluate the effectiveness of the Scenario-based UML Design Analysis technique. In the pilot study, the technique uncovered at least as many design inconsistencies as manual inspection techniques did, and it did not report false inconsistencies. The pilot study provides some evidence that the Scenario-based UML Design Analysis technique is effective. The dissertation also proposes two scenario generation techniques that can be used to ease the manual effort needed to produce scenarios; these techniques automatically generate a family of scenarios that conform to specified scenario generation criteria.