Browsing by Author "Vijayasarathy, Leo R., committee member"
Item Open Access
An approach for testing the extract-transform-load process in data warehouse systems (Colorado State University. Libraries, 2018)
Homayouni, Hajar, author; Ghosh, Sudipto, advisor; Ray, Indrakshi, advisor; Bieman, James M., committee member; Vijayasarathy, Leo R., committee member

Enterprises use data warehouses to accumulate data from multiple sources for data analysis and research. Since organizational decisions are often made based on the data stored in a data warehouse, all its components must be rigorously tested. In this thesis, we first present a comprehensive survey of data warehouse testing approaches, and then develop and evaluate an automated testing approach for validating the Extract-Transform-Load (ETL) process, which is a common activity in data warehousing. In the survey, we present a classification framework that categorizes the testing and evaluation activities applied to the different components of data warehouses. These approaches include dynamic analysis as well as static evaluation and manual inspection. The classification framework uses information related to what is tested, in terms of the data warehouse component that is validated, and how it is tested, in terms of the various types of testing and evaluation approaches. We discuss the specific challenges and open problems for each component and propose research directions. The ETL process involves extracting data from source databases, transforming it into a form suitable for research and analysis, and loading it into a data warehouse. ETL processes can use complex one-to-one, many-to-one, and many-to-many transformations involving sources and targets that use different schemas, databases, and technologies. Since a faulty implementation in any of the ETL steps can result in incorrect information in the target data warehouse, ETL processes must be thoroughly validated. In this thesis, we propose automated balancing tests that check for discrepancies between the data in the source databases and that in the target warehouse. Balancing tests ensure that the data obtained from the source databases is not lost or incorrectly modified by the ETL process. First, we categorize and define a set of properties to be checked in balancing tests. We identify various types of discrepancies that may exist between the source and the target data, and formalize three categories of properties, namely completeness, consistency, and syntactic validity, that must be checked during testing. Next, we automatically identify source-to-target mappings from the ETL transformation rules provided in the specifications. We identify one-to-one, many-to-one, and many-to-many mappings for the tables, records, and attributes involved in the ETL transformations. We then automatically generate test assertions to verify the properties for balancing tests: the source-to-target mappings are used to generate assertions corresponding to each property, and the assertions compare the data in the target data warehouse with the corresponding data in the sources. We evaluate our approach on a health data warehouse that uses data sources with different data models running on different platforms. We demonstrate that our approach can find previously undetected real faults in the ETL implementation. We also provide an automatic mutation testing approach to evaluate the fault-finding ability of our balancing tests. Using mutation analysis, we demonstrated that our auto-generated assertions can detect faults in the data inside the target data warehouse when faulty ETL scripts execute on mock source data.
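The balancing-test idea above can be illustrated with a small, hedged sketch: for a hypothetical one-to-one table mapping, auto-generated assertions check completeness (no source records are lost) and consistency (a documented transformation is applied correctly). The table, attribute names, and the pounds-to-kilograms transformation below are invented for illustration and are not taken from the thesis's health data warehouse.

```python
# Hypothetical source and target rows for a one-to-one mapping.
source_patients = [
    {"id": 1, "weight_lb": 150.0},
    {"id": 2, "weight_lb": 200.0},
]
target_patients = [
    {"id": 1, "weight_kg": 68.04},
    {"id": 2, "weight_kg": 90.72},
]

def check_completeness(source, target):
    # Completeness: every source record reaches the target warehouse.
    assert {r["id"] for r in source} <= {r["id"] for r in target}, "records lost"

def check_consistency(source, target, tolerance=0.01):
    # Consistency: the target attribute equals the specified transformation
    # of the source attribute (here, pounds converted to kilograms).
    kg = {r["id"]: r["weight_kg"] for r in target}
    for r in source:
        assert abs(kg[r["id"]] - r["weight_lb"] * 0.45359) <= tolerance, \
            f"inconsistent value for record {r['id']}"

check_completeness(source_patients, target_patients)
check_consistency(source_patients, target_patients)
```

A syntactic validity assertion would additionally check that each target value conforms to the type and format expected by the target schema.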
Item Open Access
An empirical comparison of four Java-based regression test selection techniques (Colorado State University. Libraries, 2020)
Shin, Min Kyung, author; Ghosh, Sudipto, advisor; Moreno Cubillos, Laura, committee member; Vijayasarathy, Leo R., committee member

Regression testing is crucial to ensure that previously tested functionality is not broken by additions, modifications, and deletions to the program code. Since regression testing is an expensive process, researchers have developed regression test selection (RTS) techniques, which select and execute only those test cases that are impacted by the code changes. In general, an RTS technique has two main activities: (1) determining dependencies between the source code and test cases, and (2) identifying the code changes. Different approaches exist in the research literature to compute dependencies statically or dynamically at different levels of granularity. Likewise, code changes can be identified at different levels of granularity using different techniques. As a result, RTS techniques possess different characteristics related to the amount of reduction in test suite size, the time to select and run the test cases, test selection accuracy, and the fault detection ability of the selected subset of test cases. Researchers have empirically evaluated RTS techniques, but the evaluations were generally conducted using different experimental settings. This thesis compares four recent Java-based RTS techniques, Ekstazi, HyRTS, OpenClover, and STARTS, with respect to the above-mentioned characteristics using multiple revisions from five open source projects. It investigates the relationship between four program features and the performance of RTS techniques: total (program and test suite) size in KLOC, total number of classes, percentage of test classes over the total number of classes, and percentage of classes that changed between revisions. The results show that STARTS, a static RTS technique, overestimates dependencies between test cases and program code, and thus selects more test cases than the dynamic RTS techniques Ekstazi and HyRTS, even though all three identify code changes in the same way. OpenClover identifies code changes differently from Ekstazi, HyRTS, and STARTS, and selects more test cases. STARTS achieved the lowest safety violation with respect to Ekstazi, and HyRTS achieved the lowest precision violation with respect to both STARTS and Ekstazi. Overall, the average fault detection ability of the RTS techniques was 8.75% lower than that of the original test suite. STARTS, Ekstazi, and HyRTS achieved higher test suite size reduction on the projects with over 100 KLOC than on those with less than 100 KLOC. OpenClover achieved a higher test suite size reduction on the subjects that had a smaller total number of classes. The time reduction of OpenClover is affected by the combination of the number of source classes and the number of test cases in the subjects: the higher the number of test cases and source classes, the lower the time reduction.
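The selection step that these techniques share can be illustrated with a hedged, class-level sketch: given a map from test classes to the program classes they depend on and a set of changed classes, select every affected test. The dependency-map format and class names are invented for illustration and are not how Ekstazi, HyRTS, OpenClover, or STARTS actually represent dependencies.

```python
from typing import Dict, Set

def select_tests(dependencies: Dict[str, Set[str]],
                 changed_classes: Set[str]) -> Set[str]:
    """Select every test class that depends on at least one changed class."""
    return {
        test for test, deps in dependencies.items()
        if deps & changed_classes or test in changed_classes
    }

# Example: TestA depends on Foo and Bar, TestB only on Baz.
deps = {"TestA": {"Foo", "Bar"}, "TestB": {"Baz"}}
print(select_tests(deps, {"Bar"}))   # {'TestA'}
```

The practical differences among the four tools come from how such a map is built (statically or dynamically) and at what granularity changes are detected, which is what drives the reduction, accuracy, and time trade-offs measured in the thesis.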
Item Open Access
Anomaly detection and explanation in big data (Colorado State University. Libraries, 2021)
Homayouni, Hajar, author; Ghosh, Sudipto, advisor; Ray, Indrakshi, advisor; Bieman, James M., committee member; Ray, Indrajit, committee member; Vijayasarathy, Leo R., committee member

Data quality tests are used to validate the data stored in databases and data warehouses, and to detect violations of syntactic and semantic constraints. Domain experts grapple with the issues related to capturing all the important constraints and checking that they are satisfied. The constraints are often identified in an ad hoc manner based on the knowledge of the application domain and the needs of the stakeholders. Constraints can exist over single or multiple attributes as well as over records involving time series and sequences. Constraints involving multiple attributes can involve both linear and non-linear relationships among the attributes. We propose ADQuaTe as a data quality test framework that automatically (1) discovers different types of constraints from the data, (2) marks records that violate the constraints as suspicious, and (3) explains the violations. Domain knowledge is required to determine whether or not the suspicious records are actually faulty. The framework can incorporate feedback from domain experts to improve the accuracy of constraint discovery and anomaly detection. We instantiate ADQuaTe in two ways to detect anomalies in non-sequence and sequence data. The first instantiation (ADQuaTe2) uses an unsupervised approach, an autoencoder, for constraint discovery in non-sequence data. ADQuaTe2 is based on analyzing records in isolation to discover constraints among the attributes. We evaluate the effectiveness of ADQuaTe2 using real-world non-sequence datasets from the human health and plant diagnosis domains. We demonstrate that ADQuaTe2 can discover new constraints that were previously unspecified in existing data quality tests, and can report both previously detected and new faults in the data. We also use non-sequence datasets from the UCI repository to evaluate the improvement in the accuracy of ADQuaTe2 after incorporating ground truth knowledge and retraining the autoencoder model. The second instantiation (IDEAL) uses an unsupervised LSTM-autoencoder for constraint discovery in sequence data. IDEAL analyzes the correlations and dependencies among data records to discover constraints. We evaluate the effectiveness of IDEAL using datasets from Yahoo servers, NASA Shuttle, and the Colorado State University Energy Institute. We demonstrate that IDEAL can detect previously known anomalies in these datasets. Using mutation analysis, we show that IDEAL can detect different types of injected faults. We also demonstrate that the accuracy of the approach improves after incorporating ground truth knowledge about the injected faults and retraining the LSTM-autoencoder model. The novelty of this research lies in the development of a domain-independent framework that effectively and efficiently discovers different types of constraints from the data, detects and explains anomalous data, and minimizes false alarms through an interactive learning process.
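The reconstruction-error idea behind this framework can be sketched with a linear (PCA-style) stand-in for the autoencoder: records that the learned low-dimensional model cannot reconstruct well are flagged as suspicious. The synthetic data, the hidden constraint, and the threshold below are illustrative only; the thesis uses autoencoder and LSTM-autoencoder models, not PCA.

```python
import numpy as np

def fit_linear_model(X, k):
    # Center the data and keep the top-k principal directions (via SVD).
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k]

def reconstruction_errors(X, mu, components):
    Z = (X - mu) @ components.T          # encode into k dimensions
    X_hat = Z @ components + mu          # decode back to the original space
    return np.sqrt(((X - X_hat) ** 2).sum(axis=1))

def flag_suspicious(X, k=1, quantile=0.99):
    # Records with unusually large reconstruction error violate the learned
    # constraints and are marked as suspicious for expert review.
    mu, comps = fit_linear_model(X, k)
    errors = reconstruction_errors(X, mu, comps)
    return np.where(errors > np.quantile(errors, quantile))[0]

# Synthetic records obeying a hidden constraint (x1 ~ 2*x0, x2 ~ -x0) ...
rng = np.random.default_rng(0)
x0 = rng.normal(size=500)
X = np.column_stack([x0, 2 * x0, -x0]) + 0.05 * rng.normal(size=(500, 3))
X[7] = [1.0, -2.0, 1.0]                  # ... except record 7, which violates it
print(flag_suspicious(X))                # expected to include record 7
```

In the framework described above, records flagged this way are shown to domain experts, and their feedback is used to retrain the model and reduce false alarms.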
Item Open Access
Digital signatures to ensure the authenticity and integrity of synthetic DNA molecules (Colorado State University. Libraries, 2019)
Kar, Diptendu Mohan, author; Ray, Indrajit, advisor; Ray, Indrakshi, advisor; Vijayasarathy, Leo R., committee member; Peccoud, Jean, committee member

DNA synthesis has become increasingly common, and many synthetic DNA molecules are licensed intellectual property (IP). DNA samples are shared between academic labs, ordered from DNA synthesis companies, and manipulated for a variety of different purposes, mostly to study their properties and improve upon them. However, it is not uncommon for a sample to change hands many times with very little accompanying information and no proof of origin. This poses significant challenges to the original inventor of a DNA molecule trying to protect her IP rights. More importantly, following the anthrax attacks of 2001, there is an increased urgency to employ microbial forensic technologies to trace and track agent inventories. However, attribution of physical samples is next to impossible with existing technologies. In this research, we describe our efforts to solve this problem by embedding digital signatures in DNA molecules synthesized in the laboratory. We encounter several challenges that we do not face in the digital world. These challenges arise primarily from the fact that changes to a physical DNA molecule can affect its properties, random mutations can accumulate in DNA samples over time, DNA sequencers can sequence (read) DNA erroneously, and DNA sequencing is still relatively expensive (which means that laboratories would prefer not to read and re-read their DNA samples to obtain error-free sequences). We address these challenges and present a digital signature technology that can be applied to synthetic DNA molecules in living cells.
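As a toy illustration of the general idea only (not the scheme developed in the thesis, which must additionally tolerate mutations, sequencing errors, and sequencing cost), the sketch below signs a hypothetical synthetic sequence with Ed25519 and encodes the signature as nucleotides at two bits per base. It requires the third-party Python cryptography package, and the sequence is invented.

```python
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

BASES = "ACGT"

def bytes_to_bases(data: bytes) -> str:
    # Map every 2 bits of the signature to one nucleotide, MSB first.
    return "".join(BASES[(byte >> shift) & 0b11]
                   for byte in data for shift in (6, 4, 2, 0))

plasmid = "ATGGCGTACGTTAGC"                 # hypothetical synthetic sequence
key = Ed25519PrivateKey.generate()
signature = key.sign(plasmid.encode())       # 64-byte Ed25519 signature
signature_bases = bytes_to_bases(signature)  # 256 nucleotides to embed

# Verification raises InvalidSignature if the sequence or signature changed.
key.public_key().verify(signature, plasmid.encode())
print(len(signature_bases), "signature nucleotides")
```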
Item Open Access
Integration of task-attribute based access control model for mobile workflow authorization and management (Colorado State University. Libraries, 2019)
Basnet, Rejina, author; Ray, Indrakshi, advisor; Abdunabi, Ramadan, advisor; Ray, Indrajit, committee member; Vijayasarathy, Leo R., committee member

Workflow is the automation of process logistics for managing everything from simple everyday tasks to complex multi-user tasks. By defining a workflow with proper constraints, an organization can improve its efficiency, responsiveness, profitability, and security. In addition, mobile technology and cloud computing have enabled wireless data transmission and receipt, and allow workflows to be executed at any time and from any place. At the same time, security concerns arise because unauthorized users may gain access to sensitive data or services from lost or stolen nomadic devices. Additionally, some tasks and their associated information are location- and time-sensitive in nature. These security and usability challenges demand the employment of access control in a mobile workflow system to express fine-grained authorization rules for actors to perform tasks on-site and at certain time intervals. For example, if an individual is assigned a task to survey a certain location, it is crucial that the individual is present at that location while entering the data and that all data entered remotely is safe and secure. In this work, we formally defined an authorization model for mobile workflows. The authorization model was based on the NIST Next Generation Access Control (NGAC) standard, in which user attributes, resource attributes, and environment attributes decide who has access to what resources. In our model, we introduced the concept of a spatio-temporal zone attribute that captures when and where tasks may be executed. The model also captured the relationships between the various components and identified how they were dependent on time and location. It captured separation-of-duty constraints that prevented an authorized user from executing conflicting tasks, and task dependency constraints that imposed further restrictions on who could execute the tasks. The model was dynamic and allowed the access control configuration to change through obligations. The model had various constraints that may conflict with each other or introduce inconsistencies. Towards this end, we simulated the model using Timed Colored Petri Nets (TCPN) and ran queries to ensure the integrity of the model. The access control information was stored in the Neo4j graph database. We demonstrated the feasibility and usefulness of this method through performance analysis. Overall, we explored and verified the necessity of access control for both the security and the management of workflows. This work resulted in the development of secure, accountable, transparent, efficient, and usable workflows that could be deployed by enterprises.
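A minimal sketch of how a spatio-temporal zone attribute might enter an authorization decision is shown below; the attribute names, zone representation, and policy are hypothetical and far simpler than the model formalized and verified in the thesis.

```python
from dataclasses import dataclass
from datetime import time

@dataclass
class Zone:
    locations: set   # site identifiers where the task may be performed
    start: time      # beginning of the allowed time interval
    end: time        # end of the allowed time interval

def authorize(user_attrs: dict, task: dict, zone: Zone,
              request_location: str, request_time: time) -> bool:
    # The user must hold the role the task requires ...
    if task["required_role"] not in user_attrs.get("roles", ()):
        return False
    # ... and the request must fall inside the task's spatio-temporal zone.
    return (request_location in zone.locations
            and zone.start <= request_time <= zone.end)

survey_zone = Zone(locations={"site-42"}, start=time(8, 0), end=time(17, 0))
print(authorize({"roles": {"surveyor"}}, {"required_role": "surveyor"},
                survey_zone, "site-42", time(9, 30)))   # True
```

In the thesis model, such decisions are further constrained by separation-of-duty, task dependency, and obligation rules, and the configuration is checked for consistency using Timed Colored Petri Nets.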
Item Open Access
Trust based access control and its administration for smart IoT devices (Colorado State University. Libraries, 2024)
Promi, Zarin Tasnim, author; Ray, Indrajit, advisor; Ray, Indrakshi, committee member; Vijayasarathy, Leo R., committee member

In today's interconnected world, the security of Internet of Things (IoT) devices is paramount, given the range of smart devices from household appliances to industrial machinery. The continuous, long-term operation of IoT networks increases vulnerability to attacks, and the limited capabilities of IoT devices render standard security measures less effective. Traditional cryptographic methods used for establishing trust through identification and authentication face challenges in IoT contexts due to their computational demands and scalability concerns. Additionally, administration of these intricate networks can become extensive, and the presence of malicious or unskilled human operators can further increase security risks. To combat these issues, adopting a "Zero Trust - Never Trust, Always Verify" strategy is vital in IoT environments. Our approach involves creating an access control model based on device trust, which continuously evaluates the trustworthiness of connected devices and dynamically modifies their access rights according to their trust levels. This enables adaptive and fine-grained access control in IoT settings. Furthermore, we propose a trust-based administrative framework for configuring policy, enhancing security and administrative efficiency in IoT networks. As with the access control model, this approach continuously monitors operator behavior and adjusts operational privileges based on operators' actions.

Item Open Access
Unbiased phishing detection using domain name based features (Colorado State University. Libraries, 2018)
Shirazi, Hossein, author; Ray, Indrakshi, advisor; Malaiya, Yashwant K., committee member; Vijayasarathy, Leo R., committee member

Internet users are coming under a barrage of phishing attacks of increasing frequency and sophistication. While these attacks have been remarkably resilient against the vast range of defenses proposed by academia, industry, and research organizations, machine learning appears to be a promising approach for distinguishing between phishing and legitimate websites. There are three main concerns with existing machine learning approaches for phishing detection. The first concern is that there is neither a framework, preferably open source, for extracting features and keeping the dataset updated, nor an up-to-date dataset of phishing and legitimate websites. The second concern is the large number of features used and the lack of validating arguments for the choice of features selected to train the machine learning classifier. The last concern relates to the types of datasets used in the literature, which seem to be inadvertently biased with respect to features based on the URL or content. In this thesis, we describe the implementation of our open-source and extensible framework for extracting features and creating an up-to-date phishing dataset. Using this framework, named Fresh-Phish, we implemented 29 different features to detect whether a given website is legitimate or phishing. We used 26 features that were reported in related work, added 3 new features, and created a dataset of 6,000 websites (3,000 malicious and 3,000 genuine) with these features to test our approach. Using 6 different classifiers, we achieved an accuracy of 93%, which is reasonably high for this field. To address the second and third concerns, we put forward the intuition that the domain name of a phishing website is the tell-tale sign of phishing and holds the key to successful phishing detection. We focus on this aspect of phishing websites and design features that explore the relationship of the domain name to the key elements of the website. Our work differs from the existing state of the art in that our feature set ensures that there is minimal or no bias with respect to a dataset. Our learning model trains with only seven features and achieves a true positive rate of 98% and a classification accuracy of 97% on a sample dataset. Compared to the state-of-the-art work, our per-instance processing and classification is 4 times faster for legitimate websites and 10 times faster for phishing websites. Importantly, we demonstrate the shortcomings of using features based on URLs, as they are likely to be biased towards the dataset collection and usage. We show the robustness of our learning algorithm by testing our classifiers on unknown live phishing URLs and achieve a detection accuracy of 99.7%, compared to the earlier known best result of a 95% detection rate.
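To make the domain-name intuition concrete, the sketch below computes two illustrative domain-name-based features: whether the domain appears in the page title, and the fraction of hyperlinks that stay on the same domain. These are not the exact seven features used in the thesis, and the helper names and example values are hypothetical.

```python
from urllib.parse import urlparse

def registered_domain(url: str) -> str:
    # Crude registered-domain extraction (ignores multi-part public suffixes).
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])

def domain_in_title(url: str, page_title: str) -> bool:
    # Legitimate sites usually mention their own domain name in the title.
    return registered_domain(url).split(".")[0].lower() in page_title.lower()

def same_domain_link_ratio(url: str, hrefs: list) -> float:
    # Fraction of hyperlinks on the page that point back to the same domain.
    if not hrefs:
        return 0.0
    own = registered_domain(url)
    return sum(registered_domain(h) == own for h in hrefs) / len(hrefs)

features = [
    domain_in_title("https://example.com/login", "Example | Sign in"),
    same_domain_link_ratio("https://example.com/login",
                           ["https://example.com/help",
                            "https://cdn.other.net/x"]),
]
print(features)   # [True, 0.5]
```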