Browsing by Author "Ben-Hur, Asa, committee member"
Now showing 1 - 20 of 33

Item Open Access
A framework for real-time, autonomous anomaly detection over voluminous time-series geospatial data streams (Colorado State University. Libraries, 2014)
Budgaga, Walid, author; Pallickara, Shrideep, advisor; Pallickara, Sangmi Lee, advisor; Ben-Hur, Asa, committee member; Schumacher, Russ, committee member
In this research work we present an approach encompassing both algorithm and system design to detect anomalies in data streams. Individual observations within these streams are multidimensional, with each dimension corresponding to a feature of interest. We consider time-series geospatial datasets generated by remote and in situ observational devices. Three aspects make this problem particularly challenging: (1) the cumulative volume and rates of data arrivals, (2) the evolution of anomalies over time, and (3) the spatio-temporal correlations associated with the data. Therefore, anomaly detection must be accurate and performed in real time. Given the data volumes involved, solutions must minimize user intervention and be amenable to distributed processing to ensure scalability. Our approach achieves accurate, high-throughput classifications in real time. We rely on Expectation Maximization (EM) to build Gaussian Mixture Models (GMMs) that model the densities of the training data. Rather than one all-encompassing model, our approach involves multiple model instances, each of which is responsible for a particular geographical extent and can also adapt as data evolves. We have incorporated these algorithms into our distributed storage platform, Galileo, and profiled their suitability through empirical analysis, which demonstrates high throughput (10,000 observations per second, per node) and low latency on real-world datasets.
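
As an editorial illustration of the per-region density-modeling scheme this abstract describes, the following is a minimal sketch using scikit-learn's GaussianMixture: one model per geographic extent, with observations flagged when their log-likelihood falls below a per-region threshold. The region keys, component count, and percentile threshold are illustrative assumptions, not details of the Galileo implementation.

```python
# Hypothetical sketch: fit one Gaussian Mixture Model per geographic extent and
# flag observations whose log-likelihood falls below a per-region threshold.
# Not the Galileo implementation; region keys, component counts, and the
# 1st-percentile threshold are illustrative assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_region_models(train, region_of, n_components=4):
    """train: (n, d) feature vectors; region_of: maps a row to a region key."""
    regions = {}
    for x in train:
        regions.setdefault(region_of(x), []).append(x)
    models, thresholds = {}, {}
    for key, rows in regions.items():
        gmm = GaussianMixture(n_components=n_components, covariance_type="full")
        gmm.fit(np.asarray(rows))
        scores = gmm.score_samples(np.asarray(rows))
        models[key] = gmm
        thresholds[key] = np.percentile(scores, 1)  # bottom 1% of training likelihoods
    return models, thresholds

def is_anomaly(x, models, thresholds, region_of):
    key = region_of(x)
    return models[key].score_samples(x.reshape(1, -1))[0] < thresholds[key]
```
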

Item Open Access
Algorithms for feature selection and pattern recognition on Grassmann manifolds (Colorado State University. Libraries, 2015)
Chepushtanova, Sofya, author; Kirby, Michael, advisor; Peterson, Chris, committee member; Bates, Dan, committee member; Ben-Hur, Asa, committee member
This dissertation presents three distinct application-driven research projects united by ideas and topics from geometric data analysis, optimization, computational topology, and machine learning. We first consider the hyperspectral band selection problem, solved using sparse support vector machines (SSVMs). A supervised embedded approach is proposed using the property of SSVMs to exhibit a model structure that includes a clearly identifiable gap between zero and non-zero feature vector weights, which permits important bands to be definitively selected in conjunction with the classification problem. An SSVM is trained using bootstrap aggregating to obtain a sample of SSVM models and reduce variability in the band selection process. This preliminary sampling approach for band selection is followed by a secondary band selection, which involves retraining the SSVM to further reduce the set of bands retained. We propose and compare three adaptations of the SSVM band selection algorithm for the multiclass problem. We illustrate the performance of these methods on two benchmark hyperspectral data sets. Second, we propose an approach for capturing the signal variability in data using the framework of the Grassmann manifold (Grassmannian). Labeled points from each class are sampled and used to form abstract points on the Grassmannian. The resulting points have representations as orthonormal matrices and as such do not reside in Euclidean space in the usual sense. There are a variety of metrics which allow us to determine distance matrices that can be used to realize the Grassmannian as an embedding in Euclidean space. Multidimensional scaling (MDS) determines a low-dimensional Euclidean embedding of the manifold, preserving or approximating the Grassmannian geometry based on the distance measure. We illustrate that we can achieve an isometric embedding of the Grassmann manifold using the chordal metric, while this is not the case with other distances. However, non-isometric embeddings generated by using the smallest principal angle pseudometric on the Grassmannian lead to the best classification results: we observe that as the dimension of the Grassmannian grows, the accuracy of the classification grows to 100% in binary classification experiments. To build a classification model, we use SSVMs to perform simultaneous dimension selection. The resulting classifier selects a subset of dimensions of the embedding without loss in classification performance. Lastly, we present an application of persistent homology to the detection of chemical plumes in hyperspectral movies. The pixels of the raw hyperspectral data cubes are mapped to the geometric framework of the Grassmann manifold where they are analyzed, contrasting our approach with the more standard framework in Euclidean space. An advantage of this approach is that it allows the time slices in a hyperspectral movie to be collapsed to a sequence of points in such a way that some of the key structure within and between the slices is encoded by the points on the Grassmannian. This motivates the search for topological structure, associated with the evolution of the frames of a hyperspectral movie, within the corresponding points on the manifold. The proposed framework affords the processing of large data sets, such as the hyperspectral movies explored in this investigation, while retaining valuable discriminative information. For a particular choice of a distance metric on the Grassmannian, it is possible to generate topological signals that capture changes in the scene after a chemical release.
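
As a companion to the Grassmannian machinery used in this abstract and several that follow, here is a small NumPy sketch (an editorial illustration, not code from the dissertation) of principal angles between two subspaces, together with the chordal distance and the smallest-principal-angle pseudometric mentioned above.

```python
# Illustrative sketch: principal angles between two subspaces of R^n represented by
# orthonormal basis matrices X (n x k) and Y (n x k), plus the chordal distance and
# the smallest-principal-angle pseudometric discussed above.
import numpy as np

def principal_angles(X, Y):
    # Singular values of X^T Y are the cosines of the principal angles.
    sigma = np.linalg.svd(X.T @ Y, compute_uv=False)
    return np.arccos(np.clip(sigma, -1.0, 1.0))

def chordal_distance(X, Y):
    return np.sqrt(np.sum(np.sin(principal_angles(X, Y)) ** 2))

def smallest_angle_pseudometric(X, Y):
    return np.min(principal_angles(X, Y))

# Example: orthonormal bases obtained from QR factorizations of random matrices.
rng = np.random.default_rng(0)
X, _ = np.linalg.qr(rng.standard_normal((10, 3)))
Y, _ = np.linalg.qr(rng.standard_normal((10, 3)))
print(chordal_distance(X, Y), smallest_angle_pseudometric(X, Y))
```
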

Item Open Access
An adaptation of K-means-type algorithms to the Grassmann manifold (Colorado State University. Libraries, 2019)
Stiverson, Shannon J., author; Kirby, Michael, advisor; Adams, Henry, committee member; Ben-Hur, Asa, committee member
The Grassmann manifold provides a robust framework for analysis of high-dimensional data through the use of subspaces. Treating data as subspaces allows for separability between data classes that is not otherwise achieved in Euclidean space, particularly with the use of the smallest principal angle pseudometric. Clustering algorithms focus on identifying similarities within data and highlighting the underlying structure. To exploit the properties of the Grassmannian for unsupervised data analysis, two variations of the popular K-means algorithm are adapted to perform clustering directly on the manifold. We provide the theoretical foundations needed for computations on the Grassmann manifold and detailed derivations of the key equations. Both algorithms are then thoroughly tested on toy data and two benchmark data sets from machine learning: the MNIST handwritten digit database and the AVIRIS Indian Pines hyperspectral data. Performance of the algorithms is tested on manifolds of varying dimension. Unsupervised classification results on the benchmark data are compared to those currently found in the literature.

Item Open Access
Analysis of wheat spike characteristics using image analysis, machine learning, and genomics (Colorado State University. Libraries, 2022)
Hammers, Mikayla, author; Mason, Esten, advisor; Ben-Hur, Asa, committee member; Mueller, Nathan, committee member; Rhodes, Davina, committee member
Understanding the genetics regulating yield component and spike traits can contribute to the development of new wheat cultivars. The flowering pathway in wheat is not entirely known, but spike architecture and its relationship with yield component traits could provide valuable information for crop improvement. Spikelets spike-1 (SPS) has previously been positively associated with kernel number spike (KNS) and negatively correlated with thousand kernel weight, meaning a further understanding of SPS could help unlock full yield potential. While genomics research has improved efficiency over time with the development of techniques such as genotyping by sequencing (GBS), phenotyping remains a labor- and time-intensive process, limiting the amount of phenomic data available for research. Recently, there has been more interest in generating high-throughput methods for collecting and analyzing phenotypic data. Imaging is a cheap and easily reproducible way to collect data at a specific maturity point or over time, and is a promising candidate for implementing deep learning algorithms to extract traits of interest. For this study, a population of 594 soft red winter wheat (SRWW) inbred lines was evaluated for wheat spike characteristics over two years. Images of wheat spikes were taken in a controlled environment and used to train deep learning algorithms to count SPS. A total of 12,717 images were prepared for analysis and used to train, test, and validate a basic classification and regression convolutional neural network (CNN), as well as a VGG16 and VGG19 regression model. Classification had a low accuracy and did not allow for an assessment of error margins. Regression models were more accurate. Of the regression models, VGG16 had the lowest mean absolute error (MAE = 1.09) and mean squared error (MSE = 2.08), and the highest coefficient of determination (R2 = 0.53), meaning it had the best fit of all models. The basic CNN was the next best-fitting model (MAE = 1.27, MSE = 2.61, r = 0.48), followed by the VGG19 (MAE = 1.32, MSE = 2.98, r = 0.45). With an average error of just above one spikelet, it is possible that counting methods could provide enough data with an accuracy high enough for use in statistical analyses such as genome wide association studies (GWAS) or genomic selection (GS). A GWAS was used to identify markers associated with SPS and yield component traits, while demonstrating the use of GS for prediction and screening of individuals across multiple breeding programs. The GWAS results indicated similar markers and genotypic regions underpinning both KNS and SPS on chromosome 6A, and spike length and SPS on chromosome 7A. It was observed that favorable alleles at each locus were associated with higher KNS and SPS on chromosome 6A and longer wheat spikes with higher SPS on chromosome 7A. Significant markers on 7A were observed in the region near WAPO1, the causal gene for SPS on the long arm of chromosome 7A, indicating they could be associated with that gene. GS results showed promise for whole genome selection, with the lowest prediction accuracy observed for heading date (rgs = 0.30) and the highest for spike area (rgs = 0.62). SPS showed prediction accuracies ranging from 0.33 to 0.42, high enough to aid in the selection process. These results indicate that knowledge of the flowering pathway and wheat spike architecture and how it relates to yield components could be beneficial for making selections and increasing grain yield.
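
The image-regression setup described in the abstract above can be sketched as a VGG16 backbone with a single-output regression head; the Keras code below is a hypothetical illustration only, with image size, optimizer, and training details as assumptions rather than the study's actual configuration.

```python
# Hypothetical sketch of a VGG16-based regression model for counting spikelets per
# spike from images; not the study's actual architecture or hyperparameters.
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import VGG16

def build_sps_regressor(input_shape=(224, 224, 3)):
    backbone = VGG16(include_top=False, weights="imagenet", input_shape=input_shape)
    x = layers.GlobalAveragePooling2D()(backbone.output)
    x = layers.Dense(256, activation="relu")(x)
    out = layers.Dense(1, activation="linear")(x)  # predicted spikelet count
    model = Model(backbone.input, out)
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])
    return model

# model = build_sps_regressor()
# model.fit(train_images, train_counts, validation_data=(val_images, val_counts), epochs=30)
```
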

Item Open Access
aPPRove: an HMM-based method for accurate prediction of RNA-pentatricopeptide repeat protein binding events (Colorado State University. Libraries, 2015)
Harrison, Thomas, author; Boucher, Christina, advisor; Ben-Hur, Asa, committee member; Sloan, Daniel, committee member
Pentatricopeptide repeat containing proteins (PPRs) bind to RNA transcripts originating from mitochondria and plastids. There are two classes of PPR proteins. The P class contains tandem P-type motif sequences, and the PLS class contains alternating P, L and S type motif sequences. In this paper, we describe a novel tool that predicts PPR-RNA interaction; specifically, our method, which we call aPPRove, determines where and how a PLS-class PPR protein will bind to RNA when given a PPR and one or more RNA transcripts, by using a combinatorial binding code for site specificity proposed by Barkan et al. [1]. Our results demonstrate that aPPRove successfully locates how and where a PPR protein belonging to the PLS class can bind to RNA. For each binding event it outputs the binding site, the amino-acid-nucleotide interaction, and its statistical significance. Furthermore, we show that our method can be used to predict binding events for PLS-class proteins using a known edit site and the statistical significance of aligning the PPR protein to that site. In particular, we use our method to make a conjecture regarding a novel binding event between CLB19 and the second intronic region of ycf3. The aPPRove web server can be found at www.cs.colostate.edu/~aPPRove and the software is available at that website for stand-alone usage.

Item Open Access
Automatically detecting task unrelated thoughts during conversations using keystroke analysis (Colorado State University. Libraries, 2022)
Kuvar, Vishal Kiran, author; Blanchard, Nathaniel, advisor; Mills, Caitlin, advisor; Ben-Hur, Asa, committee member; Zhou, Wen, committee member
Task-unrelated thought (TUT), commonly known as the phenomenon of daydreaming or zoning out, is a mental state where a person's attention moves away from the task at hand to self-generated thoughts. This state is extremely common, yet not much is known about it during dyadic interactions. We built a model to detect when a person experiences TUTs while talking to another person through a chat platform, by analyzing their keystroke patterns. This model was able to differentiate between task-unrelated thoughts and task-related thoughts with a kappa of 0.343. This serves as a strong indicator that typing behavior is linked with mental states, task-unrelated thoughts in our case.

Item Open Access
Behavioral complexity analysis of networked systems to identify malware attacks (Colorado State University. Libraries, 2020)
Haefner, Kyle, author; Ray, Indrakshi, advisor; Ben-Hur, Asa, committee member; Gersch, Joe, committee member; Hayne, Stephen, committee member; Ray, Indrajit, committee member
Internet of Things (IoT) environments are often composed of a diverse set of devices that span a broad range of functionality, making them a challenge to secure. This diversity of function leads to a commensurate diversity in network traffic: some devices have simple network footprints and some devices have complex network footprints. This network complexity in a device's traffic provides a differentiator that can be used by the network to distinguish which devices are most effectively managed autonomously and which devices are not. This study proposes an informed autonomous learning method by quantifying the complexity of a device based on historic traffic and applies this complexity metric to build a probabilistic model of the device's normal behavior using a Gaussian Mixture Model (GMM). This method results in an anomaly detection classifier with inlier probability thresholds customized to the complexity of each device without requiring labeled data. The model efficacy is then evaluated using seven common types of real malware traffic and across four device datasets of network traffic: one residential-based, two from labs, and one consisting of commercial automation devices. The results of the analysis of over 100 devices and 800 experiments show that the model leads to highly accurate representations of the devices and a strong correlation between the measured complexity of a device and the accuracy to which its network behavior can be modeled.

Item Open Access
Comparison of EEG preprocessing methods to improve the performance of the P300 speller (Colorado State University. Libraries, 2011)
Cashero, Zachary, author; Anderson, Charles, advisor; Chen, Thomas, advisor; Tobet, Stuart, committee member; Ben-Hur, Asa, committee member
The classification of P300 trials in electroencephalographic (EEG) data is made difficult by the low signal-to-noise ratio (SNR) of the P300 response. To overcome the low SNR of individual trials, it is common practice to average together many consecutive trials, which effectively diminishes the random noise. Unfortunately, when more repeated trials are required for applications such as the P300 speller, the communication rate is greatly reduced. Since the noise results from background brain activity and is inherent to the EEG recording methods, signal analysis techniques like blind source separation (BSS) have the potential to isolate the true source signal from the noise when using multi-channel recordings. This thesis provides a comparison of three BSS algorithms: independent component analysis (ICA), maximum noise fraction (MNF), and principal component analysis (PCA). In addition to this, the effects of adding temporal information to the original data, thereby creating time-delay embedded data, will be analyzed. The BSS methods can utilize this time-delay embedded data to find more complex spatio-temporal filters rather than the purely spatial filters found using the original data. One problem that is intrinsically tied to the application of BSS methods is the selection of the most relevant source components that are returned from each BSS algorithm. In this work, the following feature selection algorithms are adapted to be used for component selection: forward selection, ANOVA-based ranking, Relief, and recursive feature elimination (RFE). The performance metric used for all comparisons is the classification accuracy of P300 trials using a support vector machine (SVM) with a Gaussian kernel. The results show that although both BSS and feature selection algorithms can each yield significant performance gains, there is no added benefit from using both together. Feature selection is most beneficial when applied to a large number of electrodes, and BSS is most beneficial when applied to a smaller set of electrodes. Also, the results show that time-delay embedding is not beneficial for P300 classification.
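
The pipeline shape described in the preceding abstract, blind source separation followed by a Gaussian-kernel SVM on the separated components, can be sketched as below. This is an editorial illustration using scikit-learn's FastICA and SVC; the component count, data layout, and use of ICA (rather than MNF) are assumptions, not the thesis's actual code.

```python
# Illustrative sketch only: ICA-based source separation followed by a Gaussian-kernel
# SVM, mirroring the pipeline shape described above (not the thesis's exact code).
import numpy as np
from sklearn.decomposition import FastICA
from sklearn.svm import SVC

def train_ica_svm(X, y, n_channels, n_samples, n_components=8):
    """X: (n_trials, n_channels * n_samples) flattened EEG trials; y: P300 / non-P300 labels."""
    trials = X.reshape(len(X), n_channels, n_samples)
    # Fit ICA on channel data concatenated across trials (channels act as features).
    ica = FastICA(n_components=n_components, random_state=0)
    concatenated = np.hstack(list(trials)).T          # (n_trials * n_samples, n_channels)
    ica.fit(concatenated)
    # Project each trial onto the learned components and flatten for the SVM.
    features = np.array([ica.transform(t.T).T.ravel() for t in trials])
    clf = SVC(kernel="rbf", gamma="scale").fit(features, y)
    return ica, clf
```
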

Item Open Access
Computational investigation of biological dose-volume outcome predictors in 29 canine nasal tumor patients treated with stereotactic radiation therapy (Colorado State University. Libraries, 2012)
McBeth, Rafe, author; Zhang, Dongqing, advisor; LaRue, Susan, committee member; Custis, James, committee member; Ben-Hur, Asa, committee member
The ability to mathematically model biological response to radiation dose in the tumors of cancer patients is a significant goal for the medical physics community. Although much work has been done in this area, novel treatment approaches are challenging the current knowledge of the radiation biology and oncology communities. In particular, doses five to ten times higher than traditional treatments are prescribed in stereotactic radiation therapy. These new treatment techniques are thought to have different mechanisms that cause cell death in comparison to classical treatments. These extraordinarily high doses are made possible by using advanced imaging, treatment planning, linear accelerator capabilities and immobilization to precisely target cancer while sparing healthy normal tissue. Biologically guided radiation therapy (BGRT) and biologically based treatment planning (BBTP) methods offer the next attractive step forward in radiation therapy. To examine the capabilities of biologically based dose parameters, a mature data set of 29 canine nasal tumor patients was analyzed using the generalized equivalent uniform dose (gEUD) and the dose to a relative volume. Over one hundred individual predictors were inspected, with greater than five thousand individual tests, in search of optimal indicators of patient outcome. Testing showed that high negative gEUD values and the minimum dose to the tumor were highly significant predictors of patient outcome. However, more robust techniques need to be added to the analysis in order to validate these results.
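
For reference, the generalized equivalent uniform dose mentioned above has a standard closed form in the radiotherapy literature, gEUD = (sum_i v_i d_i^a)^(1/a) over dose-volume histogram bins; the small function below illustrates that standard formula and is not taken from the thesis, whose exact computation may differ.

```python
# Standard gEUD formula, included for reference (the thesis's own computation may differ):
# gEUD = (sum_i v_i * d_i**a) ** (1/a), where d_i is the dose delivered to relative volume
# fraction v_i (the v_i sum to 1) and a is a tissue-specific parameter; strongly negative
# a weights the result toward cold spots (minimum dose), strongly positive a toward hot spots.
import numpy as np

def geud(doses, volume_fractions, a):
    doses = np.asarray(doses, dtype=float)
    v = np.asarray(volume_fractions, dtype=float)
    v = v / v.sum()                      # normalize to relative volumes
    return (np.sum(v * doses ** a)) ** (1.0 / a)

# Example: a two-bin dose-volume histogram evaluated with a strongly negative 'a'
# is pulled toward the minimum dose of 50 Gy.
print(geud([50.0, 60.0], [0.3, 0.7], a=-20))
```
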

Item Open Access
Convex and non-convex optimization using centroid-encoding for visualization, classification, and feature selection (Colorado State University. Libraries, 2022)
Ghosh, Tomojit, author; Kirby, Michael, advisor; Anderson, Charles, committee member; Ben-Hur, Asa, committee member; Adams, Henry, committee member
Classification, visualization, and feature selection are the three essential tasks of machine learning. This Ph.D. dissertation presents convex and non-convex models suitable for these three tasks. We propose Centroid-Encoder (CE), an autoencoder-based supervised tool for visualizing complex, potentially large (e.g., SUSY, with 5 million samples) and high-dimensional (e.g., the GSE73072 clinical challenge data) datasets. Unlike an autoencoder, which maps a point to itself, a centroid-encoder has a modified target, i.e., the class centroid in the ambient space. We present a detailed comparative analysis of the method using various data sets and state-of-the-art techniques. We have proposed a variation of the centroid-encoder, Bottleneck Centroid-Encoder (BCE), where additional constraints are imposed at the bottleneck layer to improve generalization performance in the reduced space. We further developed a sparse optimization problem for the non-linear mapping of the centroid-encoder, called Sparse Centroid-Encoder (SCE), to determine the set of discriminative features between two or more classes. The sparse model selects variables using the 1-norm applied to the input feature space. SCE extracts discriminative features from multi-modal data sets, i.e., data whose classes appear to have multiple clusters, by using several centers per class. This approach seems to have advantages over models which use a one-hot-encoding vector. We also provide a feature selection framework that first ranks each feature by its occurrence, and the optimal number of features is chosen using a validation set. CE and SCE are models based on neural network architectures and require the solution of non-convex optimization problems. Motivated by the CE algorithm, we have developed a convex optimization for the supervised dimensionality reduction technique called Centroid Component Retrieval (CCR). The CCR model optimizes a multi-objective cost by balancing two complementary terms. The first term pulls the samples of a class towards its centroid by minimizing a sample's distance from its class centroid in low-dimensional space. The second term pushes the classes apart by maximizing the scattering volume of the ellipsoid formed by the class centroids in embedded space. Although the design principle of CCR is similar to LDA, our experimental results show that CCR exhibits performance advantages over LDA, especially on high-dimensional data sets, e.g., Yale Faces, ORL, and COIL20. Finally, we present a linear formulation of Centroid-Encoder with orthogonality constraints, called Principal Centroid Component Analysis (PCCA). This formulation is similar to PCA, except that the class labels are used to formulate the objective, resulting in a form of supervised PCA. We show classification and visualization results with this new linear tool.
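
The core centroid-encoder idea described above, an autoencoder trained so that each input reconstructs its class centroid rather than itself, can be sketched in a few lines of PyTorch. The layer sizes, loss, and training loop below are illustrative assumptions and not the dissertation's configuration.

```python
# Minimal sketch of the centroid-encoder idea (assumed architecture, not the
# dissertation's code): an autoencoder trained to map each sample to its class centroid.
import torch
import torch.nn as nn

class CentroidEncoder(nn.Module):
    def __init__(self, d_in, d_hidden=64, d_bottleneck=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(),
                                     nn.Linear(d_hidden, d_bottleneck))
        self.decoder = nn.Sequential(nn.Linear(d_bottleneck, d_hidden), nn.ReLU(),
                                     nn.Linear(d_hidden, d_in))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def train_step(model, optimizer, x, y):
    # Target is the centroid of each sample's class in the ambient (input) space.
    # Assumes integer labels 0..C-1 so centroids can be indexed directly by y.
    centroids = torch.stack([x[y == c].mean(dim=0) for c in torch.unique(y)])
    targets = centroids[y]
    loss = nn.functional.mse_loss(model(x), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```
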

Item Open Access
Convolutional neural networks for EEG signal classification in asynchronous brain-computer interfaces (Colorado State University. Libraries, 2019)
Forney, Elliott M., author; Anderson, Charles, advisor; Ben-Hur, Asa, committee member; Kirby, Michael, committee member; Rojas, Donald, committee member
Brain-Computer Interfaces (BCIs) are emerging technologies that enable users to interact with computerized devices using only voluntary changes in their mental state. BCIs have a number of important applications, especially in the development of assistive technologies for people with motor impairments. Asynchronous BCIs are systems that aim to establish smooth, continuous control of devices like mouse cursors, electric wheelchairs and robotic prostheses without requiring the user to interact with time-locked external stimuli. Scalp-recorded Electroencephalography (EEG) is a noninvasive approach for measuring brain activity that shows considerable potential for use in BCIs. Inferring a user's intent from spontaneously produced EEG signals remains a challenging problem, however, and generally requires specialized machine learning and signal processing methods. Current approaches typically involve guided preprocessing and feature generation procedures used in combination with carefully regularized, often linear, classification algorithms. The current trend in machine learning, however, is to move away from approaches that rely on feature engineering in favor of multilayer (deep) artificial neural networks that rely on few prior assumptions and are capable of automatically learning hierarchical, multiscale representations. Along these lines, we propose several variants of the Convolutional Neural Network (CNN) architecture that are specifically designed for classifying EEG signals in asynchronous BCIs. These networks perform convolutions across time with dense connectivity across channels, which allows them to capture spatiotemporal patterns while achieving time invariance. Class labels are assigned using linear readout layers with label aggregation in order to reduce susceptibility to overfitting and to allow for continuous control. We also utilize transfer learning in order to reduce overfitting and leverage patterns that are common across individuals. We show that these networks are multilayer generalizations of Time-Delay Neural Networks (TDNNs) and that the convolutional units in these networks can be interpreted as learned, multivariate, nonlinear, finite impulse-response filters. We perform a series of offline experiments using EEG data recorded during four imagined mental tasks: silently count backward from 100 by 3's, imagine making a left-handed fist, visualize a rotating cube and silently sing a favorite song. Data were collected using a portable, eight-channel EEG system from 10 participants with no impairments in a laboratory setting and four participants with motor impairments in their home environments. Experimental results demonstrate that our proposed CNNs consistently outperform baseline classifiers that utilize power-spectral densities. Transfer learning yields an additional performance improvement, but only when used in combination with multilayer networks. Our final test results achieve a mean classification accuracy of 57.86%, which is 8.57% higher than the 49.29% achieved by our baseline classifiers. In terms of information transfer rates, our proposed methods achieve a mean of 15.82 bits per minute while our baseline methods achieve 9.35 bits per minute. For two individuals, our CNNs achieve a classification accuracy of 90.00%, which is 10-20% higher than our baseline methods. A comparison with external studies suggests that these results are on par with the state-of-the-art, despite our relatively rigorous experimental design. We also perform a number of experiments that analyze the types of patterns our classifiers learn to utilize. This includes a detailed analysis of aggregate power-spectral densities, examining the layer-wise activations produced by our CNNs, extracting the frequency responses of convolutional layers using Fourier analysis and finding optimized input sequences for trained networks. These analyses highlight several ways that the patterns our methods learn to utilize are related to known patterns that occur in EEG signals while also creating new questions about some types of patterns, including high-frequency information. Examining the behavior of our CNNs also provides insights into the inner workings of these networks and demonstrates that they are, in fact, learning to form hierarchical, multiscale representations of EEG signals.
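
The architectural family described above, convolutions across time with dense connectivity across channels and a linear readout whose per-timestep outputs are aggregated, can be sketched as follows. Layer widths, activations, and the four-class setup are illustrative assumptions, not the dissertation's network.

```python
# Hypothetical sketch of the architecture family described above: 1D convolutions
# across time over EEG channels, followed by a linear readout whose per-timestep
# class scores are aggregated (averaged) over the segment.
import torch
import torch.nn as nn

class EEGConvNet(nn.Module):
    def __init__(self, n_channels=8, n_classes=4, width=16, kernel=9):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_channels, width, kernel_size=kernel, padding=kernel // 2),
            nn.Tanh(),
            nn.Conv1d(width, width, kernel_size=kernel, padding=kernel // 2),
            nn.Tanh(),
        )
        self.readout = nn.Conv1d(width, n_classes, kernel_size=1)  # linear readout per timestep

    def forward(self, x):                              # x: (batch, channels, time)
        scores = self.readout(self.features(x))        # (batch, classes, time)
        return scores.mean(dim=-1)                     # aggregate labels over time

# logits = EEGConvNet()(torch.randn(2, 8, 512))        # two segments of 512 samples
```
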

Item Open Access
EEG subspace analysis and classification using principal angles for brain-computer interfaces (Colorado State University. Libraries, 2015)
Ashari, Rehab Bahaaddin, author; Anderson, Charles W., advisor; Ben-Hur, Asa, committee member; Draper, Bruce, committee member; Peterson, Chris, committee member
Brain-Computer Interfaces (BCIs) help paralyzed people who have lost some or all of their ability to communicate and control the outside environment due to loss of voluntary muscle control. Most BCIs are based on the classification of multichannel electroencephalography (EEG) signals recorded from users as they respond to external stimuli or perform various mental activities. The classification process is fraught with difficulties caused by electrical noise, signal artifacts, and nonstationarity. One approach to reducing the effects of similar difficulties in other domains is the use of principal angles between subspaces, which has been applied mostly to video sequences. This dissertation studies and examines different ideas using principal angle and subspace concepts. It introduces a novel mathematical approach for comparing sets of EEG signals for use in new BCI technology. The presented results show that principal angles are also a useful approach to the classification of EEG signals that are recorded during a BCI typing application. In this application, the appearance of a subject's desired letter is detected by identifying a P300 wave within a one-second window of EEG following the flash of a letter. Smoothing the signals before using them is the only preprocessing step that was implemented in this study. The smoothing process, based on minimizing the second derivative in time, is implemented to increase the classification accuracy instead of using a bandpass filter that relies on assumptions about the frequency content of EEG. This study examines four different ways of removing outliers that are based on the principal angles and shows that the outlier removal methods did not help in the presented situations. One of the concepts that this dissertation focuses on is the effect of the number of trials on the classification accuracies. The achievement of good classification results using a small number of trials, starting from as few as two trials, should make this approach more appropriate for online BCI applications. In order to understand and test how EEG signals differ from one subject to another, different users are tested in this dissertation, some with motor impairments. Furthermore, the concept of transferring information between subjects is examined by training the approach on one subject and testing it on another subject, using the training subject's EEG subspaces to classify the testing subject's trials.

Item Open Access
Electroencephalogram classification by forecasting with recurrent neural networks (Colorado State University. Libraries, 2011)
Forney, Elliott M., author; Anderson, Charles, advisor; Ben-Hur, Asa, committee member; Gavin, William, committee member
The ability to effectively classify electroencephalograms (EEG) is the foundation for building usable Brain-Computer Interfaces as well as improving the performance of EEG analysis software used in clinical and research settings. Although a number of research groups have demonstrated the feasibility of EEG classification, these methods have not yet reached a level of performance that is acceptable for use in many practical applications. We assert that current approaches are limited by their ability to capture the temporal and spatial patterns contained within EEG. In order to address these problems, we propose a new generative technique for EEG classification that uses Elman Recurrent Neural Networks. EEG recorded while a subject performs one of several imagined mental tasks is first modeled by training a network to forecast the signal a single step ahead in time. We show that these models are able to forecast EEG with an error as low as 1.18 percent of the signal range. A separate model is then trained over EEG belonging to each class. Classification of previously unseen data is performed by applying each model and using Winner-Takes-All, Linear Discriminant Analysis or Quadratic Discriminant Analysis to label the forecasting errors. This approach is tested on EEG collected from two able-bodied subjects and three subjects with disabilities. Information transfer rates as high as 38.7 bits per minute (bpm) are achieved for a two-task problem and 34.5 bpm for a four-task problem.
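
The forecasting-based classification scheme just described, one forecaster per mental task with new data labeled by whichever model forecasts it best, can be expressed schematically as below. The `predict` interface of the per-class forecaster is an assumption for illustration; this is not the thesis's Elman-network implementation, and only the simplest winner-takes-all labeling rule is shown.

```python
# Schematic sketch of classification by forecasting error (winner-takes-all),
# assuming a generic one-step forecaster object per class.
import numpy as np

def forecast_error(model, signal):
    """Mean squared one-step-ahead forecasting error of `model` on a (time, channels) signal."""
    preds = np.array([model.predict(signal[:t]) for t in range(1, len(signal))])
    return np.mean((preds - signal[1:]) ** 2)

def classify(signal, class_models):
    """Label a new segment with the class whose forecaster reproduces it best."""
    errors = {label: forecast_error(m, signal) for label, m in class_models.items()}
    return min(errors, key=errors.get)
```
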

Item Open Access
Generative topographic mapping of electroencephalography (EEG) data (Colorado State University. Libraries, 2014)
Dantanarayana, Navini, author; Anderson, Charles, advisor; Ben-Hur, Asa, committee member; Davies, Patricia, committee member
Generative Topographic Mapping (GTM) assumes that the features of high-dimensional data can be described by a few variables (usually 1 or 2). Based on this assumption, the GTM trains unsupervised on the high-dimensional data to find these variables, from which the features can be generated. The variables can be used to represent and visualize the original data in a low-dimensional space. Here, we have applied the GTM algorithm to Electroencephalography (EEG) signals in order to find a two-dimensional representation for them. The 2-D representation can also be used to classify the EEG signals with P300 waves, an Event Related Potential (ERP) that occurs when the subject identifies a rare but expected stimulus. Furthermore, the unsupervised feature learning capability of the GTM algorithm is investigated by providing EEG signals of different subjects and protocols. The results indicate that the algorithm successfully captures the feature variations in the data when generating the 2-D representation, and can therefore be efficiently used as a powerful data visualization and analysis tool.

Item Open Access
In vitro and in vivo studies on pre-mRNA splicing in plants (Colorado State University. Libraries, 2017)
Albaqami, Mohammed M., author; Reddy, A. S. N., advisor; Wilusz, Jeffrey, committee member; Ben-Hur, Asa, committee member; Montgomery, Tai, committee member
To view the abstract, please see the full text of the document.

Item Open Access
Leveraging ensembles: balancing timeliness and accuracy for model training over voluminous datasets (Colorado State University. Libraries, 2020)
Budgaga, Walid, author; Pallickara, Shrideep, advisor; Pallickara, Sangmi Lee, advisor; Ben-Hur, Asa, committee member; Breidt, F. Jay, committee member
As data volumes increase, there is a pressing need to make sense of the data in a timely fashion. Voluminous datasets are often high dimensional, with individual data points representing a vector of features. Data scientists fit models to the data, using all features or a subset thereof, and then use these models to inform their understanding of phenomena or make predictions. The performance of these analytical models is assessed based on their accuracy and ability to generalize on unseen data. Several existing frameworks can be used for drawing insights from voluminous datasets. However, there are some inefficiencies associated with these frameworks, including limited scalability, limited applicability beyond a target domain, prolonged training times, poor resource utilization, and insufficient support for combining diverse model fitting algorithms. In this dissertation, we describe our methodology for scalable supervised learning over voluminous datasets. The methodology explores the impact of partitioning the feature space, building models over these partitioned subsets of the data, and their impact on training times and accuracy. Using our methodology, a practitioner can harness a mix of learning methods to build diverse models over the partitioned data. Rather than build a single, all-encompassing model, we construct an ensemble of models trained independently over different portions of the dataset. In particular, we rely on concurrent and independent learning from different portions of the data space to overcome the issues relating to resource utilization and completion times associated with distributed training of a single model over the entire dataset. Our empirical benchmarks are performed using datasets from diverse domains, including epidemiology, music, and weather. These benchmarks demonstrate the suitability of our methodology for reducing training times while preserving accuracy, in contrast to results obtained from a complex model trained on the entire dataset. In particular, our methodology utilizes resources effectively by amortizing I/O and CPU costs by relying on a distributed environment while ensuring a significant reduction of network traffic during training.

Item Open Access
On the use of locality aware distributed hash tables for homology searches over voluminous biological sequence data (Colorado State University. Libraries, 2015)
Tolooee, Cameron, author; Pallickara, Sangmi, advisor; Ben-Hur, Asa, committee member; von Fischer, Joseph, committee member
Rapid advances in genomic sequencing technology have resulted in a data deluge in biology and bioinformatics. This increase in data volumes has introduced computational challenges for frequently performed sequence analytics routines, such as DNA and protein homology searches, which must also preferably be done in real time. This thesis proposes a scalable and similarity-aware distributed storage framework, Mendel, that enables retrieval of biologically significant DNA and protein alignments against a voluminous genomic sequence database. Mendel fragments the sequence data and generates an inverted index, which is then dispersed over a distributed collection of machines using a locality-aware distributed hash table. A novel distributed nearest neighbor search algorithm identifies sequence segments with high similarity and splices them together to form an alignment. This paper includes an empirical evaluation of the performance, sensitivity, and scalability of the proposed system over the NCBI's non-redundant protein dataset. In these benchmarks, Mendel demonstrates higher sensitivity and faster query evaluations when compared to other modern frameworks.

Item Open Access
P300 classification using deep belief nets (Colorado State University. Libraries, 2014)
Sobhani, Amin, author; Anderson, Charles, advisor; Ben-Hur, Asa, committee member; Peterson, Chris, committee member
Electroencephalogram (EEG) is a measure of the electrical activity of the brain. One of the most important EEG paradigms that has been explored in BCI systems is the P300 signal. The P300 wave is an endogenous event-related potential which can be captured during the process of decision making as a subject reacts to a stimulus. One way to detect the P300 signal is to show a subject two types of visual stimuli occurring at different rates. The event occurring less frequently than the other elicits a positive signal component with a latency of roughly 250-500 ms. P300 detection has many applications in the BCI field. One of the most common applications of P300 detection is the P300 speller, which enables users to type letters on the screen. Machine learning algorithms play a crucial role in designing a BCI system. One important purpose of using machine learning algorithms in BCI systems is the classification of EEG signals. In order to translate EEG signals to a control signal, BCI systems should first capture the pattern of EEG signals and discriminate them into different command categories. This is usually done using different machine-learning-based classifiers. In the past, different linear and nonlinear methods have been used to discriminate P300 signals from non-P300 signals. This thesis provides the first attempt to implement and examine the performance of Deep Belief Networks (DBNs) to model P300 data for classification. The highest classification accuracy we achieved with DBN is 97 percent for testing trials. In our experiments, we used EEG data collected by the BCI lab at Colorado State University on both healthy and disabled subjects.

Item Open Access
Persistence and simplicial metric thickenings (Colorado State University. Libraries, 2024)
Moy, Michael, author; Adams, Henry, advisor; Patel, Amit, committee member; Peterson, Christopher, committee member; Ben-Hur, Asa, committee member
This dissertation examines the theory of one-dimensional persistence with an emphasis on simplicial metric thickenings and studies two particular filtrations of simplicial metric thickenings in detail. It gives self-contained proofs of foundational results on one-parameter persistence modules of vector spaces, including interval decomposability, existence of persistence diagrams and barcodes, and the isometry theorem. These results are applied to prove the stability of persistent homology for sublevel set filtrations, simplicial complexes, and simplicial metric thickenings. The filtrations of simplicial metric thickenings studied in detail are the Vietoris–Rips and anti-Vietoris–Rips metric thickenings of the circle. The study of the Vietoris–Rips metric thickenings is motivated by persistent homology and its use in applied topology, and it builds on previous work on their simplicial complex counterparts. On the other hand, the study of the anti-Vietoris–Rips metric thickenings is motivated by their connections to graph colorings. In both cases, the homotopy types of these spaces are shown to be odd-dimensional spheres, with dimensions depending on the scale parameters.

Item Open Access
Persistence stability for metric thickenings (Colorado State University. Libraries, 2021)
Moy, Michael, author; Adams, Henry, advisor; King, Emily, committee member; Ben-Hur, Asa, committee member
Persistent homology often begins with a filtered simplicial complex, such as the Vietoris–Rips complex or the Čech complex, the vertex set of which is a metric space. An important result, the stability of persistent homology, shows that for certain types of filtered simplicial complexes, two metric spaces that are close in the Gromov–Hausdorff distance result in persistence diagrams that are close in the bottleneck distance. The recent interest in persistent homology has motivated work to better understand the homotopy types and persistent homology of these commonly used simplicial complexes. This has led to the definition of metric thickenings, which agree with simplicial complexes for finite vertex sets but may have different topologies for infinite vertex sets. We prove that Vietoris–Rips metric thickenings and Čech metric thickenings have the same persistence diagrams as their corresponding simplicial complexes for all totally bounded metric spaces. This immediately implies the stability of persistent homology for these metric thickenings.
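
For readers unfamiliar with the pipeline the last two abstracts build on, the following is a minimal sketch of computing Vietoris–Rips persistence diagrams for two finite samples and comparing them with the bottleneck distance. It assumes the third-party ripser.py and persim packages purely for illustration; these are not tools used in the dissertations.

```python
# Illustrative sketch (not from the dissertations): Vietoris-Rips persistence diagrams
# for two noisy samples of a circle, compared with the bottleneck distance.
# Requires the ripser and persim packages.
import numpy as np
from ripser import ripser
from persim import bottleneck

def circle_sample(n, noise=0.05, seed=0):
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0, 2 * np.pi, n)
    pts = np.c_[np.cos(theta), np.sin(theta)]
    return pts + rng.normal(scale=noise, size=pts.shape)

dgm_a = ripser(circle_sample(100, seed=0), maxdim=1)["dgms"][1]  # H1 diagram
dgm_b = ripser(circle_sample(100, seed=1), maxdim=1)["dgms"][1]
# Nearby point clouds (small Gromov-Hausdorff distance) yield nearby diagrams (stability).
print("bottleneck distance:", bottleneck(dgm_a, dgm_b))
```
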