Browsing by Author "Pallickara, Sangmi Lee, advisor"
Now showing 1 - 17 of 17
Item Open Access A framework for real-time, autonomous anomaly detection over voluminous time-series geospatial data streams (Colorado State University. Libraries, 2014)
Budgaga, Walid, author; Pallickara, Shrideep, advisor; Pallickara, Sangmi Lee, advisor; Ben-Hur, Asa, committee member; Schumacher, Russ, committee member
In this research work we present an approach encompassing both algorithm and system design to detect anomalies in data streams. Individual observations within these streams are multidimensional, with each dimension corresponding to a feature of interest. We consider time-series geospatial datasets generated by remote and in situ observational devices. Three aspects make this problem particularly challenging: (1) the cumulative volume and rates of data arrivals, (2) the evolution of anomalies over time, and (3) the spatio-temporal correlations associated with the data. Therefore, anomaly detection must be accurate and performed in real time. Given the data volumes involved, solutions must minimize user intervention and be amenable to distributed processing to ensure scalability. Our approach achieves accurate, high-throughput classifications in real time. We rely on Expectation Maximization (EM) to build Gaussian Mixture Models (GMMs) that model the densities of the training data. Rather than one all-encompassing model, our approach involves multiple model instances, each of which is responsible for a particular geographical extent and can also adapt as data evolves. We have incorporated these algorithms into our distributed storage platform, Galileo, and profiled their suitability through empirical analysis, which demonstrates high throughput (10,000 observations per second, per node) and low latency on real-world datasets.
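The item above names its core algorithmic ingredients: EM-fitted Gaussian Mixture Models, one per geographical extent, used to score incoming observations. The following is a minimal sketch of that idea using scikit-learn's GaussianMixture rather than the authors' Galileo implementation; the region keys, feature layout, and percentile-based threshold are assumptions made for illustration.

```python
# Minimal sketch (not the Galileo code): fit one GMM per geographic extent with EM,
# then flag observations whose log-likelihood under that region's model falls below
# a threshold learned from the training data.
from collections import defaultdict
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_region_models(training_obs, n_components=4, pct=1.0):
    """training_obs: iterable of (region_key, feature_vector) pairs."""
    by_region = defaultdict(list)
    for region, features in training_obs:
        by_region[region].append(features)
    models, thresholds = {}, {}
    for region, rows in by_region.items():
        X = np.asarray(rows)
        gmm = GaussianMixture(n_components=n_components, covariance_type="full").fit(X)
        models[region] = gmm
        # Observations scoring below the bottom `pct` percent of training
        # likelihoods are treated as anomalous for this region.
        thresholds[region] = np.percentile(gmm.score_samples(X), pct)
    return models, thresholds

def is_anomaly(models, thresholds, region, features):
    score = models[region].score_samples(np.asarray([features]))[0]
    return score < thresholds[region]
```

In a streaming deployment, each region's model would be refit or updated as the data distribution drifts, which is the adaptation the abstract refers to.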
Item Open Access A framework for resource efficient profiling of spatial model performance (Colorado State University. Libraries, 2022)
Carlson, Caleb, author; Pallickara, Shrideep, advisor; Pallickara, Sangmi Lee, advisor; Adams, Henry, committee member
We design models to understand phenomena, make predictions, and/or inform decision-making. This study targets models that encapsulate spatially evolving phenomena. Given a model M, our objective is to identify how well the model predicts across all geospatial extents. A modeler may expect these validations to occur at varying spatial resolutions (e.g., states, counties, towns, census tracts). Assessing a model with all available ground-truth data is infeasible due to the data volumes involved. We propose a framework to assess the performance of models at scale over diverse spatial data collections. Our methodology ensures orchestration of validation workloads while reducing memory strain, alleviating contention, enabling concurrency, and ensuring high throughput. We introduce the notion of a validation budget that represents an upper bound on the total number of observations used to assess the performance of models across spatial extents. The validation budget attempts to capture the distribution characteristics of observations and is informed by multiple sampling strategies. Our design decouples validation from the underlying model-fitting libraries so that it can interoperate with models designed using different libraries and analytical engines; our advanced research prototype currently supports Scikit-learn, PyTorch, and TensorFlow. We have conducted extensive benchmarks that demonstrate the suitability of our methodology.

Item Open Access A locality-aware scientific workflow engine for fast-evolving spatiotemporal sensor data (Colorado State University. Libraries, 2017)
Kachikaran Arulswamy, Johnson Charles, author; Pallickara, Sangmi Lee, advisor; Pallickara, Shrideep, committee member; von Fischer, Joseph, committee member
Discerning knowledge from voluminous data involves a series of data manipulation steps. Scientists typically compose and execute workflows for these steps using scientific workflow management systems (SWfMSs). SWfMSs have been developed for several research communities, including but not limited to bioinformatics, biology, astronomy, computational science, and physics. Parallel execution of workflows has been widely employed in SWfMSs by exploiting the storage and computing resources of grid and cloud services. However, none of these systems have been tailored for the needs of spatiotemporal analytics on real-time sensor data with high arrival rates. This thesis demonstrates the development and evaluation of a target-oriented workflow model that enables a user to specify dependencies among the workflow components, including data availability. The underlying spatiotemporal data dispersion and indexing scheme provides fast data search and retrieval to plan and execute computations comprising the workflow. This work includes a scheduling algorithm that targets minimizing data movement across machines while ensuring fair and efficient resource allocation among multiple users. The study includes empirical evaluations performed on the Google cloud.

Item Open Access A questionnaire integration system based on question classification and short text semantic textual similarity (Colorado State University. Libraries, 2018)
Qiu, Yu, author; Pallickara, Sangmi Lee, advisor; Pallickara, Shrideep, committee member; Li, Kaigang, committee member
Semantic integration from heterogeneous sources involves a series of NLP tasks. Existing research has focused mainly on measuring similarity between paired sentences. However, to find possible identical texts between two datasets, the sentences are not paired. To avoid exhaustive pair-wise comparison, this thesis proposes a semantic similarity measuring system equipped with a precategorization module. It applies a hybrid question classification module, which subdivides all texts into coarse categories. The sentences are then paired from within these subcategories. The core task is to detect whether two sentences are identical in meaning, which corresponds to the semantic textual similarity task in the NLP field. We built a short text semantic textual similarity measuring module. It combined conventional NLP techniques, including both semantic and syntactic features, with a Recurrent Convolutional Neural Network to produce an ensemble model. We also conducted a set of empirical evaluations. The results show that our system possesses a degree of generalization ability and performs well on heterogeneous sources.
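The questionnaire-integration item above combines coarse question classification with a short-text similarity model. As a rough stand-in for that pipeline (the thesis uses a hybrid classifier and an RCNN-based ensemble, not the keyword rules and TF-IDF cosine similarity shown here), the sketch below only compares sentences that fall into the same coarse category, which is the point of the precategorization step.

```python
# Illustrative stand-in: pre-categorize questions, then score similarity only within
# matching categories to avoid exhaustive pair-wise comparison. Categories, keywords,
# and the threshold are assumptions for the example.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def coarse_category(question):
    q = question.lower()
    if q.startswith(("how often", "how many", "how much")):
        return "quantity"
    if q.startswith(("do you", "have you", "are you")):
        return "yes_no"
    return "other"

def match_questionnaires(set_a, set_b, threshold=0.6):
    """Return likely-identical question pairs from two questionnaires."""
    vec = TfidfVectorizer().fit(set_a + set_b)
    matches = []
    for a in set_a:
        for b in set_b:
            if coarse_category(a) != coarse_category(b):
                continue  # skip cross-category pairs entirely
            sim = cosine_similarity(vec.transform([a]), vec.transform([b]))[0, 0]
            if sim >= threshold:
                matches.append((a, b, sim))
    return matches
```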
Item Open Access Adaptive spatiotemporal data integration using distributed query relaxation over heterogeneous observational datasets (Colorado State University. Libraries, 2018)
Mitra, Saptashwa, author; Pallickara, Sangmi Lee, advisor; Pallickara, Shrideep, committee member; Li, Kaigang, committee member
Combining data from disparate sources enhances the opportunity to explore different aspects of the phenomena under consideration. However, there are several challenges in doing so effectively, including, inter alia, heterogeneity in data representation and format, differing collection patterns, and integration of foreign data attributes in a ready-to-use condition. In this study, we propose a scalable query-oriented data integration framework that provides estimations for spatiotemporally aligned data points. We have designed Confluence, a distributed data integration framework that dynamically generates accurate interpolations for the targeted spatiotemporal scopes along with an estimate of the uncertainty involved in such estimation. Confluence orchestrates computations to evaluate spatial and temporal query joins and to interpolate values. Our methodology facilitates distributed query evaluations with a dynamic relaxation of query constraints. Query evaluations are locality-aware, and we leverage model-based dynamic parameter selection to provide accurate estimation for data points. We have included empirical benchmarks that profile the suitability of our approach in terms of accuracy, latency, and throughput at scale.

Item Open Access Aperture: a system for interactive visualization of voluminous geospatial data (Colorado State University. Libraries, 2020)
Bruhwiler, Kevin, author; Pallickara, Shrideep, advisor; Pallickara, Sangmi Lee, advisor; Ghosh, Sudipto, committee member; Chandrasekaran, Venkatachalam, committee member
The growth in observational data volumes over the past decade has occurred alongside a need to make sense of the phenomena that underpin them. Visualization is a key component of the data wrangling process that precedes the analyses that inform these insights. The crux of this study is interactive visualization of spatiotemporal phenomena from voluminous datasets. Spatiotemporal visualizations of voluminous datasets introduce challenges relating to interactivity, overlaying multiple datasets, dynamic feature selection, resource capacity constraints, and scaling. Our methodology to address these challenges relies on a novel mix of algorithms and systems innovations working in concert to ensure effective apportioning and amortization of workloads, enabling interactivity during visualizations. In particular, our research prototype, Aperture, leverages sketching algorithms, effective query predicate generation and evaluation, avoidance of performance hotspots, coprocessor-based hardware acceleration, and convolutional neural network based encoders to render visualizations while preserving responsiveness and interactivity. Finally, we also explore issues in effective containerization to support visualization workloads. We report on several empirical benchmarks that profile and demonstrate the suitability of our methodology for preserving interactivity while utilizing resources effectively at scale.

Item Open Access Determining disease outbreak influence from voluminous epidemiology data on enhanced distributed graph-parallel system (Colorado State University. Libraries, 2017)
Shah, Naman, author; Pallickara, Sangmi Lee, advisor; Pallickara, Shrideep, committee member; Turk, Daniel E., committee member
Historically, catastrophe has resulted from large-scale epidemiological outbreaks in livestock populations. Efforts to prepare for these inevitable disasters are critical, and these efforts primarily involve the efficient use of limited available resources. Therefore, determining the relative influence of the entities involved in large-scale outbreaks is essential. Planning for outbreaks often involves executing compute-intensive disease spread simulations. To capture the probabilities of various outcomes, these simulations are executed several times over a collection of representative input scenarios, producing voluminous data. The resulting datasets contain valuable insights, including sequences of events that lead to extreme outbreaks. However, discovering and leveraging such information is also computationally expensive. This thesis proposes a distributed approach for aggregating and analyzing voluminous epidemiology data to determine the relative influence of the entities in a disease outbreak using the PageRank algorithm. Using the Disease Transmission Network (DTN) established in this research, planners or analysts can allocate limited resources, such as vaccinations and field personnel, effectively by observing the relative influence of each entity. To improve the performance of the analysis execution pipeline, an extension to the Apache Spark GraphX distributed graph-parallel system has been proposed.
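The outbreak-influence item above applies PageRank to a Disease Transmission Network built from simulation output. A small-scale illustration with networkx is shown below; the thesis runs this on an extended Apache Spark GraphX system, and the edge list (hypothetical herd-to-herd transmission counts aggregated across simulation runs) is invented for the example.

```python
# Toy illustration: rank outbreak entities by PageRank over a weighted DTN.
import networkx as nx

transmissions = [  # (source_herd, infected_herd, transmission count across simulations)
    ("farm_A", "farm_B", 120),
    ("farm_A", "farm_C", 45),
    ("farm_B", "farm_D", 80),
    ("farm_C", "farm_D", 10),
]

G = nx.DiGraph()
for src, dst, count in transmissions:
    G.add_edge(src, dst, weight=count)

influence = nx.pagerank(G, alpha=0.85, weight="weight")
for node, score in sorted(influence.items(), key=lambda kv: -kv[1]):
    print(f"{node}: {score:.3f}")  # farms that transmission paths converge on rank highest here
```

How the edges are oriented and weighted determines whether the ranking emphasizes likely spreaders or likely receivers, which in turn shapes where vaccinations and field personnel would be directed.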
Item Open Access Embedding based clustering of time series data using dynamic time warping (Colorado State University. Libraries, 2022)
Mendis, R. A. C. Laksheen, author; Pallickara, Sangmi Lee, advisor; Pallickara, Shrideep, committee member; Hayne, Stephen, committee member
Voluminous time-series observational data impose challenges pertaining to storage and analytics. Identifying patterns in such climate time-series data is critical for many geospatial applications. Over recent years, clustering has become a key computational technique for identifying patterns/clusters. However, data with complex structures and high dimensionality can lead to uninformative clusters and hinder the quality of clustering. In this research, we use state-of-the-art autoencoders with LSTMs, Bidirectional LSTMs, and GRUs to learn highly non-linear mapping functions by training the networks with subsequences of time series to perform data reconstruction. Next, we extract the trained encoders to generate embeddings, which are lightweight. These embeddings are more space efficient than the original time-series data and require less computational power and fewer resources for further processing. In the final clustering step, instead of using common distance-based metrics such as Euclidean distance, we use DTW, an algorithm for computing similarity between time series that ignores variations in speed, to calculate similarity between the embeddings when applying the k-Means algorithm. Based on Silhouette scores, this method generates better clusters than other reduction techniques.
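Since the clustering item above hinges on replacing Euclidean distance with Dynamic Time Warping when clustering the learned embeddings, a plain-Python sketch of the DTW distance and the corresponding k-Means assignment step is given below. The autoencoder (LSTM/BiLSTM/GRU) training that produces the embeddings, and the full k-Means loop, are omitted; this is only the similarity computation, not the thesis code.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping moves.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def assign_to_cluster(embedding, centroids):
    """k-Means assignment step with DTW in place of Euclidean distance."""
    return min(range(len(centroids)),
               key=lambda k: dtw_distance(embedding, centroids[k]))
```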
Item Open Access Enabling autoscaling for in-memory storage in cluster computing framework (Colorado State University. Libraries, 2019)
Shrestha, Bibek Raj, author; Pallickara, Sangmi Lee, advisor; Pallickara, Shrideep, committee member; Hayne, Stephen C., committee member
IoT-enabled devices and observational instruments continuously generate voluminous data. A large portion of these datasets is delivered with associated geospatial locations. The increased volumes of geospatial data, alongside emerging geospatial services, pose computational challenges for large-scale geospatial analytics. We have designed and implemented STRETCH, an in-memory distributed geospatial storage system that preserves spatial proximity and enables proactive autoscaling for frequently accessed data. STRETCH stores data with a delayed data dispersion scheme that incrementally adds data nodes to the storage system. We have devised an autoscaling feature that proactively repartitions data to alleviate computational hotspots before they occur. We compared the performance of STRETCH with Apache Ignite, and the results show that STRETCH provides up to 3 times the throughput when the system encounters hotspots. STRETCH is built on Apache Spark and Ignite and interacts with them at runtime.

Item Open Access Harnessing spatiotemporal data characteristics to facilitate large-scale analytics over voluminous, high-dimensional observational datasets (Colorado State University. Libraries, 2021)
Rammer, Daniel P., author; Pallickara, Shrideep, advisor; Pallickara, Sangmi Lee, advisor; Ghosh, Sudipto, committee member; Breidt, Jay, committee member
Spatiotemporal data volumes have increased exponentially alongside a need to extract knowledge from them. We propose a methodology, encompassing a suite of algorithmic and systems innovations, to accomplish spatiotemporal data analysis at scale. Our methodology partitions and distributes data to reconcile the competing pulls of dispersion and load balancing. The dispersion schemes are informed by designing distributed data structures to organize metadata in support of expressive query evaluations and high-throughput data retrievals. Targeted, sequential disk block accesses and data sketching techniques are leveraged for effective retrievals. We facilitate seamless integration into data processing frameworks and analytical engines by building compliance with the Hadoop Distributed File System. A refinement of our methodology supports memory residency and dynamic materialization of data (or subsets thereof) as DataFrames, Datasets, and Resilient Distributed Datasets. These refinements are backed by speculative prefetching schemes that manage speed differentials across the data storage hierarchy. We extend the data-centric view of our methodology to the orchestration of deep learning workloads while preserving accuracy and ensuring faster completion times. Finally, we assess the suitability of our methodology using diverse high-dimensional datasets, myriad model-fitting algorithms (including ensemble methods and deep neural networks), and multiple data processing frameworks such as Hadoop, Spark, TensorFlow, and PyTorch.
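The item above emphasizes HDFS compliance and on-demand materialization of data as Spark DataFrames. The generic PySpark pattern below illustrates the kind of access such compliance enables (read an HDFS-resident collection, push down a spatiotemporal predicate, and cache the working set); the path and column names are hypothetical, and the actual system layers its own partitioning, metadata indexing, and prefetching underneath.

```python
# Generic PySpark access pattern, not the system described in the item above.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spatiotemporal-subset").getOrCreate()

obs = spark.read.parquet("hdfs:///data/observations/")  # hypothetical dataset path
subset = (obs
          .filter((F.col("lat").between(37.0, 41.0)) &
                  (F.col("lon").between(-109.0, -102.0)) &
                  (F.col("timestamp").between("2021-06-01", "2021-08-31")))
          .select("lat", "lon", "timestamp", "temperature"))

subset.cache()   # keep the materialized subset memory-resident for repeated analytics
print(subset.count())
```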
Item Open Access Leveraging ensembles: balancing timeliness and accuracy for model training over voluminous datasets (Colorado State University. Libraries, 2020)
Budgaga, Walid, author; Pallickara, Shrideep, advisor; Pallickara, Sangmi Lee, advisor; Ben-Hur, Asa, committee member; Breidt, F. Jay, committee member
As data volumes increase, there is a pressing need to make sense of the data in a timely fashion. Voluminous datasets are often high dimensional, with individual data points representing a vector of features. Data scientists fit models to the data, using all features or a subset thereof, and then use these models to inform their understanding of phenomena or make predictions. The performance of these analytical models is assessed based on their accuracy and ability to generalize to unseen data. Several existing frameworks can be used for drawing insights from voluminous datasets. However, these frameworks have shortcomings, including limited scalability, limited applicability beyond a target domain, prolonged training times, poor resource utilization, and insufficient support for combining diverse model fitting algorithms. In this dissertation, we describe our methodology for scalable supervised learning over voluminous datasets. The methodology explores partitioning the feature space, building models over these partitioned subsets of the data, and the impact of doing so on training times and accuracy. Using our methodology, a practitioner can harness a mix of learning methods to build diverse models over the partitioned data. Rather than build a single, all-encompassing model, we construct an ensemble of models trained independently over different portions of the dataset. In particular, we rely on concurrent and independent learning from different portions of the data space to overcome the issues relating to resource utilization and completion times associated with distributed training of a single model over the entire dataset. Our empirical benchmarks are performed using datasets from diverse domains, including epidemiology, music, and weather. These benchmarks demonstrate the suitability of our methodology for reducing training times while preserving accuracy relative to a complex model trained on the entire dataset. In particular, our methodology utilizes resources effectively, amortizing I/O and CPU costs across a distributed environment while ensuring a significant reduction in network traffic during training.

Item Open Access Leveraging structural-context similarity of Wikipedia links to predict twitter user locations (Colorado State University. Libraries, 2017)
Huang, Chuanqi, author; Pallickara, Sangmi Lee, advisor; Pallickara, Shrideep, committee member; Hayne, Stephen C., committee member
Twitter is a widely used social media service. Several efforts have targeted understanding the patterns of information dissemination underlying this social network. A user's location is one of the most important information items relative to analyzing content. However, location information tends to be unavailable because most users do not (want to) include geo-tags in their tweets. To predict a user's location, existing approaches require voluminous training datasets of geo-tagged tweets. However, some characteristics of tweets, such as compact, non-traditional linguistic expressions, have posed significant challenges when applying model-fitting approaches. In this thesis, we propose a novel framework for predicting the location of a social media user by leveraging structural-context similarity over Wikipedia links. We measure SimRank scores between pages over the Wikipedia dump dataset and build a knowledge base, mapping location information (e.g., cities and states) to related vocabularies along with the likelihood of these mappings. Our results evolve as the user's tweet stream grows. We have implemented this framework using Apache Storm to observe real-time tweets. Finally, our framework provides a list of ranked "probable" cities based on the distances between candidate locations and their weights. This thesis includes empirical evaluations that demonstrate performance in line with current state-of-the-art location prediction approaches.
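The location-prediction item above rests on SimRank scores between Wikipedia pages. The toy example below uses networkx's simrank_similarity over an invented slice of the Wikipedia link graph to show how a vocabulary page can score closer to one city page than another; the thesis computes these scores over the full Wikipedia dump and streams tweets through Apache Storm, neither of which is reproduced here.

```python
import networkx as nx

# Hypothetical slice of the Wikipedia link graph (article -> linked article).
G = nx.DiGraph([
    ("Colorado", "Denver"), ("Colorado", "Skiing"), ("Colorado", "Rocky Mountains"),
    ("Winter sports", "Skiing"), ("Winter sports", "Denver"),
    ("Florida", "Miami"), ("Florida", "Beach"), ("Florida", "Everglades"),
])

# Structural-context similarity of the vocabulary page "Skiing" to candidate cities.
sims = nx.simrank_similarity(G, source="Skiing")
print(round(sims["Denver"], 3), round(sims["Miami"], 3))  # "Denver" scores higher
```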
Item Open Access Low-latency, query-driven analytics over voluminous multidimensional, spatiotemporal datasets (Colorado State University. Libraries, 2017)
Malensek, Matthew, author; Pallickara, Shrideep, advisor; Pallickara, Sangmi Lee, advisor; Bohm, A. P. Willem, committee member; Draper, Bruce, committee member; Breidt, F. Jay, committee member
Ubiquitous data collection from sources such as remote sensing equipment, networked observational devices, location-based services, and sales tracking has led to the accumulation of voluminous datasets; IDC projects that by 2020 we will generate 40 zettabytes of data per year, while Gartner and ABI estimate that 20-35 billion new devices will be connected to the Internet in the same time frame. The storage and processing requirements of these datasets far exceed the capabilities of modern computing hardware, which has led to the development of distributed storage frameworks that can scale out by assimilating more computing resources as necessary. While challenging in its own right, storing and managing voluminous datasets is only the precursor to a broader field of study: extracting knowledge, insights, and relationships from the underlying datasets. The basic building block of this knowledge discovery process is the analytic query, encompassing both query instrumentation and evaluation. This dissertation is centered around query-driven exploratory and predictive analytics over voluminous, multidimensional datasets. Both of these types of analysis represent a higher-level abstraction over classical query models; rather than indexing every discrete value for subsequent retrieval, our framework autonomously learns the relationships and interactions between dimensions in the dataset (including time series and geospatial aspects) and makes the information readily available to users. This functionality includes statistical synopses, correlation analysis, hypothesis testing, probabilistic structures, and predictive models that not only enable the discovery of nuanced relationships between dimensions, but also allow future events and trends to be predicted. This requires specialized data structures and partitioning algorithms, along with adaptive reductions in the search space and management of the inherent trade-off between timeliness and accuracy. The algorithms presented in this dissertation were evaluated empirically on real-world geospatial time-series datasets in a production environment, and are broadly applicable across other storage frameworks.
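Among the capabilities listed in the item above, statistical synopses are the simplest to illustrate: summary statistics maintained in one pass so that queries never have to touch every discrete value. The sketch below keeps Welford-style running means and variances per coarse spatial bin; the binning scheme and feature name are assumptions, and this is a generic technique rather than the dissertation's framework.

```python
from collections import defaultdict

class RunningStats:
    """One-pass mean/variance via Welford's algorithm."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

synopses = defaultdict(RunningStats)

def ingest(lat, lon, temperature):
    bin_key = (round(lat, 1), round(lon, 1))  # roughly 0.1-degree spatial bins
    synopses[bin_key].update(temperature)
```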
Item Open Access Prediction based scaling in a distributed stream processing cluster (Colorado State University. Libraries, 2020)
Khurana, Kartik, author; Pallickara, Sangmi Lee, advisor; Pallickara, Shrideep, committee member; Carter, Ellison, committee member
The proliferation of IoT sensors and applications has enabled us to monitor and analyze scientific and social phenomena with continuously arriving voluminous data. To provide real-time processing capabilities over streaming data, distributed stream processing engines (DSPEs) such as Apache STORM and Apache FLINK have been widely deployed. These frameworks support computations over large-scale, high-frequency streaming data. However, current on-demand auto-scaling features in these systems may result in inefficient resource utilization, which is closely related to cost effectiveness in popular cloud-based computing environments. We propose ARSTREAM, an auto-scaling computing environment that manages fluctuating throughputs for data from sensor networks while ensuring efficient resource utilization. We have built an Artificial Neural Network model for predicting data processing queues, and this model captures non-linear relationships between data arrival rates, resource utilization, and the size of the data processing queue. If a bottleneck is predicted, ARSTREAM scales out the current cluster automatically for current jobs without halting them at the user level. In addition, ARSTREAM incorporates threshold-based re-balancing to minimize data loss during extreme peak traffic that could not be predicted by our model. Our empirical benchmarks show that ARSTREAM forecasts data processing queue sizes with an RMSE of 0.0429 when tested on real-time data.
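The prediction step in the item above is a regression from stream and resource metrics to the expected processing-queue size. A minimal stand-in is sketched below with scikit-learn's MLPRegressor on synthetic data; the feature set, the data, and the network shape are assumptions and do not reflect the actual ARSTREAM model, which reported an RMSE of 0.0429 on real-time data.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

# Synthetic training data: queue size grows with arrival rate and CPU saturation.
rng = np.random.default_rng(0)
arrival_rate = rng.uniform(100, 5000, 2000)   # tuples per second
cpu_util = rng.uniform(0.1, 0.95, 2000)       # fraction of CPU in use
queue_size = 0.02 * arrival_rate * (1 + 3 * cpu_util ** 2) + rng.normal(0, 5, 2000)

X = np.column_stack([arrival_rate, cpu_util])
model = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0)
model.fit(X[:1500], queue_size[:1500])

pred = model.predict(X[1500:])
rmse = np.sqrt(mean_squared_error(queue_size[1500:], pred))
print(f"RMSE on held-out data: {rmse:.2f}")
# A forecast of a large queue would be the trigger for scaling the cluster out.
```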
Item Open Access Toward effective high-throughput georeferencing over voluminous observational data in the domain of precision agriculture (Colorado State University. Libraries, 2018)
Roselius, Maxwell L., author; Pallickara, Sangmi Lee, advisor; Pallickara, Shrideep, committee member; McKay, John, committee member
Remote sensing of plant traits and their environment facilitates non-invasive, high-throughput monitoring of the plant's physiological characteristics. Effective ingestion of these sensing data into a storage subsystem, while georeferencing phenotyping setups, is key to providing timely access to scientists and modelers. In this thesis, we propose a high-throughput distributed data ingestion framework with support for fine-grained georeferencing. The methodology includes a novel spatial indexing scheme, the nested hash grid, for fine-grained georeferencing of data while conserving memory footprints and ensuring acceptable latency. We include empirical evaluations performed on a commodity machine cluster with up to 1 TB of data. The benchmarks demonstrate the efficacy of our approach.

Item Open Access Towards generating a pre-training image transformer framework for preserving spatio-spectral properties in hyperspectral satellite images (Colorado State University. Libraries, 2024)
Faruk, Tanjim Bin, author; Pallickara, Sangmi Lee, advisor; Pallickara, Shrideep, advisor; Cotrufo, M. Francesca, committee member
Hyperspectral images facilitate advanced geospatial analysis without the need for expensive ground surveys. Machine learning approaches are particularly well-suited for handling the geospatial coverage required by these applications. While self-supervised learning is a promising methodology for managing voluminous datasets with limited labels, existing encoders in self-supervised learning face challenges when applied to hyperspectral images due to the large number of spectral channels. We propose a novel hyperspectral image encoding framework designed to generate highly representative embeddings for subsequent geospatial analysis. Our framework extends the Vision Transformer model with dynamic masking strategies to enhance model performance in regions with high spatial variability. We introduce a novel loss function that incorporates spectral quality metrics and employs a unique channel grouping strategy to leverage spectral similarity across channels. We demonstrate the effectiveness of our approach through a downstream model for estimating soil texture at a 30-meter resolution.

Item Open Access Towards interactive betweenness centrality estimation for transportation network using capsule network (Colorado State University. Libraries, 2022)
Matin, Abdul, author; Pallickara, Sangmi Lee, advisor; Pallickara, Shrideep, committee member; Bhaskar, Aditi S., committee member
The importance of nodes in a graph must be estimated for many graph-based applications. One of the most popular metrics for measuring node importance is betweenness centrality, which measures the amount of influence a node has over the flow of information in a graph. However, the computational complexity of calculating betweenness centrality is extremely high for large-scale graphs. This is especially true when analyzing the road networks of states with millions of nodes and edges, making it infeasible to calculate their betweenness centrality (BC) in real time using traditional iterative methods. Applying a machine learning model to predict the importance of nodes provides an opportunity to address this issue. Graph Neural Networks (GNNs), which have been gaining popularity in recent years, are particularly well-suited for graph analysis. In this study, we propose RoadCaps, a deep learning architecture that estimates BC by merging Capsule Neural Networks with Graph Convolutional Networks (GCNs), a convolution-based class of GNNs. We target the effective aggregation of features from neighbor nodes to approximate the correct BC of a node. We leverage the pattern-capturing strength of the capsule network to estimate node-level BC from the high-level information generated by the GCN block. We further compare the accuracy and effectiveness of RoadCaps with two other GCN-based models. We also analyze the efficiency and effectiveness of RoadCaps along dimensions such as scalability and robustness. We perform an empirical benchmark with the road network of the entire state of California. The overall analysis shows that our proposed network provides more accurate road importance estimation, which is helpful for rapid response planning such as evacuation during wildfires and flooding.
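For context on why the estimation problem in the item above is hard, the exact computation it approximates is Brandes' betweenness centrality, which is practical only on small graphs. The snippet below runs it with networkx on an invented toy road graph; a learned estimator such as the RoadCaps architecture described above would be trained to reproduce these scores on state-scale networks where the exact algorithm is too slow for real-time use.

```python
import networkx as nx

# Toy road graph; edge weights stand in for segment travel costs.
road = nx.Graph()
road.add_weighted_edges_from([
    ("A", "B", 2.0), ("B", "C", 1.5), ("C", "D", 2.5),
    ("A", "E", 3.0), ("E", "C", 1.0), ("D", "F", 2.0), ("E", "F", 4.0),
])

# Exact betweenness centrality (Brandes' algorithm) as the ground truth to learn.
bc = nx.betweenness_centrality(road, weight="weight", normalized=True)
for node, score in sorted(bc.items(), key=lambda kv: -kv[1]):
    print(node, round(score, 3))
```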