Browsing by Author "Pallickara, Sangmi, advisor"
Item Open Access: Containerization of model fitting workloads over spatial datasets (Colorado State University. Libraries, 2021)
Warushavithana, Menuka, author; Pallickara, Shrideep, advisor; Pallickara, Sangmi, advisor; Breidt, Jay, committee member
Spatial data volumes have grown exponentially over the past several years. Domains in which spatial data are extensively leveraged include the atmospheric sciences, environmental monitoring, ecological modeling, epidemiology, sociology, commerce, and social media, among others. These data are often used to understand phenomena and inform decision making by fitting models to them. In this study, we present our methodology to fit models at scale over spatial data. Our methodology encompasses segmentation, spatial similarity based on the dataset(s) under consideration, and transfer learning schemes informed by that spatial similarity to train models faster while using fewer resources. We consider several model fitting algorithms and execution within containerized environments as we profile the suitability of our methodology. Our benchmarks confirm that our methodology enables faster, resource-efficient training of models over spatial data.

Item Open Access: GeoLens: enabling interactive visual analytics over large-scale, multidimensional geospatial datasets (Colorado State University. Libraries, 2015)
Koontz, Jared, author; Pallickara, Sangmi, advisor; Pallickara, Shrideep, committee member; Schumacher, Russ, committee member
With the rapid increase in scientific data volumes, interactive tools that enable effective visual representation for scientists are needed. This is critical when scientists manipulate voluminous datasets, and especially when they need to explore datasets interactively to develop their hypotheses. In this paper, we present GeoLens, an interactive visual analytics framework. GeoLens provides fast and expressive interactions with voluminous geospatial datasets. We provide an expressive visual query evaluation scheme to support advanced interactive visual analytics techniques such as brushing and linking. To achieve this, we designed and developed a geohash-based image tile generation algorithm that automatically adjusts the range of data to access based on the minimum acceptable size of the image tile. In addition, we have designed an autonomous histogram generation algorithm that generates histograms of user-defined data subsets that do not have pre-computed data properties. Using our approach, applications can generate histograms of datasets containing millions of data points with sub-second latency. The work builds on our visual query coordination scheme, which evaluates geospatial queries and orchestrates data aggregation in a distributed storage environment while preserving data locality and minimizing data movement. This paper includes empirical benchmarks of our framework over a billion-file dataset published by the National Climatic Data Center.
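As an illustration of the geohash-based tiling idea described in the GeoLens abstract above, the following Python sketch picks the finest geohash precision whose cell still spans a minimum acceptable tile extent and uses the resulting geohash as a tile key. The encoder, the precision table, and the function names are assumptions for illustration only, not the thesis's actual implementation.

# Illustrative sketch only: a geohash-based tile keying scheme in the spirit of
# the GeoLens tile generation algorithm. Names and heuristics are assumptions.

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, precision):
    """Encode a (lat, lon) pair into a geohash string of the given precision."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    bits, ch, even, code = 0, 0, True, []
    while len(code) < precision:
        rng, val = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        ch = ch * 2
        if val >= mid:
            ch += 1
            rng[0] = mid
        else:
            rng[1] = mid
        even = not even
        bits += 1
        if bits == 5:
            code.append(BASE32[ch])
            bits, ch = 0, 0
    return "".join(code)

def tile_key(lat, lon, min_tile_degrees=1.0):
    """Pick the finest geohash precision whose cell still spans at least
    min_tile_degrees of latitude, so a tile is never smaller than the minimum
    acceptable extent, and use that geohash as the tile key."""
    # Approximate latitude span (degrees) of a geohash cell at each precision.
    lat_extent = {1: 45.0, 2: 5.625, 3: 1.40625, 4: 0.17578125, 5: 0.0439453125}
    precision = max((p for p, ext in lat_extent.items()
                     if ext >= min_tile_degrees), default=1)
    return geohash_encode(lat, lon, precision)

if __name__ == "__main__":
    # Points that fall in the same geohash cell share a tile key, so they can be
    # rendered into (or fetched for) the same image tile.
    print(tile_key(40.5734, -105.0865, min_tile_degrees=1.0))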
Item Open Access: Identification and characterization of super-spreaders from voluminous epidemiology data (Colorado State University. Libraries, 2016)
Shah, Harshil, author; Pallickara, Shrideep, advisor; Pallickara, Sangmi, advisor; Breidt, F. Jay, committee member
Planning for large-scale epidemiological outbreaks often involves executing compute-intensive disease spread simulations. To capture the probabilities of various outcomes, these simulations are executed several times over a collection of representative input scenarios, producing voluminous data. The resulting datasets contain valuable insights, including sequences of events, such as super-spreading events, that lead to extreme outbreaks. However, discovering and leveraging such information is also computationally expensive. In this thesis, we propose a distributed approach for analyzing voluminous epidemiology data to locate and classify the super-spreaders in a disease network. Our methodology constructs analytical models using features extracted from the epidemiology data. The analytical models are amenable to interpretation, and disease planners can use them to identify super-spreaders that have a disproportionate effect on epidemiological outcomes, enabling effective allocation of limited resources such as vaccinations and field personnel.

Item Open Access: On the use of locality aware distributed hash tables for homology searches over voluminous biological sequence data (Colorado State University. Libraries, 2015)
Tolooee, Cameron, author; Pallickara, Sangmi, advisor; Ben-Hur, Asa, committee member; von Fischer, Joseph, committee member
Rapid advances in genomic sequencing technology have resulted in a data deluge in biology and bioinformatics. This increase in data volumes has introduced computational challenges for frequently performed sequence analytics routines, such as DNA and protein homology searches, which must also preferably be done in real time. This thesis proposes Mendel, a scalable and similarity-aware distributed storage framework that enables retrieval of biologically significant DNA and protein alignments against a voluminous genomic sequence database. Mendel fragments the sequence data and generates an inverted index, which is then dispersed over a distributed collection of machines using a locality-aware distributed hash table. A novel distributed nearest neighbor search algorithm identifies sequence segments with high similarity and splices them together to form an alignment. This thesis includes an empirical evaluation of the performance, sensitivity, and scalability of the proposed system over NCBI's non-redundant protein dataset. In these benchmarks, Mendel demonstrates higher sensitivity and faster query evaluations than other modern frameworks.
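To make the locality-aware placement idea in the Mendel abstract above more concrete, here is a small Python sketch in which a sequence fragment's k-mer prefix, rather than a uniform hash of the whole fragment, decides which node of a hash ring stores its index entries; fragments that share a prefix (and are therefore likely similar) land on the same or neighboring nodes. The ring construction, prefix length, and class names are assumptions for illustration, not Mendel's actual design.

# Illustrative sketch only: locality-aware placement of sequence fragments on a
# toy DHT ring, in the spirit of Mendel's inverted index. Names are assumptions.

import hashlib
from bisect import bisect_right

class LocalityAwareRing:
    """Fragments with the same k-mer prefix map to nearby ring positions, so a
    nearest-neighbor search over similar segments touches few machines."""

    def __init__(self, nodes, prefix_len=4):
        self.prefix_len = prefix_len
        # Place each node at a pseudo-random point on a 32-bit ring.
        self.ring = sorted(
            (int(hashlib.md5(n.encode()).hexdigest(), 16) % (2**32), n)
            for n in nodes
        )

    def _prefix_point(self, fragment):
        # Map the k-mer prefix to the ring while preserving lexicographic
        # locality: 'ACGT...' and 'ACGA...' produce nearby points.
        point = 0
        for ch in fragment[: self.prefix_len].upper():
            point = point * 5 + "ACGTN".find(ch) + 1
        return (point * 2**20) % (2**32)

    def node_for(self, fragment):
        point = self._prefix_point(fragment)
        keys = [p for p, _ in self.ring]
        idx = bisect_right(keys, point) % len(self.ring)
        return self.ring[idx][1]

if __name__ == "__main__":
    ring = LocalityAwareRing([f"node-{i}" for i in range(8)])
    # Similar fragments share a prefix, and therefore the same home node.
    print(ring.node_for("ACGTTTGACCA"))
    print(ring.node_for("ACGTAAGACCA"))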
Item Open Access: Towards federated learning over large-scale streaming data (Colorado State University. Libraries, 2020)
Pereira, Aaron, author; Pallickara, Sangmi, advisor; Pallickara, Shrideep, committee member; Zahran, Sammy, committee member
Distributed Stream Processing Engines (DSPEs) have seen significant deployment growth alongside an increase in streaming data sources such as sensor networks. These DSPEs enable processing large amounts of streaming data in a cluster of commodity machines to extract knowledge and insights in real time. Because data arrival rates fluctuate in real-world applications, modern DSPEs often provide auto-scaling. However, existing designs of advanced analytical frameworks are not effectively aligned with scalable streaming computing environments. We have designed and developed ORCA, a federated learning architecture that supports the training of traditional Artificial Neural Networks as well as Convolutional Neural Network and Long Short-Term Memory network based models while ensuring resiliency during scaling. ORCA also introduces dynamic adjustment of the 'elasticity' hyper-parameter for rescaled computing environments; we estimate this elasticity hyper-parameter using reinforcement learning. Our empirical benchmarks show that ORCA achieves an MSE of 0.038 over real-world streaming datasets.

Item Open Access: Transformer, diffusion, and GAN-based augmentations for contrastive learning of visual representations (Colorado State University. Libraries, 2024)
Armstrong, Samuel, author; Pallickara, Sangmi, advisor; Pallickara, Shrideep, advisor; Ghosh, Sudipto, committee member; Breidt, F. Jay, committee member
Generative modeling and self-supervised learning have emerged as two of the most prominent fields of study in machine learning in recent years. Generative models can learn detailed visual representations that can then be used to generate synthetic data. Modern self-supervised learning methods can extract high-level visual information from images in an unsupervised manner and then apply this information to downstream tasks such as object detection and segmentation. As generative models become more advanced, we want to be able to extract their learned knowledge and apply it to downstream tasks. In this work, we develop Generative Contrastive Learning (GCL), a methodology that uses contrastive learning to extract information from modern generative models. We define GCL's high-level components, namely an encoder, a feature map augmenter, a decoder, a handcrafted augmenter, and a contrastive learning model, and we demonstrate how to apply GCL to the three major types of large generative models: GANs, diffusion models, and image transformers. Because generative models are complex and can produce a near-infinite number of unique images, we have developed several methodologies to synthesize images in a manner that complements the augmentation-based learning used in contrastive learning frameworks. Our work shows that applying these large generative models to self-supervised learning can be done in a computationally viable manner without large clusters of high-performance GPUs. Finally, we show the clear benefit of leveraging generative models in a contrastive learning setting using standard self-supervised learning benchmarks.
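Reading the GCL abstract above, one plausible shape for a single training step is sketched below: a frozen generative model produces one view of each image by perturbing its feature map and decoding it, a handcrafted augmentation produces the other view, and an InfoNCE-style loss pulls the pair together. All module names and the specific loss are placeholders for illustration, not the components defined in the thesis.

# Illustrative sketch only: a generative-augmentation contrastive step in the
# spirit of GCL. The callables passed in (encoder, augmenter, decoder,
# handcrafted_augment, contrastive_model) are assumed placeholders.

import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.5):
    """Standard InfoNCE-style loss over two batches of paired embeddings."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                # (B, B) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

def gcl_step(images, encoder, augmenter, decoder, handcrafted_augment,
             contrastive_model, optimizer):
    """One training step: the generative pathway (encode, perturb the feature
    map, decode) supplies a synthetic view, a handcrafted augmentation supplies
    the other view, and the contrastive model is trained to match the pair."""
    with torch.no_grad():
        feats = encoder(images)                       # frozen generative encoder
        synthetic = decoder(augmenter(feats))         # generative "augmentation"
    view_a = handcrafted_augment(images)              # crop/flip/jitter, etc.
    z1 = contrastive_model(view_a)
    z2 = contrastive_model(synthetic)
    loss = info_nce(z1, z2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()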