Harnessing spatiotemporal data characteristics to facilitate large-scale analytics over voluminous, high-dimensional observational datasets

Rammer, Daniel P., author; Pallickara, Shrideep, advisor; Pallickara, Sangmi Lee, advisor; Ghosh, Sudipto, committee member; Breidt, Jay, committee member

dc.contributor.author	Rammer, Daniel P., author
dc.contributor.author	Pallickara, Shrideep, advisor
dc.contributor.author	Pallickara, Sangmi Lee, advisor
dc.contributor.author	Ghosh, Sudipto, committee member
dc.contributor.author	Breidt, Jay, committee member
dc.date.accessioned	2022-01-07T11:29:57Z
dc.date.available	2022-01-07T11:29:57Z
dc.date.issued	2021
dc.description.abstract	Spatiotemporal data volumes have increased exponentially alongside a need to extract knowledge from them. We propose a methodology, encompassing a suite of algorithmic and systems innovations, to accomplish spatiotemporal data analysis at scale. Our methodology partitions and distributes data to reconcile the competing pulls of dispersion and load balancing. The dispersion schemes are informed by designing distributed data structures to organize metadata in support of expressive query evaluations and high-throughput data retrievals. Targeted, sequential disk block accesses and data sketching techniques are leveraged for effective retrievals. We facilitate seamless integration into data processing frameworks and analytical engines by building compliance for the Hadoop Distributed File System. A refinement of our methodology supports memory-residency and dynamic materialization of data (or subsets thereof) as DataFrames, Datasets, and Resilient Distributed Datasets. These refinements are backed by speculative prefetching schemes that manage speed differentials across the data storage hierarchy. We extend the data-centric view of our methodology to orchestration of deep learning workloads while preserving accuracy and ensuring faster completion times. Finally, we assess the suitability of our methodology using diverse high-dimensional datasets, myriad model fitting algorithms (including ensemble methods and deep neural networks), and multiple data processing frameworks such as Hadoop, Spark, TensorFlow, and PyTorch.
dc.format.medium	born digital
dc.format.medium	doctoral dissertations
dc.identifier	Rammer_colostate_0053A_16814.pdf
dc.identifier.uri	https://hdl.handle.net/10217/234231
dc.language	English
dc.language.iso	eng
dc.publisher	Colorado State University. Libraries
dc.relation.ispartof	2020-
dc.rights	Copyright and other restrictions may apply. User is responsible for compliance with all applicable laws. For information about copyright law, please see https://libguides.colostate.edu/copyright.
dc.subject	distributed analytics
dc.subject	spatiotemporal
dc.subject	distributed systems
dc.subject	big data
dc.title	Harnessing spatiotemporal data characteristics to facilitate large-scale analytics over voluminous, high-dimensional observational datasets
dc.type	Text
dcterms.rights.dpla	This Item is protected by copyright and/or related rights (https://rightsstatements.org/vocab/InC/1.0/). You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).
thesis.degree.discipline	Computer Science
thesis.degree.grantor	Colorado State University
thesis.degree.level	Doctoral
thesis.degree.name	Doctor of Philosophy (Ph.D.)

Files

Original bundle

Name: Rammer_colostate_0053A_16814.pdf
Size: 2.23 MB
Format: Adobe Portable Document Format

Collections

2020-
Theses and Dissertations