Harnessing spatiotemporal data characteristics to facilitate large-scale analytics over voluminous, high-dimensional observational datasets

dc.contributor.author: Rammer, Daniel P., author
dc.contributor.author: Pallickara, Shrideep, advisor
dc.contributor.author: Pallickara, Sangmi Lee, advisor
dc.contributor.author: Ghosh, Sudipto, committee member
dc.contributor.author: Breidt, Jay, committee member
dc.date.accessioned: 2022-01-07T11:29:57Z
dc.date.available: 2022-01-07T11:29:57Z
dc.date.issued: 2021
dc.description.abstract: Spatiotemporal data volumes have increased exponentially alongside a need to extract knowledge from them. We propose a methodology, encompassing a suite of algorithmic and systems innovations, to accomplish spatiotemporal data analysis at scale. Our methodology partitions and distributes data to reconcile the competing pulls of dispersion and load balancing. The dispersion schemes are informed by designing distributed data structures to organize metadata in support of expressive query evaluations and high-throughput data retrievals. Targeted, sequential disk block accesses and data sketching techniques are leveraged for effective retrievals. We facilitate seamless integration into data processing frameworks and analytical engines by building compliance for the Hadoop Distributed File System. A refinement of our methodology supports memory-residency and dynamic materialization of data (or subsets thereof) as DataFrames, Datasets, and Resilient Distributed Datasets. These refinements are backed by speculative prefetching schemes that manage speed differentials across the data storage hierarchy. We extend the data-centric view of our methodology to orchestration of deep learning workloads while preserving accuracy and ensuring faster completion times. Finally, we assess the suitability of our methodology using diverse high-dimensional datasets, myriad model fitting algorithms (including ensemble methods and deep neural networks), and multiple data processing frameworks such as Hadoop, Spark, TensorFlow, and PyTorch.
dc.format.medium: born digital
dc.format.medium: doctoral dissertations
dc.identifier: Rammer_colostate_0053A_16814.pdf
dc.identifier.uri: https://hdl.handle.net/10217/234231
dc.language: English
dc.language.iso: eng
dc.publisher: Colorado State University. Libraries
dc.relation.ispartof: 2020-
dc.rights: Copyright and other restrictions may apply. User is responsible for compliance with all applicable laws. For information about copyright law, please see https://libguides.colostate.edu/copyright.
dc.subject: distributed analytics
dc.subject: spatiotemporal
dc.subject: distributed systems
dc.subject: big data
dc.title: Harnessing spatiotemporal data characteristics to facilitate large-scale analytics over voluminous, high-dimensional observational datasets
dc.type: Text
dcterms.rights.dpla: This Item is protected by copyright and/or related rights (https://rightsstatements.org/vocab/InC/1.0/). You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).
thesis.degree.discipline: Computer Science
thesis.degree.grantor: Colorado State University
thesis.degree.level: Doctoral
thesis.degree.name: Doctor of Philosophy (Ph.D.)

Files

Original bundle
Name: Rammer_colostate_0053A_16814.pdf
Size: 2.23 MB
Format: Adobe Portable Document Format