A framework to support distributional similarity analysis over arbitrary spatiotemporal scopes at scale
| dc.contributor.author | Hansen, Paige, author | |
| dc.contributor.author | Pallickara, Shrideep, advisor | |
| dc.contributor.author | Lee Pallickara, Sangmi, advisor | |
| dc.contributor.author | Arabi, Mazdak, committee member | |
| dc.date.accessioned | 2026-01-12T11:27:39Z | |
| dc.date.issued | 2025 | |
| dc.description.abstract | Our methodology leverages a mix of statistical, algorithmic, and systems techniques to enable efficient, memory-resident similarity analysis. Rather than relying on a fixed metric, similarity thresholds are derived from the characteristics of each variable, allowing the measure to remain sensitive to intra-dataset variation. We employ the Jensen–Shannon divergence for its symmetry and boundedness. We also summarize probability density functions as compact 4-tuples that allow navigation through extents based on their degree of similarity. A refinement of this representation further allows differential scaling across dimensions to extend the scope of analysis. The contributions of this thesis are threefold. First, we compute variable-specific thresholds that adapt similarity scoring to the distributional features of the data. Second, we introduce a novel distance-based measure that prunes the search space without compromising accuracy. Third, we demonstrate the ability to perform distributional analyses, both comprehensive and interactive, across arbitrary spatiotemporal scopes, with near real-time calculation of thresholds and similarity estimates. Our empirical benchmarks, with multivariate datasets spanning 50 years of complex, evolving climate phenomena, validate these design choices and underscore the suitability of the methodology for large-scale, longitudinal datasets. Our methodology yields a speedup of three orders of magnitude over Apache Druid, a leading framework for distributional analysis at scale. | |
| dc.format.medium | born digital | |
| dc.format.medium | masters theses | |
| dc.identifier | Hansen_colostate_0053N_19286.pdf | |
| dc.identifier.uri | https://hdl.handle.net/10217/242670 | |
| dc.identifier.uri | https://doi.org/10.25675/3.025562 | |
| dc.language | English | |
| dc.language.iso | eng | |
| dc.publisher | Colorado State University. Libraries | |
| dc.relation.ispartof | 2020- | |
| dc.rights | Copyright and other restrictions may apply. User is responsible for compliance with all applicable laws. For information about copyright law, please see https://libguides.colostate.edu/copyright. | |
| dc.subject | declarative queries | |
| dc.subject | spatiotemporally evolving phenomena | |
| dc.subject | distributional similarity | |
| dc.subject | big data | |
| dc.title | A framework to support distributional similarity analysis over arbitrary spatiotemporal scopes at scale | |
| dc.type | Text | |
| dc.type | Image | |
| dcterms.rights.dpla | This Item is protected by copyright and/or related rights (https://rightsstatements.org/vocab/InC/1.0/). You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s). | |
| thesis.degree.discipline | Computer Science | |
| thesis.degree.grantor | Colorado State University | |
| thesis.degree.level | Masters | |
| thesis.degree.name | Master of Science (M.S.) |
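
The abstract cites the Jensen–Shannon divergence for its symmetry and boundedness. The following is a minimal illustrative sketch of those two properties for discrete distributions; it is not the thesis's implementation. The function names `kl_divergence` and `js_divergence` are hypothetical, and the inputs are assumed to be already-normalized histograms over the same bins.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits.

    Bins where p is zero contribute nothing to the sum.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions, in bits.

    Assumes p and q are normalized histograms over identical bins
    (an assumption of this sketch, not something stated in the abstract).
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    # The mixture m is positive wherever p or q is positive,
    # so the KL terms below never divide by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


if __name__ == "__main__":
    p = [0.1, 0.4, 0.5]
    q = [0.3, 0.3, 0.4]
    print(js_divergence(p, q), js_divergence(q, p))  # symmetric: same value both ways
    print(js_divergence([1.0, 0.0], [0.0, 1.0]))     # disjoint support: the upper bound
```

With base-2 logarithms the divergence lies in the interval [0, 1]: identical distributions score 0 and distributions with disjoint support score 1, which is what makes a fixed-range similarity threshold meaningful.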
Files
Original bundle
- Name: Hansen_colostate_0053N_19286.pdf
- Size: 264.87 KB
- Format: Adobe Portable Document Format
