
A framework to support distributional similarity analysis over arbitrary spatiotemporal scopes at scale

Abstract

Our methodology combines statistical, algorithmic, and systems techniques to enable efficient, memory-resident similarity analysis. Rather than relying on a fixed metric, we derive similarity thresholds from the characteristics of each variable, so the measure remains sensitive to intra-dataset variation. We employ the Jensen–Shannon divergence for its symmetry and boundedness, and we summarize probability density functions as compact 4-tuples that allow navigation through extents based on their degree of similarity. A refinement of this representation allows differential scaling across dimensions, extending the scope of analysis. The contributions of this thesis are threefold. First, we compute variable-specific thresholds that adapt similarity scoring to the distributional features of the data. Second, we introduce a novel distance-based measure that prunes the search space without compromising accuracy. Third, we demonstrate distributional analyses, both comprehensive and interactive, across arbitrary spatiotemporal scopes, with near real-time calculation of thresholds and similarity estimates. Our empirical benchmarks, on multivariate datasets spanning 50 years of complex, evolving climate phenomena, validate these design choices and underscore the suitability of the methodology for large-scale, longitudinal datasets. Our methodology achieves a speedup of three orders of magnitude over Apache Druid, a leading framework for distributional analysis at scale.
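The two properties the abstract cites for the Jensen–Shannon divergence, symmetry and boundedness, can be illustrated with a minimal sketch. This is not the thesis implementation: the function names, the use of histogram bin counts as input, and the choice of base-2 logarithms (which bounds the divergence to [0, 1]) are assumptions made for illustration only.

```python
import math


def normalize(counts):
    """Turn non-negative bin counts into a discrete probability distribution."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("histogram must contain at least one observation")
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits.

    Bins where p is zero contribute nothing to the sum.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(counts_a, counts_b):
    """Jensen-Shannon divergence between two histograms over the same bins.

    Hypothetical helper: the inputs are assumed to be bin counts for one
    variable over two spatiotemporal extents.
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("histograms must share the same bins")
    p = normalize(counts_a)
    q = normalize(counts_b)
    # The mixture m is positive wherever p or q is positive,
    # so neither KL term can divide by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Made-up bin counts for one variable over two extents.
extent_a = [5, 20, 50, 20, 5]
extent_b = [10, 30, 40, 15, 5]
print(js_divergence(extent_a, extent_b))
```

Swapping the arguments leaves the value unchanged, identical histograms score 0, and histograms with no overlapping bins score 1, which is what makes a single variable-specific threshold on this scale meaningful.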

Subject

declarative queries
spatiotemporally evolving phenomena
distributional similarity
big data
