Efficient exploration of diverse datasets through harmonization of encodings and representations

Abstract

The generation of voluminous spatiotemporal data has been accompanied by tremendous diversity in that data. Reconciling and harmonizing data, including units of measurement, is a precursor to efficient downstream analysis. However, the diversity of encoding formats, spatial and coordinate reference systems, data types (points, shapes, rasters/grids), data volumes, and storage strategies can stymie analyses. The complexity is exacerbated because datasets are often "layered", i.e., more than one dataset is considered during an analysis. Without systematic harmonization, valuable insights can remain inaccessible, locked away by technical incompatibilities. In this study, we describe our methodology to not only harmonize such datasets but also layer them into federated datasets, alongside an ecosystem of services including dimensionality reduction, query evaluation, correlation analysis, normalization, and visualization. Together, these capabilities allow researchers to move from raw, fragmented data toward integrated, interpretable results with significantly reduced friction during analyses. These services are amenable to daisy chaining, operate on distributed datasets, integrate with established distributed approaches, and scale. Our benchmarks contrast performance with systems such as Spark and Sedona.

Description

Rights Access

Embargo expires: 01/07/2027.

Subject

data wrangling
harmonization
visual analytics
federation
big data
metadata
