A framework for real-time, autonomous anomaly detection over voluminous time-series geospatial data streams

Date

2014

Authors

Budgaga, Walid, author
Pallickara, Shrideep, advisor
Pallickara, Sangmi Lee, advisor
Ben-Hur, Asa, committee member
Schumacher, Russ, committee member

Abstract

In this work we present an approach, encompassing both algorithm and system design, to detect anomalies in data streams. Individual observations within these streams are multidimensional, with each dimension corresponding to a feature of interest. We consider time-series geospatial datasets generated by remote and in situ observational devices. Three aspects make this problem particularly challenging: (1) the cumulative volume and rate of data arrivals, (2) the evolution of anomalies over time, and (3) the spatio-temporal correlations associated with the data. Anomaly detection must therefore be accurate and performed in real time. Given the data volumes involved, solutions must minimize user intervention and be amenable to distributed processing to ensure scalability. Our approach achieves accurate, high-throughput classifications in real time. We rely on Expectation Maximization (EM) to build Gaussian Mixture Models (GMMs) that model the densities of the training data. Rather than one all-encompassing model, our approach involves multiple model instances, each of which is responsible for a particular geographical extent and can adapt as the data evolves. We have incorporated these algorithms into our distributed storage platform, Galileo, and profiled their suitability through empirical analysis, which demonstrates high throughput (10,000 observations per second, per node) and low latency on real-world datasets.
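
The idea of fitting one EM-trained GMM per geographical extent and flagging low-density observations can be illustrated with a minimal sketch. It is not the thesis implementation and does not use Galileo: the fixed latitude/longitude grid (`region_key`), the diagonal-covariance simplification, the quantile-based threshold, and all class and parameter names are assumptions made for illustration. Model adaptation as the data evolves is not covered here.

```python
import math
import random
from collections import defaultdict


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


class DiagonalGMM:
    """Gaussian mixture with diagonal covariances, fitted by EM (assumed simplification)."""

    def __init__(self, k, seed=0):
        self.k = k
        self.rng = random.Random(seed)

    def fit(self, data, iterations=50, min_var=1e-6):
        # data: list of equal-length feature tuples, with len(data) >= k.
        n, d = len(data), len(data[0])
        # Initialise means from random samples and variances from the data.
        self.means = [list(p) for p in self.rng.sample(data, self.k)]
        centre = [sum(p[j] for p in data) / n for j in range(d)]
        spread = [max(sum((p[j] - centre[j]) ** 2 for p in data) / n, min_var)
                  for j in range(d)]
        self.vars = [list(spread) for _ in range(self.k)]
        self.weights = [1.0 / self.k] * self.k
        for _ in range(iterations):
            # E-step: responsibility of each component for each point.
            resp = []
            for x in data:
                logs = self._component_logs(x)
                norm = logsumexp(logs)
                resp.append([math.exp(v - norm) for v in logs])
            # M-step: re-estimate weights, means and variances.
            for c in range(self.k):
                nc = max(sum(r[c] for r in resp), 1e-12)
                self.weights[c] = nc / n
                self.means[c] = [sum(r[c] * x[j] for r, x in zip(resp, data)) / nc
                                 for j in range(d)]
                self.vars[c] = [max(sum(r[c] * (x[j] - self.means[c][j]) ** 2
                                        for r, x in zip(resp, data)) / nc, min_var)
                                for j in range(d)]
        return self

    def _component_logs(self, x):
        return [math.log(w) + log_gauss_diag(x, m, v)
                for w, m, v in zip(self.weights, self.means, self.vars)]

    def score(self, x):
        """Log-likelihood of x under the fitted mixture."""
        return logsumexp(self._component_logs(x))


def region_key(lat, lon, cell_deg=1.0):
    """Hypothetical spatial partitioning: a fixed latitude/longitude grid."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


class RegionalDetector:
    """One GMM per grid cell; low-likelihood observations are flagged."""

    def __init__(self, k=2, quantile=0.01):
        self.k, self.quantile = k, quantile
        self.models = {}

    def train(self, records):
        # records: iterable of (lat, lon, features) tuples.
        grouped = defaultdict(list)
        for lat, lon, features in records:
            grouped[region_key(lat, lon)].append(features)
        for key, points in grouped.items():
            model = DiagonalGMM(self.k).fit(points)
            scores = sorted(model.score(p) for p in points)
            # Assumed rule: threshold at a low quantile of training log-likelihoods.
            threshold = scores[int(self.quantile * len(scores))]
            self.models[key] = (model, threshold)

    def is_anomaly(self, lat, lon, features):
        # Raises KeyError if no model was trained for this cell.
        model, threshold = self.models[region_key(lat, lon)]
        return model.score(features) < threshold
```

Because each cell owns its own model, cells can be trained and queried independently, which is what makes a scheme of this kind amenable to distribution across nodes.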

Subject

big data
clustering
data streams
distributed system
online anomaly detection
time series analytics
