Automating investigative pattern detection using machine learning & graph pattern matching techniques
Date
2022
Authors
Muramudalige, Shashika R., author
Jayasumana, Anura P., advisor
Ray, Indrakshi, committee member
Kim, Ryan G., committee member
Wang, Haonan, committee member
Abstract
Identification and analysis of latent and emergent behavioral patterns are core tasks in investigative domains such as homeland security, counterterrorism, and crime prevention. Developing behavioral trajectory models associated with radicalization, and tracking individuals and groups based on such trajectories, are critical for law enforcement investigations, but these efforts are hampered by the sheer volume and nature of the data that need to be mined and processed. Dynamic and complex behaviors of extremists and extremist groups, missing or incomplete information, and the lack of intelligent tools further obstruct counterterrorism efforts. Our research is aimed at developing state-of-the-art computational tools that build on recent advances in machine learning, natural language processing (NLP), and graph databases. In this work, we address the challenges of investigative pattern detection by developing algorithms, tools, and techniques primarily aimed at behavioral pattern tracking and identification for domestic radicalization. The methods developed are integrated in a framework, Investigative Pattern Detection Framework for Counterterrorism (INSPECT). INSPECT includes components for extracting information using NLP techniques, storing the resulting information networks in appropriate databases that enable investigative graph searches, and synthesizing data via generative adversarial techniques to overcome limitations due to incomplete and sparse data. These components streamline investigative pattern detection while accommodating various use cases and datasets. Although our outcomes primarily benefit law enforcement and counterterrorism applications that counteract the threat of violent extremism, the results presented demonstrate that the proposed framework is adaptable to diverse behavioral pattern analysis domains such as consumer analytics, cybersecurity, and behavioral health.
Information on radicalization activity and participant profiles of interest to investigative tasks is mostly found in disparate text sources. We integrate NLP approaches such as named entity recognition (NER), coreference resolution, and multi-label text classification to extract structured information regarding behavioral indicators, temporal details, and other metadata. We further use multiple text pre-processing approaches to improve the accuracy of data extraction. Our training text datasets are intrinsically small and imbalanced across labels, which hinders direct application of NLP techniques. We apply transfer learning, adapting a pre-trained NLP model to our specific datasets, and achieve a noteworthy improvement in information extraction. The information extracted from text sources represents a rich knowledge network of populations with various types of connections that needs to be stored, updated, and repeatedly inspected for the emergence of patterns over the long term. Therefore, we use graph databases as the primary storage option while maintaining the reliability and scalability of behavioral data processing. To query suspicious and vulnerable individuals or groups, we implement investigative graph search algorithms as custom stored procedures on top of graph databases and verify their ability to operate at scale. We use datasets from different contexts to demonstrate the wide applicability of our investigative graph searches and their effectiveness in surfacing suspicious or latent trends. Investigative data by nature is incomplete and sparse, and the number of cases that may be used for training investigators or machine learning algorithms is small. This is an inherent concern in investigative and many other contexts where data collection is tedious and the available data is limited and may also be subject to privacy concerns.
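As a rough illustration of the kind of query an investigative graph search answers, the sketch below finds individuals who show at least a minimum number of indicators from a pattern within a time window, so that a partial match still surfaces when some information is missing. It is not the stored-procedure implementation described in the thesis: the in-memory graph, the names `build_graph` and `match_pattern`, the indicator labels, and the sample edges are all hypothetical.

```python
from collections import defaultdict

# Hypothetical edges: (person, indicator, year). Labels are illustrative only.
EDGES = [
    ("p1", "ind_a", 2010),
    ("p1", "ind_b", 2011),
    ("p1", "ind_c", 2012),
    ("p2", "ind_a", 2009),
    ("p2", "ind_c", 2015),
    ("p3", "ind_b", 2013),
]


def build_graph(edges):
    """Build an adjacency map: person -> list of (indicator, year) links."""
    graph = defaultdict(list)
    for person, indicator, year in edges:
        graph[person].append((indicator, year))
    return graph


def match_pattern(graph, pattern, min_matches, max_span_years):
    """Return {person: matched indicators} for partial pattern matches.

    A person matches when at least `min_matches` distinct indicators from
    `pattern` fall inside some window of `max_span_years` years.
    """
    matches = {}
    for person, links in graph.items():
        # Keep only links to indicators in the pattern, ordered by time.
        events = sorted((year, ind) for ind, year in links if ind in pattern)
        best = set()
        for start_year, _ in events:
            # Distinct indicators observed in the window starting here.
            window = {
                ind
                for year, ind in events
                if start_year <= year <= start_year + max_span_years
            }
            if len(window) > len(best):
                best = window
        if len(best) >= min_matches:
            matches[person] = sorted(best)
    return matches


if __name__ == "__main__":
    graph = build_graph(EDGES)
    print(match_pattern(graph, {"ind_a", "ind_b", "ind_c"}, 2, 3))
```

In a graph database the same question would typically be expressed as a query or stored procedure over person and indicator nodes; the window scan here only mirrors the matching logic at small scale.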
Large datasets help social scientists and investigative authorities enhance their skills and achieve greater accuracy and reliability. A sufficiently large volume of training data is also essential for applying the latest machine learning techniques to improve classification and detection. In this work, we propose a generative adversarial network (GAN) based approach with novel feature mapping techniques to synthesize additional data from a small and sparse dataset while preserving its statistical characteristics. We also compare our proposed method with two likelihood approaches, i.e., multivariate Gaussian and regular-vine copulas. We verify the robustness of the proposed technique via simulation and real-world datasets representing diverse domains. The proposed GAN-based data generation approach is applicable to other domains, as demonstrated with two applications. First, we extend our data generation approach to a computer security application, resulting in improved phishing website detection with synthesized datasets. We merge measured datasets with synthesized samples and re-train models to improve the performance of classification models and mitigate their vulnerability to adversarial samples. The second is a video traffic classification application in which the datasets are enhanced while preserving statistical similarity between the actual and synthesized datasets. For video traffic data generation, we modified our data generation technique to capture the temporal patterns in time series data. In this application, we integrate a Wasserstein GAN (WGAN) by using different snapshots of the same video signal with feature-mapping techniques. A trace splitting algorithm is presented for training data of video traces that exhibit higher data throughput with high bursts at the beginning of the video session compared to the rest of the session. With synthesized data, we obtain a 5-15% accuracy improvement for classification compared to using only actual traces.
The INSPECT framework is validated primarily by mining detailed forensic biographies of known jihadists, which are extensively used by social and political scientists. Additionally, each component in the framework is extensively validated with a Human-In-The-Loop (HITL) process, which improves the reliability and accuracy of machine learning models, investigative graph algorithms, and other computing tools based on feedback from social scientists. The entire framework is embedded in a modular architecture in which the analytical components are implemented independently and can be adjusted for different requirements and datasets. We verified the proposed framework's reliability, scalability, and generalizability with datasets in different domains. This research also makes a significant contribution to discrete and sparse data generation in diverse application domains with novel generative adversarial data synthesizing techniques.
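For readers unfamiliar with the multivariate Gaussian baseline that the GAN-based synthesizer is compared against, the sketch below fits a mean vector and covariance matrix to a small dataset and draws synthetic rows that preserve those first- and second-order statistics. It is a minimal standard-library illustration, not the thesis code: the function names, the sample rows, and the seed are assumptions, and it treats features as continuous, whereas the thesis targets discrete and sparse data.

```python
import math
import random


def fit_gaussian(rows):
    """Estimate the mean vector and sample covariance matrix of `rows`."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [
        [
            sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows) / (n - 1)
            for j in range(d)
        ]
        for i in range(d)
    ]
    return mean, cov


def cholesky(cov):
    """Lower-triangular L with L @ L.T == cov (cov must be positive definite)."""
    d = len(cov)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                # Raises ValueError if cov is not positive definite, which
                # can happen with very sparse or constant features.
                low[i][j] = math.sqrt(cov[i][i] - s)
            else:
                low[i][j] = (cov[i][j] - s) / low[j][j]
    return low


def synthesize(mean, cov, n_samples, seed=0):
    """Draw `n_samples` rows from N(mean, cov) using x = mean + L z."""
    rng = random.Random(seed)
    low = cholesky(cov)
    d = len(mean)
    samples = []
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        samples.append(
            [mean[i] + sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(d)]
        )
    return samples


if __name__ == "__main__":
    # Hypothetical two-feature dataset with strongly correlated columns.
    real = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2], [5.0, 9.8]]
    mean, cov = fit_gaussian(real)
    synthetic = synthesize(mean, cov, n_samples=1000)
    print(len(synthetic), "synthetic rows; fitted mean:", mean)
```

A Gaussian fit of this kind captures only means and pairwise covariances, which is one reason a baseline like it is a useful point of comparison for a generative model trained on sparse, discrete records.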
Subject
graph databases
investigative pattern detection
natural language processing
graph pattern matching
adversarial data generation
machine learning