On the use of locality aware distributed hash tables for homology searches over voluminous biological sequence data

Date

2015

Authors

Tolooee, Cameron, author
Pallickara, Sangmi, advisor
Ben-Hur, Asa, committee member
von Fischer, Joseph, committee member

Abstract

Rapid advances in genomic sequencing technology have resulted in a data deluge in biology and bioinformatics. This increase in data volumes has introduced computational challenges for frequently performed sequence analytics routines such as DNA and protein homology searches; ideally, these searches should also run in real time. This thesis proposes a scalable and similarity-aware distributed storage framework, Mendel, that enables retrieval of biologically significant DNA and protein alignments against a voluminous genomic sequence database. Mendel fragments the sequence data and generates an inverted index, which is then dispersed over a distributed collection of machines using a locality-aware distributed hash table. A novel distributed nearest neighbor search algorithm identifies sequence segments with high similarity and splices them together to form an alignment. This thesis includes an empirical evaluation of the performance, sensitivity, and scalability of the proposed system over NCBI's non-redundant protein dataset. In these benchmarks, Mendel demonstrates higher sensitivity and faster query evaluations when compared to other modern frameworks.
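
The fragment-and-index step mentioned in the abstract can be illustrated with a minimal single-machine sketch. This is not Mendel's implementation: the function names, the fixed window length, and the toy protein strings are all assumptions made for illustration, and the distributed hash table and nearest neighbor search are left out entirely.

```python
from collections import defaultdict


def fragment(sequence, length):
    """Yield (offset, segment) for every overlapping window of the given length."""
    for offset in range(len(sequence) - length + 1):
        yield offset, sequence[offset:offset + length]


def build_inverted_index(sequences, length):
    """Map each segment to the (sequence_id, offset) pairs where it occurs.

    Hypothetical helper: a real system would shard this index across machines.
    """
    index = defaultdict(list)
    for seq_id, sequence in sequences.items():
        for offset, segment in fragment(sequence, length):
            index[segment].append((seq_id, offset))
    return index


# Toy data, invented for this example.
sequences = {"seq1": "MKVLAAGIV", "seq2": "GIVKLAAGM"}
index = build_inverted_index(sequences, length=3)
hits = index["LAA"]  # every location where the segment "LAA" occurs
```

A lookup here only finds exact segment matches; a similarity-aware search would also have to retrieve segments that are close to the query segment but not identical.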

Subject

distributed system
homology search
sequence similarity search
