Repository logo
 

Modern considerations for the use of naive Bayes in the supervised classification of genetic sequence data

Date

2021

Authors

Lakin, Steven M., author
Abdo, Zaid, advisor
Rajopadhye, Sanjay, committee member
Stenglein, Mark, committee member
Stewart, Jane, committee member

Journal Title

Journal ISSN

Volume Title

Abstract

Genetic sequence classification is the task of assigning a known genetic label to an unknown genetic sequence. Often, this is the first step in genetic sequence analysis and is critical to understanding data produced by molecular techniques like high throughput sequencing. Here, we explore an algorithm called naive Bayes that was historically successful in classifying 16S ribosomal gene sequences for microbiome analysis. We extend the naive Bayes classifier to perform the task of general sequence classification by leveraging advancements in computational parallelism and the statistical distributions that underlie naive Bayes. In Chapter 2, we show that our implementation of naive Bayes, called WarpNL, performs within a margin of error of modern classifiers like Kraken2 and local alignment. We discuss five crucial aspects of genetic sequence classification and show how these areas affect classifier performance: the query data, the reference sequence database, the feature encoding method, the classification algorithm, and access to computational resources. In Chapter 3, we cover the critical computational advancements introduced in WarpNL that make it efficient in a modern computing framework. This includes efficient feature encoding, introduction of a log-odds ratio for comparison of naive Bayes posterior estimates, description of schema for parallel and distributed naive Bayes architectures, and use of machine learning classifiers to perform outgroup sequence classification. Finally in Chapter 4, we explore a variant of the Dirichlet multinomial distribution that underlies the naive Bayes likelihood, called the beta-Liouville multinomial. We show that the beta-Liouville multinomial can be used to enhance classifier performance, and we provide mathematical proofs regarding its convergence during maximum likelihood estimation. Overall, this work explores the naive Bayes algorithm in a modern context and shows that it is competitive for genetic sequence classification.

Description

Rights Access

Subject

genetic sequence classification
naive Bayes
supervised classification
genomics
bioinformatics
natural language processing

Citation

Associated Publications