Modern considerations for the use of naive Bayes in the supervised classification of genetic sequence data

Lakin, Steven M., author; Abdo, Zaid, advisor; Rajopadhye, Sanjay, committee member; Stenglein, Mark, committee member; Stewart, Jane, committee member

Files

Lakin_colostate_0053A_16504.pdf (2.61 MB)

Date

2021

Abstract

Genetic sequence classification is the task of assigning a known genetic label to an unknown genetic sequence. Often, this is the first step in genetic sequence analysis and is critical to understanding data produced by molecular techniques like high-throughput sequencing. Here, we explore an algorithm called naive Bayes that was historically successful in classifying 16S ribosomal gene sequences for microbiome analysis. We extend the naive Bayes classifier to perform the task of general sequence classification by leveraging advancements in computational parallelism and the statistical distributions that underlie naive Bayes. In Chapter 2, we show that our implementation of naive Bayes, called WarpNL, performs within a margin of error of modern classifiers like Kraken2 and local alignment. We discuss five crucial aspects of genetic sequence classification and show how each affects classifier performance: the query data, the reference sequence database, the feature encoding method, the classification algorithm, and access to computational resources. In Chapter 3, we cover the critical computational advancements introduced in WarpNL that make it efficient in a modern computing framework. These include efficient feature encoding, the introduction of a log-odds ratio for comparing naive Bayes posterior estimates, a description of schemas for parallel and distributed naive Bayes architectures, and the use of machine learning classifiers to perform outgroup sequence classification. Finally, in Chapter 4, we explore a variant of the Dirichlet multinomial distribution that underlies the naive Bayes likelihood, called the beta-Liouville multinomial. We show that the beta-Liouville multinomial can be used to enhance classifier performance, and we provide mathematical proofs regarding its convergence during maximum likelihood estimation. Overall, this work explores the naive Bayes algorithm in a modern context and shows that it is competitive for genetic sequence classification.
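As a rough illustration of the ideas named in the abstract, the following is a minimal sketch of a k-mer naive Bayes classifier that reports a log-odds ratio between its two best-scoring classes. It is not WarpNL and does not reproduce the dissertation's method: the k-mer length, the add-alpha (symmetric Dirichlet) smoothing, the uniform class prior, the function names, and the toy reference sequences are all assumptions made for this example.

```python
import math
from collections import Counter
from itertools import product

K = 3  # hypothetical k-mer length, chosen only to keep the example small
ALPHABET = "ACGT"
VOCAB = ["".join(p) for p in product(ALPHABET, repeat=K)]

def kmer_counts(seq, k=K):
    """Count overlapping k-mers, skipping any that contain non-ACGT symbols."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if all(c in ALPHABET for c in kmer):
            counts[kmer] += 1
    return counts

def train(labeled_seqs, alpha=1.0):
    """Per-class log k-mer probabilities with add-alpha smoothing (an assumption)."""
    totals = {}
    for label, seq in labeled_seqs:
        totals.setdefault(label, Counter()).update(kmer_counts(seq))
    model = {}
    for label, counts in totals.items():
        denom = sum(counts.values()) + alpha * len(VOCAB)
        model[label] = {kmer: math.log((counts[kmer] + alpha) / denom) for kmer in VOCAB}
    return model

def classify(model, query):
    """Return (best_label, log_odds); assumes a uniform prior and at least two classes."""
    q = kmer_counts(query)
    scores = {
        label: sum(n * logp[kmer] for kmer, n in q.items())
        for label, logp in model.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    # Log-odds: how much more likely the query is under the best class
    # than under the runner-up, on the natural-log scale.
    return best, scores[best] - scores[runner_up]

# Toy reference database; labels and sequences are invented for illustration.
reference = [("gene_A", "ACGTACGTACGTACGT"), ("gene_B", "TTTTGGGGTTTTGGGG")]
model = train(reference)
label, log_odds = classify(model, "ACGTACG")
```

A log-odds value near zero would suggest the two top classes are hard to tell apart for that query, which is one plausible reason to compare posterior estimates on this scale rather than reporting only the top label.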

Subject

genetic sequence classification

naive Bayes

supervised classification

genomics

bioinformatics

natural language processing

URI

https://hdl.handle.net/10217/232600

Collections

2020-
Theses and Dissertations
