Modern considerations for the use of naive Bayes in the supervised classification of genetic sequence data

Lakin, Steven M., author; Abdo, Zaid, advisor; Rajopadhye, Sanjay, committee member; Stenglein, Mark, committee member; Stewart, Jane, committee member

dc.contributor.author	Lakin, Steven M., author
dc.contributor.author	Abdo, Zaid, advisor
dc.contributor.author	Rajopadhye, Sanjay, committee member
dc.contributor.author	Stenglein, Mark, committee member
dc.contributor.author	Stewart, Jane, committee member
dc.date.accessioned	2021-06-07T10:21:07Z
dc.date.available	2021-06-07T10:21:07Z
dc.date.issued	2021
dc.description.abstract	Genetic sequence classification is the task of assigning a known genetic label to an unknown genetic sequence. Often, this is the first step in genetic sequence analysis and is critical to understanding data produced by molecular techniques like high-throughput sequencing. Here, we explore an algorithm called naive Bayes that was historically successful in classifying 16S ribosomal gene sequences for microbiome analysis. We extend the naive Bayes classifier to perform the task of general sequence classification by leveraging advancements in computational parallelism and the statistical distributions that underlie naive Bayes. In Chapter 2, we show that our implementation of naive Bayes, called WarpNL, performs within a margin of error of modern classifiers like Kraken2 and local alignment. We discuss five crucial aspects of genetic sequence classification and show how these areas affect classifier performance: the query data, the reference sequence database, the feature encoding method, the classification algorithm, and access to computational resources. In Chapter 3, we cover the critical computational advancements introduced in WarpNL that make it efficient in a modern computing framework. These include efficient feature encoding, the introduction of a log-odds ratio for comparing naive Bayes posterior estimates, a description of schemas for parallel and distributed naive Bayes architectures, and the use of machine learning classifiers to perform outgroup sequence classification. Finally, in Chapter 4, we explore a variant of the Dirichlet multinomial distribution that underlies the naive Bayes likelihood, called the beta-Liouville multinomial. We show that the beta-Liouville multinomial can be used to enhance classifier performance, and we provide mathematical proofs regarding its convergence during maximum likelihood estimation.
Overall, this work explores the naive Bayes algorithm in a modern context and shows that it is competitive for genetic sequence classification.
dc.format.medium	born digital
dc.format.medium	doctoral dissertations
dc.identifier	Lakin_colostate_0053A_16504.pdf
dc.identifier.uri	https://hdl.handle.net/10217/232600
dc.identifier.uri	https://doi.org/10.25675/3.05015
dc.language	English
dc.language.iso	eng
dc.publisher	Colorado State University. Libraries
dc.relation.ispartof	2020-
dc.rights	Copyright and other restrictions may apply. User is responsible for compliance with all applicable laws. For information about copyright law, please see https://libguides.colostate.edu/copyright.
dc.subject	genetic sequence classification
dc.subject	naive Bayes
dc.subject	supervised classification
dc.subject	genomics
dc.subject	bioinformatics
dc.subject	natural language processing
dc.title	Modern considerations for the use of naive Bayes in the supervised classification of genetic sequence data
dc.type	Text
dcterms.rights.dpla	This Item is protected by copyright and/or related rights (https://rightsstatements.org/vocab/InC/1.0/). You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).
thesis.degree.discipline	Microbiology, Immunology, and Pathology
thesis.degree.grantor	Colorado State University
thesis.degree.level	Doctoral
thesis.degree.name	Doctor of Philosophy (Ph.D.)

Files

Original bundle

Name: Lakin_colostate_0053A_16504.pdf
Size: 2.61 MB
Format: Adobe Portable Document Format
