Modern considerations for the use of naive Bayes in the supervised classification of genetic sequence data

dc.contributor.author: Lakin, Steven M., author
dc.contributor.author: Abdo, Zaid, advisor
dc.contributor.author: Rajopadhye, Sanjay, committee member
dc.contributor.author: Stenglein, Mark, committee member
dc.contributor.author: Stewart, Jane, committee member
dc.date.accessioned: 2021-06-07T10:21:07Z
dc.date.available: 2021-06-07T10:21:07Z
dc.date.issued: 2021
dc.description.abstract: Genetic sequence classification is the task of assigning a known genetic label to an unknown genetic sequence. Often, this is the first step in genetic sequence analysis and is critical to understanding data produced by molecular techniques like high-throughput sequencing. Here, we explore an algorithm called naive Bayes that was historically successful in classifying 16S ribosomal gene sequences for microbiome analysis. We extend the naive Bayes classifier to perform the task of general sequence classification by leveraging advancements in computational parallelism and the statistical distributions that underlie naive Bayes. In Chapter 2, we show that our implementation of naive Bayes, called WarpNL, performs within a margin of error of modern classifiers like Kraken2 and local alignment. We discuss five crucial aspects of genetic sequence classification and show how these areas affect classifier performance: the query data, the reference sequence database, the feature encoding method, the classification algorithm, and access to computational resources. In Chapter 3, we cover the critical computational advancements introduced in WarpNL that make it efficient in a modern computing framework. These include efficient feature encoding, introduction of a log-odds ratio for comparison of naive Bayes posterior estimates, description of schema for parallel and distributed naive Bayes architectures, and use of machine learning classifiers to perform outgroup sequence classification. Finally, in Chapter 4, we explore a variant of the Dirichlet multinomial distribution that underlies the naive Bayes likelihood, called the beta-Liouville multinomial. We show that the beta-Liouville multinomial can be used to enhance classifier performance, and we provide mathematical proofs regarding its convergence during maximum likelihood estimation.
Overall, this work explores the naive Bayes algorithm in a modern context and shows that it is competitive for genetic sequence classification.
dc.format.medium: born digital
dc.format.medium: doctoral dissertations
dc.identifier: Lakin_colostate_0053A_16504.pdf
dc.identifier.uri: https://hdl.handle.net/10217/232600
dc.language: English
dc.language.iso: eng
dc.publisher: Colorado State University. Libraries
dc.relation.ispartof: 2020-
dc.rights: Copyright and other restrictions may apply. User is responsible for compliance with all applicable laws. For information about copyright law, please see https://libguides.colostate.edu/copyright.
dc.subject: genetic sequence classification
dc.subject: naive Bayes
dc.subject: supervised classification
dc.subject: genomics
dc.subject: bioinformatics
dc.subject: natural language processing
dc.title: Modern considerations for the use of naive Bayes in the supervised classification of genetic sequence data
dc.type: Text
dcterms.rights.dpla: This Item is protected by copyright and/or related rights (https://rightsstatements.org/vocab/InC/1.0/). You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).
thesis.degree.discipline: Microbiology, Immunology, and Pathology
thesis.degree.grantor: Colorado State University
thesis.degree.level: Doctoral
thesis.degree.name: Doctor of Philosophy (Ph.D.)
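
Illustrative note on the abstract: the abstract refers to feature encoding of sequences, a naive Bayes likelihood built on a Dirichlet multinomial, and a log-odds ratio for comparing posterior estimates. The sketch below shows one way these ideas might fit together. It is not taken from the dissertation and is not WarpNL's implementation. The k-mer length `K`, the additive smoothing constant `alpha`, the uniform class prior, the function names, and the definition of the log-odds as the score gap between the two best classes are all assumptions made only for illustration.

```python
import math
from collections import Counter
from itertools import product

K = 3  # hypothetical k-mer length, chosen only for illustration
ALPHABET = "ACGT"
VOCAB = ["".join(p) for p in product(ALPHABET, repeat=K)]


def kmer_counts(seq, k=K):
    """Count overlapping k-mers, skipping any that contain non-ACGT symbols."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if all(c in ALPHABET for c in kmer):
            counts[kmer] += 1
    return counts


def train(labeled_seqs, alpha=1.0):
    """Return per-class log-probabilities of each k-mer.

    Additive smoothing with `alpha` (an assumed value) corresponds to a
    symmetric Dirichlet prior on the multinomial k-mer frequencies.
    """
    totals = {}
    for label, seq in labeled_seqs:
        totals.setdefault(label, Counter()).update(kmer_counts(seq))
    model = {}
    for label, counts in totals.items():
        denom = sum(counts.values()) + alpha * len(VOCAB)
        model[label] = {m: math.log((counts[m] + alpha) / denom) for m in VOCAB}
    return model


def classify(model, query):
    """Return (best_label, log_odds).

    With a uniform class prior (assumed here), the log-odds between the two
    best classes equals the difference of their log-likelihoods.
    """
    counts = kmer_counts(query)
    scores = {
        label: sum(n * logp[m] for m, n in counts.items())
        for label, logp in model.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    best = ranked[0]
    log_odds = scores[best] - scores[ranked[1]] if len(ranked) > 1 else math.inf
    return best, log_odds
```

A call such as `classify(train(reference), query)` returns the best-scoring label together with a margin; a small margin indicates that the two leading classes explain the query almost equally well under this toy model.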

Files

Original bundle
Name: Lakin_colostate_0053A_16504.pdf
Size: 2.61 MB
Format: Adobe Portable Document Format