Leveraging ensembles: balancing timeliness and accuracy for model training over voluminous datasets

dc.contributor.author: Budgaga, Walid, author
dc.contributor.author: Pallickara, Shrideep, advisor
dc.contributor.author: Pallickara, Sangmi Lee, advisor
dc.contributor.author: Ben-Hur, Asa, committee member
dc.contributor.author: Breidt, F. Jay, committee member
dc.date.accessioned: 2020-06-22T11:54:03Z
dc.date.available: 2020-06-22T11:54:03Z
dc.date.issued: 2020
dc.description.abstract: As data volumes increase, there is a pressing need to make sense of the data in a timely fashion. Voluminous datasets are often high dimensional, with individual data points representing a vector of features. Data scientists fit models to the data—using all features or a subset thereof—and then use these models to inform their understanding of phenomena or make predictions. The performance of these analytical models is assessed based on their accuracy and ability to generalize to unseen data. Several existing frameworks can be used for drawing insights from voluminous datasets. However, these frameworks suffer from inefficiencies, including limited scalability, limited applicability beyond a target domain, prolonged training times, poor resource utilization, and insufficient support for combining diverse model fitting algorithms. In this dissertation, we describe our methodology for scalable supervised learning over voluminous datasets. The methodology explores partitioning the feature space, building models over these partitioned subsets of the data, and the impact of doing so on training times and accuracy. Using our methodology, a practitioner can harness a mix of learning methods to build diverse models over the partitioned data. Rather than build a single, all-encompassing model, we construct an ensemble of models trained independently over different portions of the dataset. In particular, we rely on concurrent and independent learning from different portions of the data space to overcome the resource utilization and completion time issues associated with distributed training of a single model over the entire dataset. Our empirical benchmarks are performed using datasets from diverse domains, including epidemiology, music, and weather. These benchmarks demonstrate that our methodology reduces training times while preserving accuracy relative to a complex model trained on the entire dataset. In particular, our methodology uses resources effectively, amortizing I/O and CPU costs across a distributed environment while significantly reducing network traffic during training.
dc.format.medium: born digital
dc.format.medium: doctoral dissertations
dc.identifier: Budgaga_colostate_0053A_16069.pdf
dc.identifier.uri: https://hdl.handle.net/10217/208601
dc.language: English
dc.language.iso: eng
dc.publisher: Colorado State University. Libraries
dc.relation.ispartof: 2020-
dc.rights: Copyright and other restrictions may apply. User is responsible for compliance with all applicable laws. For information about copyright law, please see https://libguides.colostate.edu/copyright.
dc.subject: model ensembles
dc.subject: distributed systems
dc.subject: scalable learning
dc.title: Leveraging ensembles: balancing timeliness and accuracy for model training over voluminous datasets
dc.type: Text
dcterms.rights.dpla: This Item is protected by copyright and/or related rights (https://rightsstatements.org/vocab/InC/1.0/). You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).
thesis.degree.discipline: Computer Science
thesis.degree.grantor: Colorado State University
thesis.degree.level: Doctoral
thesis.degree.name: Doctor of Philosophy (Ph.D.)
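
The partitioned-ensemble training scheme outlined in the abstract can be illustrated in miniature. The sketch below is an assumption-laden illustration, not the dissertation's actual system: it partitions a synthetic dataset with k-means, fits an independent scikit-learn model per partition concurrently, and routes each query point to the model that owns its partition. The partitioning scheme, the model mix, and the helper fit_partition are hypothetical stand-ins for the methods developed in the dissertation.

    # Minimal sketch, assuming k-means partitioning and a per-partition
    # model mix; all names here are illustrative, not from the dissertation.
    from concurrent.futures import ProcessPoolExecutor

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import Ridge

    def fit_partition(args):
        """Train one model, independently, on a single data partition."""
        model, X_part, y_part = args
        return model.fit(X_part, y_part)

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        X = rng.normal(size=(10_000, 8))   # stand-in "voluminous" dataset
        y = X @ rng.normal(size=8) + rng.normal(scale=0.1, size=10_000)

        # Partition the data space (here: k-means over the feature vectors).
        k = 4
        partitioner = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        labels = partitioner.labels_

        # A mix of learning methods, one per partition, trained concurrently
        # rather than fitting a single all-encompassing model.
        models = [Ridge(),
                  RandomForestRegressor(n_estimators=50, random_state=0),
                  Ridge(alpha=10.0),
                  RandomForestRegressor(n_estimators=50, random_state=1)]
        jobs = [(models[p], X[labels == p], y[labels == p]) for p in range(k)]
        with ProcessPoolExecutor() as pool:
            fitted = list(pool.map(fit_partition, jobs))

        # At query time, route each point to the model owning its partition.
        X_new = rng.normal(size=(5, 8))
        parts = partitioner.predict(X_new)
        preds = np.array([fitted[p].predict(x.reshape(1, -1))[0]
                          for p, x in zip(parts, X_new)])
        print(preds)

Because each partition is fitted in its own process against only its own slice of the data, the training phases run independently, which is the property the abstract leans on to reduce completion times and network traffic relative to distributed training of one monolithic model.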

Files

Original bundle
Name: Budgaga_colostate_0053A_16069.pdf
Size: 3.87 MB
Format: Adobe Portable Document Format