Leveraging ensembles: balancing timeliness and accuracy for model training over voluminous datasets

dc.contributor.author: Budgaga, Walid, author
dc.contributor.author: Pallickara, Shrideep, advisor
dc.contributor.author: Pallickara, Sangmi Lee, advisor
dc.contributor.author: Ben-Hur, Asa, committee member
dc.contributor.author: Breidt, F. Jay, committee member
dc.date.accessioned: 2020-06-22T11:54:03Z
dc.date.available: 2020-06-22T11:54:03Z
dc.date.issued: 2020
dc.description.abstract: As data volumes increase, there is a pressing need to make sense of the data in a timely fashion. Voluminous datasets are often high dimensional, with individual data points representing a vector of features. Data scientists fit models to the data—using all features or a subset thereof—and then use these models to inform their understanding of phenomena or make predictions. The performance of these analytical models is assessed based on their accuracy and ability to generalize to unseen data. Several existing frameworks can be used for drawing insights from voluminous datasets. However, these frameworks suffer from inefficiencies, including limited scalability, limited applicability beyond a target domain, prolonged training times, poor resource utilization, and insufficient support for combining diverse model fitting algorithms. In this dissertation, we describe our methodology for scalable supervised learning over voluminous datasets. The methodology explores partitioning the feature space, building models over these partitioned subsets of the data, and the impact of doing so on training times and accuracy. Using our methodology, a practitioner can harness a mix of learning methods to build diverse models over the partitioned data. Rather than build a single, all-encompassing model, we construct an ensemble of models trained independently over different portions of the dataset. In particular, we rely on concurrent and independent learning from different portions of the data space to overcome the resource utilization and completion time issues associated with distributed training of a single model over the entire dataset. Our empirical benchmarks are performed using datasets from diverse domains, including epidemiology, music, and weather. These benchmarks demonstrate that our methodology reduces training times while preserving accuracy relative to a complex model trained on the entire dataset. In particular, our methodology uses resources effectively, amortizing I/O and CPU costs across a distributed environment while significantly reducing network traffic during training.
dc.format.medium: born digital
dc.format.medium: doctoral dissertations
dc.identifier: Budgaga_colostate_0053A_16069.pdf
dc.identifier.uri: https://hdl.handle.net/10217/208601
dc.language: English
dc.language.iso: eng
dc.publisher: Colorado State University. Libraries
dc.relation.ispartof: 2020-
dc.rights: Copyright and other restrictions may apply. User is responsible for compliance with all applicable laws. For information about copyright law, please see https://libguides.colostate.edu/copyright.
dc.subject: model ensembles
dc.subject: distributed systems
dc.subject: scalable learning
dc.title: Leveraging ensembles: balancing timeliness and accuracy for model training over voluminous datasets
dc.type: Text
dcterms.rights.dpla: This Item is protected by copyright and/or related rights (https://rightsstatements.org/vocab/InC/1.0/). You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).
thesis.degree.discipline: Computer Science
thesis.degree.grantor: Colorado State University
thesis.degree.level: Doctoral
thesis.degree.name: Doctor of Philosophy (Ph.D.)
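
The partitioned-ensemble training scheme outlined in the abstract can be illustrated in miniature. The sketch below is an assumption-laden illustration, not the dissertation's actual system: it partitions a synthetic dataset with k-means, fits an independent scikit-learn model per partition concurrently, and routes each query point to the model that owns its partition. The partitioning scheme, the model mix, and the helper fit_partition are hypothetical stand-ins for the methods developed in the dissertation.

    # Minimal sketch, assuming k-means partitioning and a per-partition
    # model mix; all names here are illustrative, not from the dissertation.
    from concurrent.futures import ProcessPoolExecutor

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import Ridge

    def fit_partition(args):
        """Train one model, independently, on a single data partition."""
        model, X_part, y_part = args
        return model.fit(X_part, y_part)

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        X = rng.normal(size=(10_000, 8))   # stand-in "voluminous" dataset
        y = X @ rng.normal(size=8) + rng.normal(scale=0.1, size=10_000)

        # Partition the data space (here: k-means over the feature vectors).
        k = 4
        partitioner = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        labels = partitioner.labels_

        # A mix of learning methods, one per partition, trained concurrently
        # rather than fitting a single all-encompassing model.
        models = [Ridge(),
                  RandomForestRegressor(n_estimators=50, random_state=0),
                  Ridge(alpha=10.0),
                  RandomForestRegressor(n_estimators=50, random_state=1)]
        jobs = [(models[p], X[labels == p], y[labels == p]) for p in range(k)]
        with ProcessPoolExecutor() as pool:
            fitted = list(pool.map(fit_partition, jobs))

        # At query time, route each point to the model owning its partition.
        X_new = rng.normal(size=(5, 8))
        parts = partitioner.predict(X_new)
        preds = np.array([fitted[p].predict(x.reshape(1, -1))[0]
                          for p, x in zip(parts, X_new)])
        print(preds)

Because each partition is fitted in its own process against only its own slice of the data, the training phases run independently, which is the property the abstract leans on to reduce completion times and network traffic relative to distributed training of one monolithic model.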

Files

Original bundle
Name: Budgaga_colostate_0053A_16069.pdf
Size: 3.87 MB
Format: Adobe Portable Document Format