Distributed systems in small scale research environments: Hadoop and the EM algorithm

dc.contributor.author: Remington, Jason Michael, author
dc.contributor.author: Draper, Bruce A. (Bruce Austin), 1962-, advisor
dc.contributor.author: Böhm, Wim, advisor
dc.contributor.author: Burns, Patrick J., committee member
dc.date.accessioned: 2007-01-03T04:58:22Z
dc.date.available: 2007-01-03T04:58:22Z
dc.date.issued: 2011
dc.description.abstract: Distributed systems are widely used in large-scale high-performance computing environments and often conjure visions of enormous data centers full of thousands of networked machines working together. Smaller research environments may not have access to such a data center, yet many jobs in these environments may still take weeks or longer to complete. Systems that work well on hundreds or thousands of machines with terabyte and larger data sets may not scale down to small environments with a couple dozen machines and gigabyte data sets. This research evaluates the viability of one such system in a small research environment in order to determine what issues arise when scaling down to such an environment. Specifically, we use Hadoop to implement the Expectation Maximization algorithm, which is iterative, stateful, inherently parallel, and computationally expensive. We find that the lack of support for modeling data dependencies between records results in large amounts of network traffic, and that the lack of support for iterative Map/Reduce magnifies the overhead on jobs that require multiple iterations. These results expose key issues that need to be addressed for the distributed system to perform well in a small research environment.
dc.format.medium: born digital
dc.format.medium: masters theses
dc.identifier: Remington_colostate_0053N_10606.pdf
dc.identifier.uri: http://hdl.handle.net/10217/46744
dc.language: English
dc.language.iso: eng
dc.publisher: Colorado State University. Libraries
dc.relation.ispartof: 2000-2019
dc.rights: Copyright and other restrictions may apply. User is responsible for compliance with all applicable laws. For information about copyright law, please see https://libguides.colostate.edu/copyright.
dc.subject: small cluster
dc.subject: distributed systems
dc.subject: EM
dc.subject: expectation maximization
dc.subject: Hadoop
dc.title: Distributed systems in small scale research environments: Hadoop and the EM algorithm
dc.type: Text
dcterms.rights.dpla: This Item is protected by copyright and/or related rights (https://rightsstatements.org/vocab/InC/1.0/). You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).
thesis.degree.discipline: Computer Science
thesis.degree.grantor: Colorado State University
thesis.degree.level: Masters
thesis.degree.name: Master of Science (M.S.)
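
The abstract characterizes Expectation Maximization as iterative and inherently parallel, and notes that Map/Reduce has no built-in support for iteration. The sketch below is not taken from the thesis and does not use the Hadoop API. It is a minimal pure-Python illustration, assuming a one-dimensional Gaussian mixture, of how one EM iteration can be phrased as a map step (per-record statistics) followed by a reduce step (summing them). All function names (map_record, reduce_stats, m_step, run_em) are hypothetical.

```python
import math
from functools import reduce


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def map_record(x, params):
    """E-step for a single record (the "map" phase).

    params is a list of (weight, mean, variance) tuples, one per component.
    Returns, per component, the tuple (r, r*x, r*x*x), where r is the
    responsibility of that component for x.
    """
    weighted = [w * gaussian_pdf(x, m, v) for (w, m, v) in params]
    total = sum(weighted)
    return [(p / total, p / total * x, p / total * x * x) for p in weighted]


def reduce_stats(a, b):
    """Sum two lists of per-component statistics (the "reduce" phase)."""
    return [(a0 + b0, a1 + b1, a2 + b2)
            for (a0, a1, a2), (b0, b1, b2) in zip(a, b)]


def m_step(stats, n):
    """Turn summed statistics into new (weight, mean, variance) tuples."""
    params = []
    for (r, rx, rxx) in stats:
        mean = rx / r
        var = max(rxx / r - mean * mean, 1e-6)  # floor keeps variance positive
        params.append((r / n, mean, var))
    return params


def run_em(data, params, iterations=20):
    """Driver loop: every iteration is one complete map/reduce pass."""
    for _ in range(iterations):
        mapped = [map_record(x, params) for x in data]
        stats = reduce(reduce_stats, mapped)
        params = m_step(stats, len(data))
    return params


if __name__ == "__main__":
    data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
    initial = [(0.5, 0.0, 1.0), (0.5, 6.0, 1.0)]
    for weight, mean, var in run_em(data, initial):
        print(f"weight={weight:.3f} mean={mean:.3f} var={var:.5f}")
```

In this sketch the driver loop has to launch a fresh map/reduce pass and hand the updated parameters to every mapper on each iteration. On a real cluster each pass would presumably be a separate job, which is consistent with the per-iteration overhead the abstract describes.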

Files

Original bundle
Name: Remington_colostate_0053N_10606.pdf
Size: 371.56 KB
Format: Adobe Portable Document Format