Distributed systems in small scale research environments: Hadoop and the EM algorithm

Remington, Jason Michael, author; Draper, Bruce A. (Bruce Austin), 1962-, advisor; Böhm, Wim, advisor; Burns, Patrick J., committee member

Files

Remington_colostate_0053N_10606.pdf (371.56 KB)

Date

2011

Authors

Remington, Jason Michael, author

Draper, Bruce A. (Bruce Austin), 1962-, advisor

Böhm, Wim, advisor

Burns, Patrick J., committee member

Abstract

Distributed systems are widely used in large-scale high-performance computing environments and often conjure visions of enormous data centers full of thousands of networked machines working together. Smaller research environments may not have access to such a data center, yet many jobs in these environments can still take weeks or longer to complete. Systems that work well on hundreds or thousands of machines with terabyte and larger data sets may not scale down to small environments with a couple dozen machines and gigabyte data sets. This research evaluates the viability of one such system, Hadoop, in a small research environment and identifies the issues that arise when scaling down. Specifically, we use Hadoop to implement the Expectation Maximization (EM) algorithm, which is iterative, stateful, inherently parallel, and computationally expensive. We find that the lack of support for modeling data dependencies between records results in large amounts of network traffic, and that the lack of support for iterative Map/Reduce magnifies the overhead on jobs that require multiple iterations. These results expose key issues that need to be addressed for the distributed system to perform well in a small research environment.
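To make the iterative structure described in the abstract concrete, the following is a minimal single-process sketch of how EM can be phrased as repeated map and reduce passes. It is not the thesis implementation and does not use Hadoop: the one-dimensional Gaussian mixture model, the function names (`map_record`, `reduce_component`, `run_em`), and the synthetic data are all assumptions made for illustration. Each pass of the driver loop stands in for one full Map/Reduce job, which is where per-job overhead would be paid again on every iteration.

```python
import math
import random
from collections import defaultdict


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def map_record(x, params):
    """Map phase (E-step) for one record.

    Emits (component index, sufficient statistics) pairs, where the
    statistics are the responsibility r, r * x, and r * x * x.
    """
    weighted = [w * gaussian_pdf(x, m, v) for (w, m, v) in params]
    total = sum(weighted)
    for k, p in enumerate(weighted):
        r = p / total
        yield k, (r, r * x, r * x * x)


def reduce_component(values, n_records):
    """Reduce phase (M-step) for one component.

    Sums the emitted statistics and re-estimates (weight, mean, variance).
    """
    s0 = sum(v[0] for v in values)
    s1 = sum(v[1] for v in values)
    s2 = sum(v[2] for v in values)
    mean = s1 / s0
    var = max(s2 / s0 - mean * mean, 1e-6)  # floor keeps the variance positive
    return (s0 / n_records, mean, var)


def run_em(data, params, iterations):
    """Driver loop: every iteration is a separate map pass and reduce pass."""
    for _ in range(iterations):
        grouped = defaultdict(list)  # stands in for the shuffle between phases
        for x in data:
            for k, stats in map_record(x, params):
                grouped[k].append(stats)
        params = [reduce_component(grouped[k], len(data)) for k in range(len(params))]
    return params


# Hypothetical data: two well-separated clusters around 0 and 10.
random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(200)]
data += [random.gauss(10.0, 1.0) for _ in range(200)]

# Initial guesses as (weight, mean, variance) per component.
initial = [(0.5, 1.0, 4.0), (0.5, 8.0, 4.0)]
fitted = run_em(data, initial, iterations=20)
```

In this sketch the current parameters must reach every map call and the full data set is re-read on every iteration. On a framework without built-in support for iteration, that state would presumably have to be redistributed and the job relaunched each time, which is the kind of repeated overhead the abstract refers to.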

Subject

small cluster

distributed systems

EM

expectation maximization

Hadoop

URI

http://hdl.handle.net/10217/46744

Collections

2000-2019
Theses and Dissertations
