Evaluating cluster quality for visual data

Date

2013

Authors

Wigness, Maggie, author
Draper, Bruce, advisor
Beveridge, Ross, committee member
Howe, Adele, committee member
Peterson, Chris, committee member

Abstract

Digital video cameras have made it easy to collect large amounts of unlabeled data that can be used to learn to recognize objects and actions. Collecting ground-truth labels for this data, however, is a much more time-consuming task that requires human intervention. One approach to train on this data while keeping the human workload to a minimum is to cluster the unlabeled samples, evaluate the quality of the clusters, and then ask a human annotator to label only the clusters believed to be dominated by a single object/action class. This thesis addresses the task of evaluating the quality of unlabeled image clusters. We compare four cluster quality measures (and a baseline method) using real-world and synthetic data sets. Three of these measures come from the existing data mining literature: the Dunn Index, the Davies-Bouldin Index, and Silhouette Width. As the fourth measure, we introduce a novel cluster quality measure called Proximity Forest Connectivity (PFC), derived from recent advances in approximate nearest neighbor algorithms in the computer vision literature. Experiments on real-world data show that no cluster quality measure performs "best" on all data sets; however, our novel PFC measure is always competitive and achieves more top performances than any of the other measures. Results from synthetic data experiments show that while the data mining measures are susceptible to the over-clustering typically required of visual data, PFC is much more robust. Further synthetic data experiments modeling features of visual data show that Davies-Bouldin is most robust to large amounts of class-specific noise. However, Davies-Bouldin, Silhouette, and PFC all perform well in the presence of data with small amounts of class-specific noise, whereas Dunn struggles to perform better than random.
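
The abstract names three standard data mining measures. The sketch below is not taken from the thesis; it is a minimal illustration, assuming scikit-learn and SciPy are available, of how Silhouette Width, the Davies-Bouldin Index, and a basic Dunn Index might be computed on clustered feature vectors. The thesis's novel PFC measure depends on proximity forests and is not reproduced here.

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

def dunn_index(X, labels):
    # Dunn Index: smallest inter-cluster distance divided by largest
    # intra-cluster diameter (higher is better).
    clusters = [X[labels == k] for k in np.unique(labels)]
    max_diameter = max(cdist(c, c).max() for c in clusters)
    min_separation = min(
        cdist(a, b).min()
        for i, a in enumerate(clusters)
        for b in clusters[i + 1:]
    )
    return min_separation / max_diameter

# Synthetic stand-in for unlabeled image/action feature vectors.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 16)) for c in (0, 3, 6)])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print("Silhouette Width :", silhouette_score(X, labels))      # higher is better
print("Davies-Bouldin   :", davies_bouldin_score(X, labels))  # lower is better
print("Dunn Index       :", dunn_index(X, labels))            # higher is better

In a labeling workflow like the one described above, scores such as these could be used to rank clusters so that only the highest-quality (most likely single-class) clusters are passed to a human annotator.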

Subject

cluster quality measures
image clustering
computer vision
