
Some topics on model-based clustering

dc.contributor.author: Wang, Lulu, author
dc.contributor.author: Hoeting, Jennifer, advisor
dc.contributor.author: Zhou, Wen, advisor
dc.contributor.author: Wang, Haonan, committee member
dc.contributor.author: Laituri, Melinda, committee member
dc.date.accessioned: 2017-01-04T22:59:20Z
dc.date.available: 2018-12-30T22:59:20Z
dc.date.issued: 2016
dc.description.abstract: Cluster analysis is widely applied in various areas. Model-based clustering, which assumes a mixture model, is one of the most useful approaches to clustering. Using model-based clustering, we can make statistical inferences and obtain uncertainty estimates for parameters or clustering assignments. Traditional model-based clustering methods often assume a Gaussian mixture model, which may perform poorly in real applications, for example on heavy-tailed data. Several nonparametric and semiparametric mixture models, which assume that the variables are independent to ensure parameter identifiability, have been studied in past years. In this dissertation, we propose two new methods for model-based clustering. The first method, semiparametric model-based clustering (SPM-clust), is based on a nonparanormal distribution for each cluster. The method accounts for correlations between variables while maintaining parameter identifiability under mild assumptions. By modeling the dependence between variables and relaxing the normality assumption, the proposed method is shown via simulations to perform better than commonly used clustering methods, especially on non-Gaussian data. The second method is particularly useful for clustering high-dimensional data. The classical mixture model approach cannot cluster high-dimensional data due to the curse of dimensionality. Moreover, identifying important variables for separating unlabeled observations into homogeneous groups plays a critical role in dimension reduction and in modeling data with complex structures. This problem is directly related to selecting informative variables in cluster analysis, where a small fraction of variables is identified for separating observed variable vectors X_i ∈ R^p, i = 1, ..., n, into K possible classes.
Utilizing the framework of model-based clustering, we introduce the PAirwise Reciprocal fuSE (PARSE) procedure based on a new class of penalization functions that imposes infinite penalties on variables with small differences across clusters. PARSE effectively avoids selecting an overly dense set of variables for separating observations into clusters. We establish the consistency of the proposed procedure for identifying informative variables for cluster analysis. The PARSE procedure is shown to enjoy certain optimality properties as well. We develop a backward selection algorithm, in conjunction with the EM algorithm, to implement PARSE. Simulation studies show that PARSE has competitive performance compared to other popular model-based clustering methods. PARSE is shown to select a sparse set of variables and produce accurate clustering results. We apply PARSE to microarray data on human asthma and discuss the biological implications of the results. We develop an R package, PARSE, available on CRAN, that implements regularization methods in model-based clustering, including PARSE.
dc.format.medium: born digital
dc.format.medium: doctoral dissertations
dc.identifier: Wang_colostate_0053A_13975.pdf
dc.identifier.uri: http://hdl.handle.net/10217/178931
dc.language: English
dc.language.iso: eng
dc.publisher: Colorado State University. Libraries
dc.relation.ispartof: 2000-2019
dc.rights: Copyright and other restrictions may apply. User is responsible for compliance with all applicable laws. For information about copyright law, please see https://libguides.colostate.edu/copyright.
dc.title: Some topics on model-based clustering
dc.type: Text
dcterms.embargo.expires: 2018-12-30
dcterms.embargo.terms: 2018-12-30
dcterms.rights.dpla: This Item is protected by copyright and/or related rights (https://rightsstatements.org/vocab/InC/1.0/). You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).
thesis.degree.discipline: Statistics
thesis.degree.grantor: Colorado State University
thesis.degree.level: Doctoral
thesis.degree.name: Doctor of Philosophy (Ph.D.)
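
Illustrative note: the abstract states that model-based clustering assumes a mixture model and yields uncertainty estimates for clustering assignments. The sketch below may help make that concrete for the traditional Gaussian mixture case only. It is not the dissertation's SPM-clust or PARSE procedure and does not use the PARSE R package; the function name em_gmm_1d, the median-split initialization, and the synthetic data are all assumptions made for this example.

```python
import numpy as np

def em_gmm_1d(x, n_iter=200, tol=1e-8):
    """EM for a two-component univariate Gaussian mixture (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    # Crude initialization (an assumption): split the sample at its median.
    med = np.median(x)
    mu = np.array([x[x <= med].mean(), x[x > med].mean()])
    sigma = np.array([x.std(), x.std()])
    pi = np.array([0.5, 0.5])
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to each cluster.
        dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (
            sigma * np.sqrt(2.0 * np.pi)
        )
        total = dens.sum(axis=1, keepdims=True)
        resp = dens / total
        # M-step: weighted updates of mixing proportions, means, and std devs.
        nk = resp.sum(axis=0)
        pi = nk / x.size
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
        # Log-likelihood under the parameters used in this E-step.
        ll = np.log(total).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pi, mu, sigma, resp

# Synthetic, made-up data: two well-separated Gaussian clusters.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])
pi, mu, sigma, resp = em_gmm_1d(x)
labels = resp.argmax(axis=1)  # hard assignment from soft memberships
```

Each row of resp sums to one and gives the estimated probability that an observation belongs to each cluster, which is the kind of assignment uncertainty the abstract refers to. A heavy-tailed or high-dimensional data set would likely need the more robust approaches the abstract describes.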

Files

Original bundle
Name: Wang_colostate_0053A_13975.pdf
Size: 1.12 MB
Format: Adobe Portable Document Format
Description: