Improved estimation for complex surveys using modern regression techniques
Date
2011
Authors
McConville, Kelly, author
Breidt, F. Jay, advisor
Lee, Thomas, C. M., advisor
Opsomer, Jean, committee member
Lee, Myung-Hee, committee member
Doherty, Paul F., committee member
Journal Title
Journal ISSN
Volume Title
Abstract
In the field of survey statistics, finite population quantities are often estimated based on complex survey data. In this thesis, estimation of the finite population total of a study variable is considered. The study variable is available for the sample and is supplemented by auxiliary information, which is available for every element in the finite population. Following a model-assisted framework, estimators are constructed that exploit the relationship which may exist between the study variable and ancillary data. These estimators have good design properties regardless of model accuracy. Nonparametric survey regression estimation is applicable in natural resource surveys where the relationship between the auxiliary information and study variable is complex and of an unknown form. Breidt, Claeskens, and Opsomer (2005) proposed a penalized spline survey regression estimator and studied its properties when the number of knots is fixed. To build on their work, the asymptotic properties of the penalized spline regression estimator are considered when the number of knots goes to infinity and the locations of the knots are allowed to change. The estimator is shown to be design consistent and asymptotically design unbiased. In the course of the proof, a result is established on the uniform convergence in probability of the survey-weighted quantile estimators. This result is obtained by deriving a survey-weighted Hoeffding inequality for bounded random variables. A variance estimator is proposed and shown to be design consistent for the asymptotic mean squared error. Simulation results demonstrate the usefulness of the asymptotic approximations. Also in natural resource surveys, a substantial amount of auxiliary information, typically derived from remotely-sensed imagery and organized in the form of spatial layers in a geographic information system (GIS), is available. Some of this ancillary data may be extraneous and a sparse model would be appropriate. Model selection methods are therefore warranted. The 'least absolute shrinkage and selection operator' (lasso), presented by Tibshirani (1996), conducts model selection and parameter estimation simultaneously by penalizing the sum of the absolute values of the model coefficients. A survey-weighted lasso criterion, which accounts for the sampling design, is derived and a survey-weighted lasso estimator is presented. The root-n design consistency of the estimator and a central limit theorem result are proved. Several variants of the survey-weighted lasso estimator are constructed. In particular, a calibration estimator and a ridge regression approximation estimator are constructed to produce lasso weights that can be applied to several study variables. Simulation studies show the lasso estimators are more efficient than the regression estimator when the true model is sparse. The lasso estimators are used to estimate the proportion of tree canopy cover for a region of Utah. Under a joint design-model framework, the survey-weighted lasso coefficients are shown to be root-N consistent for the parameters of the superpopulation model and a central limit theorem result is found. The methodology is applied to estimate the risk factors for the Zika virus from an epidemiological survey on the island of Yap. A logistic survey-weighted lasso regression model is fit to the data and important covariates are identified.
Description
Rights Access
Subject
penalized splines
non-parametrics
survey
lasso
model selection