Browsing by Author "Breidt, F. Jay, advisor"
Now showing 1 - 14 of 14
Item Open Access: Improved estimation for complex surveys using modern regression techniques (Colorado State University. Libraries, 2011)
McConville, Kelly, author; Breidt, F. Jay, advisor; Lee, Thomas C. M., advisor; Opsomer, Jean, committee member; Lee, Myung-Hee, committee member; Doherty, Paul F., committee member

In the field of survey statistics, finite population quantities are often estimated based on complex survey data. In this thesis, estimation of the finite population total of a study variable is considered. The study variable is available for the sample and is supplemented by auxiliary information, which is available for every element in the finite population. Following a model-assisted framework, estimators are constructed that exploit the relationship which may exist between the study variable and ancillary data. These estimators have good design properties regardless of model accuracy. Nonparametric survey regression estimation is applicable in natural resource surveys, where the relationship between the auxiliary information and the study variable is complex and of an unknown form. Breidt, Claeskens, and Opsomer (2005) proposed a penalized spline survey regression estimator and studied its properties when the number of knots is fixed. To build on their work, the asymptotic properties of the penalized spline regression estimator are considered when the number of knots goes to infinity and the locations of the knots are allowed to change. The estimator is shown to be design consistent and asymptotically design unbiased. In the course of the proof, a result is established on the uniform convergence in probability of the survey-weighted quantile estimators. This result is obtained by deriving a survey-weighted Hoeffding inequality for bounded random variables. A variance estimator is proposed and shown to be design consistent for the asymptotic mean squared error. Simulation results demonstrate the usefulness of the asymptotic approximations. Also in natural resource surveys, a substantial amount of auxiliary information, typically derived from remotely-sensed imagery and organized in the form of spatial layers in a geographic information system (GIS), is available. Some of this ancillary data may be extraneous, and a sparse model would be appropriate. Model selection methods are therefore warranted. The 'least absolute shrinkage and selection operator' (lasso), presented by Tibshirani (1996), conducts model selection and parameter estimation simultaneously by penalizing the sum of the absolute values of the model coefficients. A survey-weighted lasso criterion, which accounts for the sampling design, is derived and a survey-weighted lasso estimator is presented. The root-n design consistency of the estimator and a central limit theorem result are proved. Several variants of the survey-weighted lasso estimator are constructed. In particular, a calibration estimator and a ridge regression approximation estimator are constructed to produce lasso weights that can be applied to several study variables. Simulation studies show the lasso estimators are more efficient than the regression estimator when the true model is sparse. The lasso estimators are used to estimate the proportion of tree canopy cover for a region of Utah. Under a joint design-model framework, the survey-weighted lasso coefficients are shown to be root-N consistent for the parameters of the superpopulation model and a central limit theorem result is found. The methodology is applied to estimate the risk factors for the Zika virus from an epidemiological survey on the island of Yap. A logistic survey-weighted lasso regression model is fit to the data and important covariates are identified.
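For context, a generic survey-weighted lasso criterion and the associated model-assisted estimator of the total take the following form; this is a standard textbook-style sketch, not necessarily the exact formulation derived in the thesis.

    \hat{\beta}_{\lambda} = \arg\min_{\beta} \sum_{i \in s} \frac{1}{\pi_i}\bigl(y_i - x_i^{\top}\beta\bigr)^2 + \lambda \sum_{j=1}^{p} |\beta_j|,
    \qquad
    \hat{T}_{y} = \sum_{i \in U} x_i^{\top}\hat{\beta}_{\lambda} + \sum_{i \in s} \frac{y_i - x_i^{\top}\hat{\beta}_{\lambda}}{\pi_i},

where U is the finite population, s the sample, and \pi_i the inclusion probabilities; the second term is the usual design-weighted correction that protects the estimator when the working model is misspecified.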
Item Open Access: Nonparametric tests for informative selection and small area estimation for reconciling survey estimates (Colorado State University. Libraries, 2020)
Liu, Teng, author; Breidt, F. Jay, advisor; Wang, Haonan, committee member; Estep, Donald J., committee member; Doherty, Paul F., Jr., committee member

Two topics in the analysis of complex survey data are addressed: testing for informative selection and addressing temporal discontinuities due to survey redesign. Informative selection, in which the distribution of response variables given that they are sampled is different from their distribution in the population, is pervasive in modern complex surveys. Failing to take such informativeness into account could produce severe inferential errors, such as biased parameter estimators, wrong coverage rates of confidence intervals, incorrect test statistics, and erroneous conclusions. While several parametric procedures exist to test for informative selection in the survey design, it is often hard to check the parametric assumptions on which those procedures are based. We propose two classes of nonparametric tests for informative selection, each motivated by a nonparametric test for two independent samples. The first nonparametric class generalizes classic two-sample tests that compare empirical cumulative distribution functions, including Kolmogorov–Smirnov and Cramér–von Mises, by comparing weighted and unweighted empirical cumulative distribution functions. The second nonparametric class adapts two-sample tests that compare distributions based on the maximum mean discrepancy to the setting of weighted and unweighted distributions. The asymptotic distributions of both test statistics are established under the null hypothesis of noninformative selection. Simulation results demonstrate the usefulness of the asymptotic approximations, and show that our tests have competitive power with parametric tests in a correctly specified parametric setting while achieving greater power in misspecified scenarios. Many surveys face the problem of comparing estimates obtained with different methodology, including differences in frames, measurement instruments, and modes of delivery. Differences may exist within the same survey; for example, multi-mode surveys are increasingly common. Further, it is inevitable that surveys need to be redesigned from time to time. Major redesign of survey processes could affect survey estimates systematically, and it is important to quantify and adjust for such discontinuities between the designs to ensure comparability of estimates over time. We propose a small area estimation approach to reconcile two sets of survey estimates, and apply it to two surveys in the Marine Recreational Information Program (MRIP). We develop a log-normal model for the estimates from the two surveys, accounting for temporal dynamics through regression on population size and state-by-wave seasonal factors, and accounting in part for changing coverage properties through regression on wireless telephone penetration. Using the estimated design variances, we develop a regression model that is analytically consistent with the log-normal mean model. We use the modeled design variances in a Fay-Herriot small area estimation procedure to obtain empirical best linear unbiased predictors of the reconciled effort estimates for all states and waves, and provide an asymptotically valid mean square error approximation.
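For readers unfamiliar with the small area machinery referenced above, the standard Fay-Herriot area-level model has the generic form below; the thesis works with a log-normal mean model and modeled design variances, so the details there differ.

    \hat{\theta}_i = \theta_i + e_i, \qquad e_i \sim N(0, D_i) \ \text{(design variance, treated as known)},
    \theta_i = x_i^{\top}\beta + v_i, \qquad v_i \sim N(0, A),
    \tilde{\theta}_i = \hat{\gamma}_i\,\hat{\theta}_i + (1 - \hat{\gamma}_i)\,x_i^{\top}\hat{\beta}, \qquad \hat{\gamma}_i = \frac{\hat{A}}{\hat{A} + D_i},

so each area's direct estimate is shrunk toward the regression prediction, with more shrinkage where the design variance is large.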
Item Open Access: Penalized estimation for sample surveys in the presence of auxiliary variables (Colorado State University. Libraries, 2008)
Delorey, Mark J., author; Breidt, F. Jay, advisor

In conducting sample surveys, time and financial resources can be limited but research questions are wide and varied. Thus, methods for analysis must make the best use of whatever data are available and produce results that address a variety of needs. Motivation for this research comes from surveys of aquatic resources, in which sample sizes are small to moderate, but auxiliary information is available to supplement measured survey responses. The problems of survey estimation considered here are tied together by their use of constrained/penalized estimation techniques for combining the auxiliary information with the responses of interest. We study a small area problem with the goal of obtaining a good ensemble estimate, that is, a collection of estimates for individual small areas that collectively give a good estimate of the overall distribution function across small areas. Often, estimators that are good for one purpose may not be good for others. For example, estimation of the distribution function itself (as in Cordy and Thomas, 1997) can address questions of variability and extremes but does not provide individual estimators of the small areas, nor is it appropriate when auxiliary information can be made of use. Bayes estimators are good individual estimators in terms of mean squared error but are not variable enough to represent ensemble traits (Ghosh, 1992). An algorithm that extends the constrained Bayes (CB) methods of Louis (1984) and Ghosh (1992) for use in a model with a general covariance matrix is presented. This algorithm produces estimators with properties similar to those of CB, and we refer to this method as general constrained Bayes (GCB). The ensemble GCB estimator is asymptotically unbiased for the posterior mean of the empirical distribution function (edf). The ensemble properties of transformed GCB estimates are investigated to determine if the desirable ensemble characteristics displayed by the GCB estimator are preserved under such transformations. The GCB algorithm is then applied to complex models such as conditional autoregressive spatial models and to penalized spline models. Illustrative examples include the estimation of lip cancer risk, mean water acidity, and rates of change in water acidity. We also study a moderate area problem in which the goal is to derive a set of survey weights that can be applied to each study variable with reasonable predictive results. Zheng and Little (2003) use penalized spline regression in a model-based approach for finite population estimation in a two-stage sample when predictor variables are available. Breidt et al. (2005) propose a class of model-assisted estimators based on penalized spline regression in single-stage sampling. Because unbiasedness of the model-based estimator requires that the model be correctly specified, we look at extending model-assisted estimation to the two-stage case. By calibrating the degrees of freedom of the smooth to the most important study variables, a set of weights can be obtained that produce design consistent estimators for all study variables. The model-assisted estimator is compared to other estimators in a simulation study. Results from the simulation study show that the model-assisted estimator is comparable to other estimators when the model is correctly specified and generally superior when the model is incorrectly specified.
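As background on the constrained Bayes idea of Louis (1984) and Ghosh (1992) extended above, a common way to write the adjustment is the sketch below; the general-covariance version developed in the thesis is more involved.

    \hat{\theta}_i^{\mathrm{CB}} = \bar{m} + a\,(m_i - \bar{m}), \qquad a \ge 1,

where m_i is the posterior mean for small area i, \bar{m} is the average of the posterior means, and the inflation factor a is chosen so that the ensemble variance of the adjusted estimates matches the posterior expectation of the ensemble variance of the true small area quantities, counteracting the over-shrinkage of ordinary Bayes estimates.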
Item Open Access: Randomization tests for experiments embedded in complex surveys (Colorado State University. Libraries, 2022)
Brown, David A., author; Breidt, F. Jay, advisor; Sharp, Julia, committee member; Zhou, Tianjian, committee member; Ogle, Stephen, committee member

Embedding experiments in complex surveys has become increasingly important. For scientific questions, such embedding allows researchers to take advantage of both the internal validity of controlled experiments and the external validity of probability-based samples of a population. Within survey statistics, declining response rates have led to the development of new methods, known as adaptive and responsive survey designs, that try to increase or maintain response rates without negatively impacting survey quality. Such methodologies are assessed experimentally. Examples include a series of embedded experiments in the 2019 Triennial Community Health Survey (TCHS), conducted by the Health District of Northern Larimer County in collaboration with the Department of Statistics at Colorado State University, to determine the effects of monetary incentives, targeted mailing of reminders, and double-stuffed envelopes (including both English and Spanish versions of the survey) on response rates, cost, and representativeness of the sample. This dissertation develops methodology and theory of randomization-based tests embedded in complex surveys, assesses the methodology via simulation, and applies the methods to data from the 2019 TCHS. An important consideration in experiments to increase response rates is the overall balance of the sample, because higher overall response might still underrepresent important groups. There have been advances in recent years on methods to assess the representativeness of samples, including application of the dissimilarity index (DI) to help evaluate the representativeness of a sample under the different conditions in an incentive experiment (Biemer et al. [2018]). We develop theory and methodology for design-based inference for the DI when used in a complex survey. Simulation studies show that the linearization method has good properties, with good confidence interval coverage even in cases when the true DI is close to zero, even though point estimates may be biased. We then develop a class of randomization tests for evaluating experiments embedded in complex surveys. We consider a general parametric contrast, estimated using the design-weighted Narain-Horvitz-Thompson (NHT) approach, in either a completely randomized design or a randomized complete block design embedded in a complex survey. We derive asymptotic normal approximations for the randomization distribution of a general contrast, from which critical values can be derived for testing the null hypothesis that the contrast is zero. The asymptotic results are conditioned on the complex sample, but we include results showing that, under mild conditions, the inference extends to the finite population. Further, we develop asymptotic power properties of the tests under moderate conditions. Through simulation, we illustrate asymptotic properties of the randomization tests and compare the normal approximations of the randomization tests with corresponding Monte Carlo tests, with a design-based test developed by van den Brakel, and with randomization tests developed by Fisher-Pitman-Welch and Neyman. The randomization approach generalizes broadly to other kinds of embedded experimental designs and null hypothesis testing problems, for very general survey designs. The randomization approach is then extended from NHT estimators to generalized regression estimators that incorporate auxiliary information, and from linear contrasts to comparisons of nonlinear functions.
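A minimal Monte Carlo analogue of such a randomization test, for a design-weighted difference in means under a completely randomized design embedded in a survey, might look like the sketch below. It is illustrative only; the dissertation's tests use asymptotic normal approximations and handle more general contrasts and block designs. All variable names are assumptions for the example.

    import numpy as np

    def weighted_mean(y, w):
        """Hajek-type design-weighted mean: sum(w * y) / sum(w)."""
        return np.sum(w * y) / np.sum(w)

    def randomization_test(y, w, treat, n_perm=10000, seed=None):
        """Two-sided Monte Carlo randomization p-value for a weighted treatment contrast.

        y     : responses for the sampled units (numpy array)
        w     : survey design weights, w_i = 1 / pi_i
        treat : 0/1 treatment labels assigned by the embedded experiment
        """
        rng = np.random.default_rng(seed)
        obs = weighted_mean(y[treat == 1], w[treat == 1]) - weighted_mean(y[treat == 0], w[treat == 0])
        exceed = 0
        for _ in range(n_perm):
            perm = rng.permutation(treat)  # re-randomize treatment labels; the sample stays fixed
            stat = weighted_mean(y[perm == 1], w[perm == 1]) - weighted_mean(y[perm == 0], w[perm == 0])
            if abs(stat) >= abs(obs):
                exceed += 1
        return (exceed + 1) / (n_perm + 1)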
Item Open Access: Semiparametric regression in the presence of complex variance structures arising from small angle x-ray scattering data (Colorado State University. Libraries, 2014)
Bugbee, Bruce D., author; Breidt, F. Jay, advisor; Estep, Don, advisor; Meyer, Mary, committee member; Hoeting, Jennifer, committee member; Luger, Karolin, committee member

An ongoing problem in structural biology is how best to infer structural information for complex biological macromolecules from indirect observational data. Molecular shape dictates functionality but is not always directly observable. There exists a wide class of experimental methods whose data can be used for indirectly inferring molecular shape features with varying degrees of resolution. Of these methods, small angle X-ray scattering (SAXS) is desirable due to low requirements on the sample of interest. However, SAXS data suffer from numerous statistical problems that require the development of novel methodologies. A primary concern is the impact of radially reducing two-dimensional sensor data to a series of smooth mean and variance curves. Additionally, pronounced heteroskedasticity is often observed near sensor boundaries. The work presented here focuses on developing general model frameworks and implementation methods appropriate for SAXS data. Semiparametric regression refers to models that combine known parametric structures with flexible nonparametric components. Three semiparametric regression model frameworks that are well-suited for handling smooth data are presented. The first model introduced is the standard semiparametric regression model, described as a mixed model with low-rank penalized splines as random effects. The second model extends the first to the case of heteroskedastic errors, which violate standard model assumptions. The latent variance function in the model is estimated through an additional semiparametric regression, allowing for appropriate uncertainty estimation at the mean level. The final model considers a data structure unique to SAXS experiments. This model incorporates both radial mean and radial variance data in the hope of better inferring three-dimensional shape properties and understanding experimental effects by including all available data. Each of the three model frameworks is structured hierarchically. Bayesian inference is appealing in this context, as it provides efficient and generalized modeling frameworks in a unified way. The main statistical contributions of this thesis are the specific methods developed to address the computational challenges of Bayesian inference for these models. The contributions include new Markov chain Monte Carlo (MCMC) procedures for numerical approximation of posterior distributions and novel variational approximations that are extremely fast and accurate. For the heteroskedastic semiparametric case, known-form posterior conditionals are available for all model parameters save for the regression coefficients controlling the latent model variance function. A novel implementation of a multivariate delayed rejection adaptive Metropolis (DRAM) procedure is used to sample from this posterior conditional distribution. The joint model for radial mean and radial variance data is shown to be of comparable structure to the heteroskedastic case, and the new DRAM methodology is extended to handle this case. Simulation studies of all three methods are provided, showing that these models provide accurate fits of observed data and latent variance functions. The demands of scientific data processing in the context of SAXS, where large data sets are rapidly attained, lead to consideration of fast approximations as alternatives to MCMC. Variational approximation, or variational Bayes, describes a class of approximation methods in which the posterior distribution of the parameters is approximated by minimizing the Kullback-Leibler divergence between the true posterior and a class of distributions under mild structural constraints. Variational approximations have been shown to be good approximations of true posteriors in many cases. A novel variational approximation for the general heteroskedastic semiparametric regression model is derived here. Simulation studies are provided demonstrating fit and coverage properties comparable to the DRAM results at a fraction of the computational cost. A variational approximation for the joint model of radial mean and variance data is also provided but is shown to suffer from poor performance due to high correlation across a subset of regression parameters. The heteroskedastic semiparametric regression framework has some strong structural relationships with a distinct, important problem: spatially adaptive smoothing. A noisy function with different amounts of smoothness over its domain may be systematically under-smoothed or over-smoothed if the smoothing is not spatially adaptive. A novel variational approximation is derived for the problem of spatially adaptive penalized spline regression, and shown to have excellent performance. This approximation method is shown to be able to fit highly oscillatory data while not requiring the traditional tuning and computational resources of standard MCMC implementations. Potential scientific contributions of the statistical methodology developed here are illuminated with SAXS data examples. Analysis of SAXS data typically has two primary concerns: description of experimental effects and estimation of physical shape parameters. Formal statistical procedures for testing the effect of sample concentration and exposure time are presented as alternatives to current methods, in which data sets are evaluated subjectively and often combined in ad hoc ways. Additionally, estimation procedures for the scattering intensity at zero angle, known to be proportional to molecular weight, and the radius of gyration are described along with appropriate measures of uncertainty. Finally, a brief example of the joint radial mean and variance method is provided. Guidelines for extending the models presented here to more complex SAXS problems are also given.
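For orientation, a generic heteroskedastic semiparametric regression of the kind described above can be sketched as follows; this is an illustrative form using truncated-line penalized splines, not necessarily the thesis's exact specification.

    y_i = f(x_i) + \varepsilon_i, \qquad \varepsilon_i \sim N\bigl(0, \sigma^2(x_i)\bigr),
    f(x) = \beta_0 + \beta_1 x + \sum_{k=1}^{K} u_k\,(x - \kappa_k)_+, \qquad u_k \sim N(0, \tau^2),
    \log \sigma^2(x) = \gamma_0 + \gamma_1 x + \sum_{k=1}^{K} v_k\,(x - \kappa_k)_+, \qquad v_k \sim N(0, \tau_v^2),

so both the mean curve and the log-variance curve are penalized splines at knots \kappa_1, \ldots, \kappa_K, and the whole model fits naturally into a Bayesian hierarchy.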
Item Open Access: Spatial models with applications in computer experiments (Colorado State University. Libraries, 2008)
Wang, Ke, author; Davis, Richard A., advisor; Breidt, F. Jay, advisor

Often, a deterministic computer response is modeled as a realization from a stochastic process such as a Gaussian random field. Due to the limitations of the stationary Gaussian process (GP) in accommodating inhomogeneous smoothness, we consider modeling a deterministic computer response as a realization from a stochastic heteroskedastic process (SHP), a stationary non-Gaussian process. Conditional on a latent process, the SHP has a non-stationary covariance function and is a non-stationary GP. As such, the sample paths of this process exhibit greater variability and hence offer more modeling flexibility than those produced by a traditional GP model. We use maximum likelihood for inference in the SHP model, which is complicated by the high dimensionality of the latent process. Accordingly, we develop an importance sampling method for likelihood computation and use a low-rank kriging approximation to reconstruct the latent process. Responses at unobserved locations can be predicted using empirical best predictors or by empirical best linear unbiased predictors. In addition, prediction error variances are obtained. The SHP model can be used in an active learning context, adaptively selecting new locations that provide improved estimates of the response surface. Estimation, prediction, and adaptive sampling with the SHP model are illustrated with several examples. Our spatial model can be adapted to model the first partial derivative process. The derivative process provides additional information about the shape and smoothness of the underlying deterministic function and can assist in the prediction of responses at unobserved sites. The unconditional correlation function for the derivative process presents some interesting properties, and can be used as a new class of spatial correlation functions. For parameter estimation, we propose to use a similar strategy to develop an importance sampling technique to compute the joint likelihood of responses and derivatives. The major difficulties of bringing in derivative information are the increase in the dimensionality of the latent process and the numerical problems of inverting the enlarged covariance matrix. Some possible ways to utilize this information more efficiently are proposed.
Item Open Access: Spatial processes with stochastic heteroscedasticity (Colorado State University. Libraries, 2008)
Huang, Wenying, author; Breidt, F. Jay, advisor; Davis, Richard A., advisor

Stationary Gaussian processes are widely used in spatial data modeling and analysis. Stationarity is a relatively restrictive assumption regarding spatial association. By introducing stochastic volatility into a Gaussian process, we propose a stochastic heteroscedastic process (SHP) with conditional nonstationarity. That is, conditional on a latent Gaussian process, the SHP is a Gaussian process with non-stationary covariance structure. Unconditionally, the SHP is a stationary non-Gaussian process. Realizations from the SHP are versatile and can represent spatial inhomogeneities. The unconditional correlation of the SHP offers a rich class of correlation functions that can also allow for a smoothed nugget effect. For maximum likelihood estimation, we propose to apply importance sampling in the likelihood calculation and latent process estimation. The importance density we construct is of the same dimensionality as the observations. When the sample size is large, the importance sampling scheme becomes infeasible and/or inaccurate. A low-dimensional approximation model is developed to solve the numerical difficulties. We develop two spatial prediction methods: PBP (plug-in best predictor) and PBLUP (plug-in best linear unbiased predictor). Empirical results with simulated and real data show improved out-of-sample prediction performance of SHP modeling over stationary Gaussian process modeling. We extend the single-realization model to an SHP model with replicates. The spatial replications are modeled as independent realizations from an SHP model conditional on a common latent process. A simulation study shows substantial improvements in parameter estimation and process prediction when replicates are available. In an example with real atmospheric deposition data, the SHP model with replicates outperforms the Gaussian process model in prediction by capturing the spatial volatilities.
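A plausible generic form for a stochastic heteroscedastic process of this kind is sketched below; the exact parameterization used in the dissertation may differ.

    Z(s) = \mu + \sigma \exp\{\alpha(s)/2\}\,\varepsilon(s),

where \alpha(\cdot) is a latent stationary Gaussian process, independent of the stationary Gaussian process \varepsilon(\cdot). Conditional on \alpha, Z is Gaussian with a non-stationary covariance function; unconditionally, Z is stationary but non-Gaussian, which is exactly the behavior described in the abstract.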
Item Open Access: State-space models for stream networks (Colorado State University. Libraries, 2007)
Coar, William J., author; Breidt, F. Jay, advisor

The natural branching that occurs in a stream network, in which two upstream reaches merge to create a new downstream reach, generates a tree structure. Furthermore, because of the natural flow of water in a stream network, characteristics of a downstream reach may depend on characteristics of upstream reaches. Since the flow of water from reach to reach provides a natural time-like ordering throughout the stream network, we propose a state-space model to describe the spatial dependence in this tree-like structure with ordering based on flow. Developing a state-space formulation permits the use of the well-known Kalman recursions. Variations of the Kalman filter and smoother are derived for the tree-structured state-space model, which allows recursive estimation of unobserved states and prediction of missing observations on the network, as well as computation of the Gaussian likelihood, even when the data are incomplete. To reduce the computational burden that may be associated with optimization of this exact likelihood, a version of the expectation-maximization (EM) algorithm is presented that uses the Kalman smoother to fill in missing values in the E-step, and maximizes the Gaussian likelihood for the completed dataset in the M-step. Several forms of dependence for discrete processes on a stream network are considered, such as network analogues of the autoregressive-moving average model and stochastic trend models. Network parallels for first and second differences in time series are defined, which allow for definition of a spline smoother on a stream network through a special case of a local linear trend model. We have taken the approach of modeling a discrete process, which we see as a building block to more appropriate yet more complicated models. Adaptation of this state-space model and Kalman prediction equations to allow for more complicated forms of spatial and perhaps temporal dependence is a potential area of future research. Other possible directions for future research are non-Gaussian and nonlinear error structures, model selection, and properties of estimators.
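As a purely illustrative sketch of the tree-structured state-space idea (a generic form, not the specific model developed in the dissertation), the state of a downstream reach d with upstream parent reaches u1 and u2 could evolve as:

    s_d = \Phi_1 s_{u_1} + \Phi_2 s_{u_2} + w_d, \qquad y_d = H_d s_d + v_d,

with independent Gaussian disturbances w_d and v_d. Filtering then proceeds from the headwater reaches downstream, combining the two parent states at each confluence, in direct analogy with the usual Kalman recursions for time series.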
Item Open Access: Statistical innovations for estimating shape characteristics of biological macromolecules in solution using small-angle x-ray scattering data (Colorado State University. Libraries, 2016)
Alsaker, Cody, author; Breidt, F. Jay, advisor; Estep, Don, committee member; Kokoszka, Piotr, committee member; Luger, Karolin, committee member

Small-angle X-ray scattering (SAXS) is a technique that yields low-resolution images of biological macromolecules by exposing a solution containing the molecule to a powerful X-ray beam. The beam scatters when it interacts with the molecule. The intensity of the scattered beam is recorded on a detector plate at various scattering angles, and contains information on structural characteristics of the molecule in solution. In particular, the radius of gyration (Rg) for a molecule, which is a measure of the spread of its mass, can be estimated from the lowest scattering angles of SAXS data using a regression technique known as Guinier analysis. The analysis requires specification of a range or "window" of scattering angles over which the regression relationship holds. We have thus developed methodology and supporting asymptotic theory for selection of an optimal window, minimum mean square error estimation of the radius of gyration, and estimation of its variance. The theory and methodology are developed using a local polynomial model with autoregressive errors. Simulation studies confirm the quality of the asymptotic approximations and the superior performance of the proposed methodology relative to the accepted standard. We show that the algorithm is applicable to data acquired from proteins, nucleic acids and their complexes, and we demonstrate with examples that the algorithm improves the ability to test biological hypotheses. The radius of gyration is a normalized second moment of the pairwise distance distribution p(r), which describes the relative frequency of inter-atomic distances in the structure of the molecule. By extending the theory to fourth moments, we show that a new parameter ψ can be calculated theoretically from p(r) and estimated from experimental SAXS data, using a method that extends Guinier's Rg estimation procedure. This new parameter yields an enhanced ability to use intensity data to distinguish between two molecules with different but similar Rg values. Analysis of existing structures in the protein data bank (PDB) shows that the theoretical ψ values relate closely to the aspect ratio of a molecular structure. The combined values for Rg and ψ acquired from experimental data provide estimates for the dimensions and associated uncertainties for a standard geometric shape, representing the particle in solution. We have chosen the cylinder as the standard shape and show that a simple, automated procedure gives a cylindrical estimate of a particle of interest. The cylindrical estimate in turn yields a good first approximation to the maximum inter-atomic distance in a molecule, Dmax, an important parameter in shape reconstruction. As with estimation of Rg, estimation of ψ requires specification of a window of angles over which to conduct the higher-order Guinier analysis. We again employ a local polynomial model with autoregressive errors to derive methodology and supporting asymptotic theory for selection of an optimal window, minimum mean square error estimation of the aspect ratio, and estimation of its variance. Recent advances in SAXS data collection and more comprehensive data comparisons have resulted in a great need for automated scripts that analyze SAXS data. Our procedures to estimate Rg and ψ can be automated easily and can thus be used for large suites of SAXS data under various experimental conditions, in an objective and reproducible manner. The new methods are applied to 357 SAXS intensity curves arising from a study on the wild type nucleosome core particle and its mutants and their behavior under different experimental conditions. The resulting Rg² values constitute a dataset which is then analyzed to account for the complex dependence structure induced by the experimental protocols. The analysis yields powerful scientific inferences and insight into better design of SAXS experiments. Finally, we consider a measurement error problem relevant to the estimation of the radius of gyration. In a SAXS experiment, it is standard to obtain intensity curves at different concentrations of the molecule in solution. Concentration-by-angle interactions may be present in such data, and analysis is complicated by the fact that actual concentration levels are unknown, but are measured with some error. We therefore propose a model and estimation procedure that allows estimation of true concentration ratios and concentration-by-angle interactions, without requiring any information about concentration other than that contained in the SAXS data.
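For reference, classical Guinier analysis fits a straight line to log-intensity against squared scattering angle over a low-angle window, with slope -Rg²/3. A minimal sketch is given below; the window here is chosen by the simple q·Rg < 1.3 rule of thumb rather than the optimal-window method developed in the dissertation, and the function names are illustrative.

    import numpy as np

    def guinier_fit(q, intensity, q_rg_max=1.3, min_points=5):
        """Estimate Rg and I(0) from the Guinier approximation
        ln I(q) ~ ln I(0) - (Rg^2 / 3) * q^2, iterating a simple low-angle window."""
        window = np.arange(len(q))                 # start from all available angles
        rg, i0 = np.nan, np.nan
        for _ in range(20):                        # alternate fitting and window selection
            slope, intercept = np.polyfit(q[window] ** 2, np.log(intensity[window]), 1)
            if slope >= 0:                         # no decay; a Guinier fit is not meaningful here
                break
            rg, i0 = np.sqrt(-3.0 * slope), np.exp(intercept)
            new_window = np.flatnonzero(q * rg < q_rg_max)
            if new_window.size < min_points or np.array_equal(new_window, window):
                break
            window = new_window
        return rg, i0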
Item Open Access: Statistical modeling and inference for complex-structured count data with applications in genomics and social science (Colorado State University. Libraries, 2020)
Cao, Meng, author; Zhou, Wen, advisor; Breidt, F. Jay, advisor; Estep, Don, committee member; Meyer, Mary C., committee member; Peers, Graham, committee member

This dissertation describes models, estimation methods, and testing procedures for count data that build upon classic generalized linear models, including Gaussian, Poisson, and negative binomial regression. The methodological extensions proposed in this dissertation are motivated by complex structures for count data arising in three important classes of scientific problems, from both genomics and sociological contexts. Complexities include large scale, temporal dependence, zero-inflation and other mixture features, and group structure. The first class of problems involves count data that are collected from longitudinal RNA sequencing (RNA-seq) experiments, where the data consist of tens of thousands of short time series of counts, with replicate time series under treatment and under control. In order to determine if the time course differs between treatment and control, we consider two questions: 1) whether the treatment affects the geometric attributes of the temporal profiles and 2) whether any treatment effect varies over time. To answer the first question, we determine whether there has been a fundamental change in shape by modeling the transformed count data for genes at each time point using a Gaussian distribution, with the mean temporal profile generated by spline models, and introduce a measurement that quantifies the average minimum squared distance between the locations of peaks (or valleys) of each gene's temporal profile across experimental conditions. We then develop a testing framework based on a permutation procedure. Via simulation studies, we show that the proposed test achieves good power while controlling the false discovery rate. We also apply the test to data collected from a light physiology experiment on maize. To answer the second question, we model the time series of counts for each gene by a Gaussian-Negative Binomial model and introduce a new testing procedure that enjoys the optimality property of maximum average power. The test allows not only identification of traditional differentially expressed genes but also testing of a variety of composite hypotheses of biological interest. We establish the identifiability of the proposed model, implement the proposed method via efficient algorithms, and expose its good performance via simulation studies. The procedure reveals interesting biological insights when applied to data from an experiment that examines the effect of varying light environments on the fundamental physiology of a marine diatom. The second class of problems involves analyzing group-structured sRNA data that consist of independent replicates of counts for each sRNA across experimental conditions. Most existing methods, for both normalization and differential expression, are designed for non-group structured data. These methods may fail to provide correct normalization factors or fail to control FDR. They may lack power and may not be able to make inference on group effects. To address these challenges simultaneously, we introduce an inferential procedure using a group-based negative binomial model and a bootstrap testing method. This procedure not only provides a group-based normalization factor, but also enables group-based differential expression analysis. Our method shows good performance in both simulation studies and analysis of experimental data on roundworm. The last class of problems is motivated by the study of sensitive behaviors. These problems involve mixture-distributed count data that are collected by a quantitative randomized response technique (QRRT) which guarantees respondent anonymity. We propose a Poisson regression method based on maximum likelihood estimation computed via the EM algorithm. This method allows assessment of the importance of potential drivers of different quantities of non-compliant behavior. The method is illustrated with a case study examining potential drivers of non-compliance with hunting regulations in Sierra Leone.
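For orientation, a generic negative binomial count-regression model of the kind that underlies such differential expression analyses can be written as follows; this is a standard formulation, not the specific group-based model of the dissertation.

    Y_{gi} \sim \mathrm{NB}(\mu_{gi}, \phi_g), \qquad \log \mu_{gi} = \log s_i + x_i^{\top}\beta_g,

where Y_{gi} is the count for gene (or sRNA) g in sample i, s_i is a sample-specific normalization factor, x_i encodes the experimental condition and group structure, and \phi_g is a gene-specific dispersion; differential expression corresponds to hypotheses about components of \beta_g.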
Item Open Access: Survey sampling with nonparametric methods: endogenous post-stratification and penalized instrumental variables (Colorado State University. Libraries, 2012)
Dahlke, Mark, author; Breidt, F. Jay, advisor; Opsomer, Jean, committee member; Lee, Myung-Hee, committee member; Pezeshki, Ali, committee member

Two topics related to the common theme of nonparametric techniques in survey sampling are examined. The first topic explores the estimation of a finite population mean via post-stratification. Post-stratification is used to improve the precision of survey estimators when categorical auxiliary information is available from external sources. In natural resource surveys, such information may be obtained from remote sensing data classified into categories and displayed as maps. These maps may be based on classification models fitted to the sample data. Such "endogenous post-stratification" violates the standard assumptions that observations are classified without error into post-strata, and post-stratum population counts are known. Properties of the endogenous post-stratification estimator (EPSE) are derived for the case of sample-fitted nonparametric models, with particular emphasis on monotone regression models. Asymptotic properties of the nonparametric EPSE are investigated under a superpopulation model framework. Simulation experiments illustrate the practical effects of first fitting a nonparametric model to survey data before post-stratifying. The second topic explores the use of instrumental variables to estimate regression coefficients. Informative sampling in survey problems occurs when the inclusion probabilities depend on the values of the study variable. In a regression setting under this sampling scheme, ordinary least squares estimators are biased and inconsistent. Given inverse inclusion probabilities as weights for the sample, various consistent estimators can be constructed. In particular, weighted covariates can be used as instrumental variables, allowing for calculation of a consistent, classical two-stage least squares estimator. The proposed estimator uses a similar two-stage process, but with penalized splines at the first stage. Consistency and asymptotic normality of the new estimator are established. The estimator is asymptotically unbiased, but has a finite-sample bias that is analytically characterized. Selection of an optimal smoothing parameter is shown to reduce the finite-sample variance, in comparison to that of the classical two-stage least squares estimator, offsetting the bias and providing an estimator with a reduced mean square error.
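For context, a generic design-weighted post-stratified estimator of a population total, with post-strata g = 1, ..., G, is sketched below; in endogenous post-stratification both the stratum membership and the population counts are derived from a model fitted to the sample, rather than being known in advance.

    \hat{T}_{\mathrm{PS}} = \sum_{g=1}^{G} \hat{N}_g \, \frac{\sum_{i \in s_g} y_i / \pi_i}{\sum_{i \in s_g} 1 / \pi_i},

where s_g is the set of sampled units classified into post-stratum g and \hat{N}_g is the corresponding population count obtained by applying the fitted classification rule to the auxiliary data for the whole population.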
Item Open Access: Testing and adjusting for informative sampling in survey data (Colorado State University. Libraries, 2014)
Herndon, Wade Wilson, author; Breidt, F. Jay, advisor; Opsomer, Jean, advisor; Cooley, Daniel, committee member; Meyer, Mary, committee member; Doherty, Paul, committee member

Fitting models to survey data can be problematic due to the potentially complex sampling mechanism through which the observed data are selected. Survey weights have traditionally been used to adjust for unequal inclusion probabilities under the design-based paradigm of inference; however, this limits the ability of analysts to make inference of a more general kind, such as to characteristics of a superpopulation. The problems induced by the presence of a complex sampling design can be generally contained under the heading of informative sampling. To say that the sampling is informative is to say that the distribution of the data in the sample is different from the distribution of the data in the population. Two major topics relating to analyzing survey data with (potentially) informative sampling are addressed: testing for informativeness, and model building in the presence of informative sampling. First addressed is the problem of running formal tests for informative sampling in survey data. The major contribution contained here is to detail a new test for informative sampling. The test is shown to be widely applicable and straightforward to implement in practice, and also useful compared to existing tests. The test is illustrated through a variety of empirical studies as well. These applications include a censored regression problem, linear regression, logistic regression, and fitting a gamma mixture model. Results from the analogous bootstrap test are also presented; these results agree with the analytic versions of the test. Alternative tests for informative sampling do in fact exist; however, the existing methods each have significant drawbacks and limitations, which may be resolved in some situations with this new methodology, and overall the literature is quite sparse in this area. In a simulation study, the test is shown to have many desirable properties and maintains high power compared to alternative tests. Also included is discussion about the limiting distribution of the test statistic under a sequence of local alternative hypotheses, and some extensions that are useful in connecting the work contained here with some of the previous work in the area. These extensions also help motivate the semiparametric methods considered in the chapter that follows. The next topic explored is semiparametric methods for including design information in a regression model while staying within a model-based inferential framework. The ideas explored here attempt to exploit relationships between design variables (such as the sample inclusion probabilities) and model covariates. In order to account for the complex sampling design and (potential) bias in estimating model parameters, design variables are included as covariates and considered to be functions of the model covariates that can then be estimated in a design-based paradigm using nonparametric methods. The nonparametric method explored here is kernel smoothing with degree zero. In principle, other (and more complex) kinds of estimators could be used to estimate the functions of the design variables conditional on the model covariates, but the framework presented here provides asymptotic results for only the simpler case of kernel smoothing. The method is illustrated via empirical applications and also through a simulation study in which confidence band coverage rates from the semiparametric method are compared to those obtained through regular linear regression. The semiparametric estimator soundly outperforms the regression estimator.
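One simple classical-style diagnostic in this spirit, included here only as an illustration (it is not the new test developed in the dissertation), is to check whether the survey weights add explanatory power to the regression model; the function name below is hypothetical.

    import numpy as np
    import statsmodels.api as sm

    def weight_association_check(y, X, weights):
        """Refit the regression with the survey weight as an extra covariate and
        report its coefficient and p-value; a small p-value is informal evidence
        that the sampling may be informative for this model."""
        exog = sm.add_constant(np.column_stack([X, weights]))
        fit = sm.OLS(y, exog).fit()
        return fit.params[-1], fit.pvalues[-1]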
Item Open Access: Topics in estimation for messy surveys: imperfect matching and nonprobability sampling (Colorado State University. Libraries, 2022)
Huang, Chien-Min, author; Breidt, F. Jay, advisor; Wang, Haonan, committee member; Keller, Joshua, committee member; Pallickara, Sangmi, committee member

Two problems in estimation for "messy" surveys are addressed, both requiring the combination of survey data with other data sources. The first estimation problem involves the combination of survey data with auxiliary data, when the matching of the two sources is imperfect. Model-assisted survey regression estimators combine auxiliary information available at a population level with complex survey data to estimate finite population parameters. Many prediction methods, including linear and mixed models, nonparametric regression, and machine learning techniques, can be incorporated into such model-assisted estimators. These methods assume that observations obtained for the sample can be matched without error to the auxiliary data. We investigate properties of estimators that rely on matching algorithms that do not in general yield perfect matches. We focus on difference estimators, which are exactly unbiased under perfect matching but not under imperfect matching. The methods are investigated analytically and via simulation, using a study of recreational angling in South Carolina to build a simulation population. In this study, the survey data come from a stratified, two-stage sample and the auxiliary data from logbooks filed by boat captains. Extensions to multiple frame estimators under imperfect matching are discussed. The second estimation problem involves the combination of survey data from a probability sample with additional data from a nonprobability sample. The problem is motivated by an application in which field crews are allowed to use their judgment in selecting part of a sample. Many surveys are conducted in two or more stages, with the first stage of primary sampling units dedicated to screening for secondary sampling units of interest, which are then measured or subsampled. The Large Pelagics Intercept Survey, conducted by the United States National Marine Fisheries Service, draws a probability sample of fishing access site-days in the first stage and screens for relatively rare fishing trips that target pelagic species (tuna, sharks, billfish, etc.). Many site-days yield no pelagic trips. Motivated by this low yield, we consider surveys that allow expert judgment in the selection of some site-days. This nonprobability judgment sample is combined with a probability sample to generate likelihood-based estimates of inclusion probabilities and estimators of population totals that are related to dual-frame estimators. Consistency and asymptotic normality of the estimators are established under the correct specification of the model for judgment behavior. An extensive simulation study shows the robustness of the methodology to misspecification of the judgment behavior. A standard variance estimator, readily available in statistical software, yields stable estimates with small negative bias and good confidence interval coverage. Across a range of conditions, the proposed strategy that allows for some judgment dominates the classic strategy of pure probability sampling with known design weights. The methodology is extended to a doubly-robust version that uses both a propensity model for judgment selection probabilities and a regression model for study variable characteristics. If either model is correctly specified, the doubly-robust estimator is unbiased. The dual-frame methodology for samples incorporating expert judgment is then extended to two other nonprobability settings: respondent-driven sampling and biased-frame sampling.

Item Open Access: Unbiased ratio estimation for finite populations (Colorado State University. Libraries, 2008)
Al-Jararha, Jehad, author; Breidt, F. Jay, advisor

In many sample surveys from finite populations, the value of an auxiliary variable x is available (at least in aggregate form) for the entire finite population, and is correlated with the study variable of interest y. This auxiliary variable can be used to improve the precision of the estimator of the y-total.
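For context, the classical ratio estimator of the y-total that this line of work builds on has the form below; it is design-biased in general, and constructing unbiased alternatives is the subject of the dissertation.

    \hat{T}_{y,\mathrm{ratio}} = \frac{\hat{T}_y}{\hat{T}_x}\,T_x, \qquad \hat{T}_y = \sum_{i \in s} \frac{y_i}{\pi_i}, \qquad \hat{T}_x = \sum_{i \in s} \frac{x_i}{\pi_i},

where T_x is the known population total of the auxiliary variable x and \pi_i are the inclusion probabilities for the sample s.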