Department of Statistics
These digital collections include theses, dissertations, and datasets from the Department of Statistics. Due to departmental name changes, materials from the following historical department are also included here: Mathematics and Statistics.
Browsing Department of Statistics by Issue Date
Now showing 1 - 20 of 95
Item Open Access
Estimation and linear prediction for regression, autoregression and ARMA with infinite variance data (Colorado State University. Libraries, 1983) Cline, Daren B. H., author; Resnick, Sidney I., advisor; Brockwell, Peter J., advisor; Locker, John, committee member; Davis, Richard A., committee member; Boes, Duane C., committee member
This dissertation is divided into four parts, each of which considers random variables from distributions with regularly varying tails and/or in a stable domain of attraction. Part I considers the existence of infinite series of an independent sequence of such random variables and the relationship of the probability of large values of the series to the probability of large values of the first component. Part II applies Part I in order to provide a linear predictor for ARMA time series (again with regularly varying tails). This predictor is designed to minimize the probability of large prediction errors relative to the tails of the noise distribution. Part III investigates the products of independent random variables where one has distribution in a stable domain of attraction and gives conditions for which the product distribution is in a stable domain of attraction. Part IV considers estimation of the regression parameter in a model where the independent variables are in a stable domain of attraction. Consistency for certain M-estimators is proved. Utilizing portions of Part III, this final part gives necessary and sufficient conditions for consistency of least squares estimators and provides the asymptotic distribution of least squares estimators.
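The Cline abstract summarizes results rather than giving a derivation, but a small simulation makes the infinite-variance setting concrete. The sketch below, with arbitrary illustrative values for the tail index and AR coefficient (it is not code from the dissertation), generates an AR(1) series driven by symmetrized Pareto noise with infinite variance and computes the least squares estimate of the autoregressive coefficient of the kind studied in Part IV.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate X_t = phi * X_{t-1} + Z_t where Z_t is symmetrized Pareto noise with
# tail index alpha < 2, so Var(Z_t) is infinite (all values are illustrative).
alpha, phi, n = 1.5, 0.6, 5000
Z = rng.pareto(alpha, size=n) * rng.choice([-1.0, 1.0], size=n)

X = np.zeros(n)
for t in range(1, n):
    X[t] = phi * X[t - 1] + Z[t]

# Ordinary least squares estimate of phi from the observed series.
phi_ls = np.sum(X[1:] * X[:-1]) / np.sum(X[:-1] ** 2)
print(f"true phi = {phi}, least squares estimate = {phi_ls:.4f}")
```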
Item Open Access
Population size estimation using the modified Horvitz-Thompson estimator with estimated sighting probability (Colorado State University. Libraries, 1996) Wong, Char-Ngan, author; Bowden, David C., advisor
Wildlife aerial population surveys usually use a two-stage sampling technique. The first stage involves dividing the whole survey area into smaller land units, which we called the primary units, and then taking a sample from those. In the second stage, an aerial survey of the selected units is made in an attempt to observe (count) every animal. Some animals, usually occurring in groups, are not observed for a variety of reasons. Estimates from these surveys are plagued with two major sources of error, namely, errors due to sampling variation in both stages. The first error may be controlled by choosing a suitable sampling plan for the first stage. The second error is also termed "visibility bias", which acknowledges that only a portion of the groups in a sampled land unit will be enumerated. The objective of our study is to provide improved variance estimators over those provided by Steinhorst and Samuel (1989) and to evaluate the performance of various corresponding interval procedures for estimating population size. For this purpose, we have found an asymptotically unbiased estimator for the approximate variance of the population size estimator when sighting probabilities of groups are unknown and fitted with a logistic model. We have broken down the approximate variance term into three components, namely, error due to sampling of primary units, error due to sighting of groups in second-stage sampling, and error due to estimation of the sighting probabilities, and we estimate all three components separately in order to get better insight into error control. Simplified versions of the variance estimators are provided when all primary units are surveyed and for stratified random sampling of primary units. The third central moment of the population size estimator was also obtained. Simulation studies were conducted to evaluate the performance of our asymptotically unbiased variance estimators and of confidence interval procedures, such as the large sample procedure with and without transformation, for constructing 90% and 95% confidence intervals for the population size. Confidence intervals for the population size were also constructed by assuming that log(T̂ - T) is normally distributed, where T̂ is the population size estimate and T is the number of animals seen in a sample obtained from a population survey. From our simulation results, we observed that the population size is estimated with negligible bias (according to Cochran's (1977) working rule) with a sample of at least 100 groups of elk obtained from a population survey when sighting probabilities are known. When sighting probabilities are unknown, one needs to conduct a sightability survey to obtain a sample, independent of the sample obtained from a population survey, for fitting a logistic model to estimate sighting probabilities of sighted groups in the sample obtained from the population survey. In this case, the population size is also estimated with negligible bias when the sample size of both samples is at least 100 groups of elk. We also observed that when sighting probabilities are known, we needed a sample of at least 348 groups of elk from a population survey to obtain reasonable coverage rates of the true population size. When sighting probabilities are unknown and estimated via logistic regression, both samples need to contain at least 428 groups of elk to obtain reasonable coverage rates of the true population size. Among all these confidence intervals, we found that the approximate confidence intervals constructed based on the assumption that log(T̂ - T) is normally distributed and using the delta method have better coverage rates and shorter estimated expected interval widths. Confidence intervals for the population size using bootstrapping were also evaluated. We were unable to find an existing bootstrapping procedure which could be directly applied to our problem. We have, therefore, proposed a couple of bootstrapping procedures for obtaining a sample to fit a logistic model and a couple of bootstrapping procedures for obtaining a sample to construct a population size estimate. With 1000 pairs of independent samples from a sightability survey and a population survey, each sample of size 107 groups of elk, and using 500 bootstrap iterations, we obtained reasonable coverage rates of the true population size. Our other problem is model selection of a logistic model for the unknown sighting probabilities. We evaluated the performance of the population size estimator and our variance estimator when we fit a simpler model. For this purpose, we have derived theoretical expressions for the bias of the population size estimator and its mean squared error. We found, from our simulation results of fitting a couple of models simpler than the full model, that the population size was still well estimated for the fitted model based only on group size but was severely overestimated for the fitted model based only on percent of vegetation cover. For both fitted models, our variance estimator overestimated the observed variance of 1000 simulated population size estimates. We also found that the approximate expression of the expected value of the population size estimator we derived for a fitted model simpler than the full model has negligible bias (by Cochran's (1977) working rule) relative to the average of those 1000 simulated population size estimates. The approximate expression of the variance of the population size estimator we derived for this case somewhat underestimated the observed variance of those 1000 simulated population size estimates. Both approximate expressions give us an idea of the expected size of the population size estimate and its variance when the fitted model is not the full model.
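To make the general form of this estimator concrete, the sketch below computes a modified Horvitz-Thompson estimate in which each sighted group is expanded by its first-stage inclusion probability and a logistic-model sighting probability. The covariates, fitted coefficients, group sizes, and inclusion probabilities are hypothetical, and the dissertation's variance estimators are not reproduced here.

```python
import numpy as np

def estimated_sighting_prob(x, beta):
    """Logistic-model sighting probability p_i = 1 / (1 + exp(-x_i' beta))."""
    return 1.0 / (1.0 + np.exp(-x @ beta))

def modified_ht_total(group_sizes, unit_incl_prob, sight_prob):
    """Modified Horvitz-Thompson estimate of total animals: each sighted group is
    weighted by 1 / (primary-unit inclusion prob * sighting prob)."""
    return np.sum(group_sizes / (unit_incl_prob * sight_prob))

# Toy data: 5 sighted elk groups with covariates (intercept, log group size, % cover).
x = np.array([[1, np.log(3), 0.2],
              [1, np.log(8), 0.5],
              [1, np.log(2), 0.7],
              [1, np.log(15), 0.1],
              [1, np.log(5), 0.4]])
beta_hat = np.array([0.5, 0.8, -1.2])                 # hypothetical fitted logistic coefficients
group_sizes = np.array([3, 8, 2, 15, 5])              # animals per sighted group
unit_incl_prob = np.array([0.3, 0.3, 0.5, 0.5, 0.5])  # first-stage inclusion probabilities

p_hat = estimated_sighting_prob(x, beta_hat)
print("estimated population size:", modified_ht_total(group_sizes, unit_incl_prob, p_hat))
```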
Item Open Access
The pooling of prior distributions via logarithmic and supra-Bayesian methods with application to Bayesian inference in deterministic simulation models (Colorado State University. Libraries, 1998) Roback, Paul J., author; Givens, Geof, advisor; Hoeting, Jennifer, committee member; Howe, Adele, committee member; Tweedie, Richard, committee member
We consider Bayesian inference when priors and likelihoods are both available for inputs and outputs of a deterministic simulation model. Deterministic simulation models are used frequently by scientists to describe natural systems, and the Bayesian framework provides a natural vehicle for incorporating uncertainty in a deterministic model. The problem of making inference about parameters in deterministic simulation models is fundamentally related to the issue of aggregating (i.e. pooling) expert opinion. Alternative strategies for aggregation are surveyed and four approaches are discussed in detail: logarithmic pooling, linear pooling, French-Lindley supra-Bayesian pooling, and Lindley-Winkler supra-Bayesian pooling. The four pooling approaches are compared with respect to three suitability factors: theoretical properties, performance in examples, and the selection and sensitivity of hyperparameters or weightings incorporated in each method. The logarithmic pool is found to be the most appropriate pooling approach when combining expert opinions in the context of deterministic simulation models. We develop an adaptive algorithm for estimating log pooled priors for parameters in deterministic simulation models. Our adaptive estimation approach relies on importance sampling methods, density estimation techniques for which we numerically approximate the Jacobian, and nearest neighbor approximations in cases in which the model is noninvertible. This adaptive approach is compared to a nonadaptive approach over several examples, ranging from a relatively simple R1 → R1 example with normally distributed priors and a linear deterministic model to a relatively complex R2 → R2 example based on the bowhead whale population model. In each case, our adaptive approach leads to better and more efficient estimates of the log pooled prior than the nonadaptive estimation algorithm. Finally, we extend our inferential ideas to a higher-dimensional, realistic model for AIDS transmission. Several unique contributions to the statistical discipline are contained in this dissertation, including: 1. the application of logarithmic pooling to inference in deterministic simulation models; 2. the algorithm for estimating log pooled priors using an adaptive strategy; 3. the Jacobian-based approach to density estimation in this context, especially in higher dimensions; 4. the extension of the French-Lindley supra-Bayesian methodology to continuous parameters; 5. the extension of the Lindley-Winkler supra-Bayesian methodology to multivariate parameters; and 6. the proofs and illustrations of the failure of Relative Propensity Consistency under the French-Lindley supra-Bayesian approach.

Item Open Access
Model selection based on expected squared Hellinger distance (Colorado State University. Libraries, 2007) Cao, Xiaofan, author; Iyer, Hariharan K., advisor; Wang, Haonan, advisor
This dissertation is motivated by a general model selection problem such that the true model is unknown and one or more approximating parametric families of models are given along with strategies for estimating the parameters using data. We develop model selection methods based on Hellinger distance that can be applied to a wide range of modeling problems without posing the typical assumptions for the true model to be within the approximating families or to come from a particular parametric family. We propose two estimators for the expected squared Hellinger distance as the model selection criteria.

Item Open Access
State-space models for stream networks (Colorado State University. Libraries, 2007) Coar, William J., author; Breidt, F. Jay, advisor
The natural branching that occurs in a stream network, in which two upstream reaches merge to create a new downstream reach, generates a tree structure. Furthermore, because of the natural flow of water in a stream network, characteristics of a downstream reach may depend on characteristics of upstream reaches. Since the flow of water from reach to reach provides a natural time-like ordering throughout the stream network, we propose a state-space model to describe the spatial dependence in this tree-like structure with ordering based on flow. Developing a state-space formulation permits the use of the well known Kalman recursions. Variations of the Kalman Filter and Smoother are derived for the tree-structured state-space model, which allows recursive estimation of unobserved states and prediction of missing observations on the network, as well as computation of the Gaussian likelihood, even when the data are incomplete. To reduce the computational burden that may be associated with optimization of this exact likelihood, a version of the expectation-maximization (EM) algorithm is presented that uses the Kalman Smoother to fill in missing values in the E-step, and maximizes the Gaussian likelihood for the completed dataset in the M-step. Several forms of dependence for discrete processes on a stream network are considered, such as network analogues of the autoregressive-moving average model and stochastic trend models. Network parallels for first and second differences in time series are defined, which allow for the definition of a spline smoother on a stream network through a special case of a local linear trend model. We have taken the approach of modeling a discrete process, which we see as a building block to more appropriate yet more complicated models. Adaptation of this state-space model and Kalman prediction equations to allow for more complicated forms of spatial and perhaps temporal dependence is a potential area of future research. Other possible directions for future research are non-Gaussian and nonlinear error structures, model selection, and properties of estimators.
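The tree-structured filter in the Coar abstract builds on the classical Kalman recursions. As background only (this is not the network version developed in the dissertation), here is a minimal scalar Kalman filter with missing-data handling, the kind of recursion the tree-structured model generalizes by letting a downstream reach combine information from its parent reaches.

```python
import numpy as np

def kalman_filter(y, phi, sigma_w2, sigma_v2, x0, p0):
    """Standard scalar Kalman filter for x_t = phi*x_{t-1} + w_t, y_t = x_t + v_t.
    Returns filtered state means and variances; NaN observations are treated as missing."""
    n = len(y)
    x_filt = np.empty(n)
    p_filt = np.empty(n)
    x_pred, p_pred = x0, p0
    for t in range(n):
        if np.isnan(y[t]):                          # missing observation: skip the update step
            x_filt[t], p_filt[t] = x_pred, p_pred
        else:
            k = p_pred / (p_pred + sigma_v2)        # Kalman gain
            x_filt[t] = x_pred + k * (y[t] - x_pred)
            p_filt[t] = (1.0 - k) * p_pred
        x_pred = phi * x_filt[t]                    # one-step prediction for t + 1
        p_pred = phi ** 2 * p_filt[t] + sigma_w2
    return x_filt, p_filt

y = np.array([1.0, 1.3, np.nan, 0.8, 1.1])
print(kalman_filter(y, phi=0.9, sigma_w2=0.1, sigma_v2=0.2, x0=0.0, p0=1.0))
```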
Item Open Access
Spatial processes with stochastic heteroscedasticity (Colorado State University. Libraries, 2008) Huang, Wenying, author; Breidt, F. Jay, advisor; Davis, Richard A., advisor
Stationary Gaussian processes are widely used in spatial data modeling and analysis. Stationarity is a relatively restrictive assumption regarding spatial association. By introducing stochastic volatility into a Gaussian process, we propose a stochastic heteroscedastic process (SHP) with conditional nonstationarity. That is, conditional on a latent Gaussian process, the SHP is a Gaussian process with non-stationary covariance structure. Unconditionally, the SHP is a stationary non-Gaussian process. The realizations from the SHP are versatile and can represent spatial inhomogeneities. The unconditional correlation of the SHP offers a rich class of correlation functions, which can also allow for a smoothed nugget effect. For maximum likelihood estimation, we propose to apply importance sampling in the likelihood calculation and latent process estimation. The importance density we constructed is of the same dimensionality as the observations. When the sample size is large, the importance sampling scheme becomes infeasible and/or inaccurate. A low-dimensional approximation model is developed to solve the numerical difficulties. We develop two spatial prediction methods: PBP (plug-in best predictor) and PBLUP (plug-in best linear unbiased predictor). Empirical results with simulated and real data show improved out-of-sample prediction performance of SHP modeling over stationary Gaussian process modeling. We extend the single-realization model to the SHP model with replicates. The spatial replications are modeled as independent realizations from an SHP model conditional on a common latent process. A simulation study shows substantial improvements in parameter estimation and process prediction when replicates are available. In an example with real atmospheric deposition data, the SHP model with replicates outperforms the Gaussian process model in prediction by capturing the spatial volatilities.
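The following sketch simulates one plausible construction consistent with the SHP description (not necessarily the dissertation's exact parameterization): conditional on a latent Gaussian log-volatility process, the observed field is Gaussian with covariance scaled by exp(alpha/2), hence conditionally non-stationary but unconditionally stationary and non-Gaussian. All covariance parameters are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(1)

def exp_cov(s, range_par, var):
    """Exponential covariance matrix on 1-D locations s."""
    d = np.abs(s[:, None] - s[None, :])
    return var * np.exp(-d / range_par)

s = np.linspace(0.0, 10.0, 200)
jitter = 1e-8 * np.eye(len(s))

# Latent Gaussian log-volatility process alpha(s).
L_alpha = np.linalg.cholesky(exp_cov(s, range_par=2.0, var=0.5) + jitter)
alpha = L_alpha @ rng.standard_normal(len(s))

# Underlying Gaussian process eps(s), then the conditionally heteroscedastic field.
L_eps = np.linalg.cholesky(exp_cov(s, range_par=1.0, var=1.0) + jitter)
eps = L_eps @ rng.standard_normal(len(s))
z = np.exp(alpha / 2.0) * eps
print(z[:5])
```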
Item Open Access
Estimation of structural breaks in nonstationary time series (Colorado State University. Libraries, 2008) Hancock, Stacey, author; Davis, Richard A., advisor; Iyer, Hari K., advisor
Many time series exhibit structural breaks in a variety of ways, the most obvious being a mean level shift. In this case, the mean level of the process is constant over periods of time, jumping to different levels at times called change-points. These jumps may be due to outside influences such as changes in government policy or manufacturing regulations. Structural breaks may also be a result of changes in variability or changes in the spectrum of the process. The goal of this research is to estimate where these structural breaks occur and to provide a model for the data within each stationary segment. The program Auto-PARM (Automatic Piecewise AutoRegressive Modeling procedure), developed by Davis, Lee, and Rodriguez-Yam (2006), uses the minimum description length principle to estimate the number and locations of change-points in a time series by fitting autoregressive models to each segment. The research in this dissertation shows that when the true underlying model is segmented autoregressive, the estimates obtained by Auto-PARM are consistent. Under a more general time series model exhibiting structural breaks, Auto-PARM's estimates of the number and locations of change-points are again consistent, and the segmented autoregressive model provides a useful approximation to the true process. Weak consistency proofs are given, as well as simulation results when the true process is not autoregressive. An example of the application of Auto-PARM, as well as a source of inspiration for this research, is the analysis of National Park Service sound data. These data were collected by the National Park Service over four years in around twenty of the National Parks by setting recording devices in several sites throughout the parks. The goal of the project is to estimate the amount of manmade sound in the National Parks. Though the project is in its initial stages, Auto-PARM provides a promising method for analyzing sound data by breaking the sound waves into pseudo-stationary pieces. Once the sound data have been broken into pieces, a classification technique can be applied to determine the type of sound in each segment.

Item Open Access
Data mining techniques for temporal point processes applied to insurance claims data (Colorado State University. Libraries, 2008) Iverson, Todd Ashley, author; Ben-Hur, Asa, advisor; Iyer, Hariharan K., advisor
We explore data mining on databases consisting of insurance claims information. This dissertation focuses on two major topics we considered by way of data mining procedures. One is the development of a classification rule using kernels and support vector machines. The other is the discovery of association rules using the Apriori algorithm, its extensions, as well as a new association rules technique. With regard to the first topic we address the question: can kernel methods using an SVM classifier be used to predict patients at risk of type 2 diabetes using three years of insurance claims data? We report the results of a study in which we tested the performance of new methods for data extracted from the MarketScan® database. We summarize the results of applying popular kernels, as well as new kernels constructed specifically for this task, for support vector machines on data derived from this database. We were able to predict patients at risk of type 2 diabetes with nearly 80% success when combining a number of specialized kernels. The specific form of the data, that of a timed sequence, led us to develop two new kernels inspired by dynamic time warping. The Global Time Warping (GTW) and Local Time Warping (LTW) kernels build on an existing time warping kernel by including the timing coefficients present in classical time warping, while providing a solution for the diagonal dominance present in most alignment methods. We show that the LTW kernel performs significantly better than the existing time warping kernel when the times contained relevant information. With regard to the second topic, we provide a new theorem on closed rules that could help substantially improve the time to find a specific type of rule. An insurance claims database contains codes indicating associated diagnoses and the resulting procedures for each claim. The rules that we consider are of the form diagnoses imply procedures. In addition, we introduce a new class of interesting association rules in the context of medical claims databases and illustrate their potential uses by extracting example rules from the MarketScan® database.
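The GTW and LTW kernels are specific to the Iverson dissertation; as background, the sketch below implements only the classical dynamic time warping recursion that such kernels build on. Kernels derived from alignment scores like this typically require an adjustment for diagonal dominance, which is the issue the abstract mentions.

```python
import numpy as np

def dtw_distance(a, b):
    """Classical dynamic time warping distance between two numeric sequences,
    the alignment idea underlying time-warping kernels for timed claim sequences."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

x = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
y = np.array([0.0, 0.5, 1.0, 2.0, 2.0, 1.0, 0.0])
print(dtw_distance(x, y))
```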
Item Open Access
Penalized estimation for sample surveys in the presence of auxiliary variables (Colorado State University. Libraries, 2008) Delorey, Mark J., author; Breidt, F. Jay, advisor
In conducting sample surveys, time and financial resources can be limited, but research questions are wide and varied. Thus, methods for analysis must make the best use of whatever data are available and produce results that address a variety of needs. Motivation for this research comes from surveys of aquatic resources, in which sample sizes are small to moderate, but auxiliary information is available to supplement measured survey responses. The problems of survey estimation considered here are tied together in their use of constrained/penalized estimation techniques for combining information from the auxiliary data and the responses of interest. We study a small area problem with the goal of obtaining a good ensemble estimate, that is, a collection of estimates for individual small areas that collectively give a good estimate of the overall distribution function across small areas. Often, estimators that are good for one purpose may not be good for others. For example, estimation of the distribution function itself (as in Cordy and Thomas, 1997) can address questions of variability and extremes but does not provide individual estimators of the small areas, nor is it appropriate when auxiliary information can be made of use. Bayes estimators are good individual estimators in terms of mean squared error but are not variable enough to represent ensemble traits (Ghosh, 1992). An algorithm that extends the constrained Bayes (CB) methods of Louis (1984) and Ghosh (1992) for use in a model with a general covariance matrix is presented. This algorithm produces estimators with properties similar to those of CB, and we refer to this method as general constrained Bayes (GCB). The ensemble GCB estimator is asymptotically unbiased for the posterior mean of the empirical distribution function (edf). The ensemble properties of transformed GCB estimates are investigated to determine if the desirable ensemble characteristics displayed by the GCB estimator are preserved under such transformations. The GCB algorithm is then applied to complex models such as conditional autoregressive spatial models and to penalized spline models. Illustrative examples include the estimation of lip cancer risk, mean water acidity, and rates of change in water acidity. We also study a moderate area problem in which the goal is to derive a set of survey weights that can be applied to each study variable with reasonable predictive results. Zheng and Little (2003) use penalized spline regression in a model-based approach for finite population estimation in a two-stage sample when predictor variables are available. Breidt et al. (2005) propose a class of model-assisted estimators based on penalized spline regression in single stage sampling. Because unbiasedness of the model-based estimator requires that the model be correctly specified, we look at extending model-assisted estimation to the two-stage case. By calibrating the degrees of freedom of the smooth to the most important study variables, a set of weights can be obtained that produce design consistent estimators for all study variables. The model-assisted estimator is compared to other estimators in a simulation study. Results from the simulation study show that the model-assisted estimator is comparable to other estimators when the model is correctly specified and generally superior when the model is incorrectly specified.
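As a point of reference for the model-assisted penalized spline estimators cited above (Breidt et al., 2005), the sketch below shows a single-stage version: fit a design-weighted penalized spline to the sample, sum the fitted values over the whole population, and add design-weighted residuals. The two-stage extension and the weight calibration developed in the dissertation are not attempted here, and the population, sample design, and penalty value are simulated and illustrative.

```python
import numpy as np

def pspline_basis(x, knots):
    """Truncated-linear penalized spline basis: [1, x, (x - k)_+ for each knot]."""
    cols = [np.ones_like(x), x] + [np.clip(x - k, 0.0, None) for k in knots]
    return np.column_stack(cols)

def model_assisted_total(y_s, x_s, pi_s, x_pop, knots, lam):
    """Model-assisted (difference) estimator of a population total: sum of fitted values
    over the population plus design-weighted residuals over the sample."""
    B_s = pspline_basis(x_s, knots)
    w = 1.0 / pi_s                                   # design weights
    D = np.diag([0.0, 0.0] + [1.0] * len(knots))     # penalize only the knot coefficients
    beta = np.linalg.solve(B_s.T @ (w[:, None] * B_s) + lam * D, B_s.T @ (w * y_s))
    fitted_pop = pspline_basis(x_pop, knots) @ beta
    resid_s = y_s - B_s @ beta
    return fitted_pop.sum() + np.sum(resid_s / pi_s)

rng = np.random.default_rng(2)
x_pop = rng.uniform(0, 1, 1000)                      # auxiliary variable, known for all units
y_pop = np.sin(2 * np.pi * x_pop) + rng.normal(0, 0.3, 1000)
pi = np.full(1000, 100 / 1000)                       # simple random sample of 100 units
s = rng.choice(1000, size=100, replace=False)
knots = np.quantile(x_pop, np.linspace(0.1, 0.9, 9))
print("estimated total:", model_assisted_total(y_pop[s], x_pop[s], pi[s], x_pop, knots, lam=1.0))
print("true total:     ", y_pop.sum())
```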
Item Open Access
Estimation for Lévy-driven CARMA processes (Colorado State University. Libraries, 2008) Yang, Yu, author; Brockwell, Peter J., advisor; Davis, Richard A., advisor
This thesis explores parameter estimation for Lévy-driven continuous-time autoregressive moving average (CARMA) processes, using uniformly and closely spaced discrete-time observations. Specifically, we focus on developing estimation techniques and asymptotic properties of the estimators for three particular families of Lévy-driven CARMA processes. Estimation for the first family, Gaussian autoregressive processes, was developed by deriving exact conditional maximum likelihood estimators of the parameters under the assumption that the process is observed continuously. The resulting estimates are expressed in terms of stochastic integrals which are then approximated using the available closely spaced discrete-time observations. We apply the results to both linear and non-linear autoregressive processes. For the second family, non-negative Lévy-driven Ornstein-Uhlenbeck processes, we take advantage of the non-negativity of the increments of the driving Lévy process to derive a highly efficient estimation procedure for the autoregressive coefficient when observations are available at uniformly spaced times. Asymptotic properties of the estimator are also studied, and a procedure for obtaining estimates of the increments of the driving Lévy process is developed. These estimated increments are important for identifying the nature of the driving Lévy process and for estimating its parameters. For the third family, non-negative Lévy-driven CARMA processes, we estimate the coefficients by maximizing the Gaussian likelihood of the observations and discuss the asymptotic properties of the estimators. We again show how to estimate the increments of the background driving Lévy process and hence to estimate the parameters of the Lévy process itself. We assess the performance of our estimation procedures by simulations and use them to fit models to real data sets in order to determine how the theory applies in practice.
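One natural estimator that exploits non-negativity of the driving increments, sketched below for a compound-Poisson-driven Ornstein-Uhlenbeck process, uses the fact that every ratio of successive observations is bounded below by exp(-theta*h); the minimum ratio then estimates the autoregressive coefficient. This is offered as an illustration of the idea, not as the dissertation's exact procedure, and all parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulate a non-negative Ornstein-Uhlenbeck process driven by a compound Poisson
# subordinator (exponential jump sizes), observed on a grid of spacing h.
theta, lam, h, n = 0.7, 2.0, 0.1, 2000
x = np.empty(n)
x[0] = 1.0
for i in range(1, n):
    k = rng.poisson(lam * h)                          # number of jumps in (t, t + h]
    times = rng.uniform(0.0, h, size=k)
    sizes = rng.exponential(1.0, size=k)
    increment = np.sum(sizes * np.exp(-theta * (h - times)))
    x[i] = np.exp(-theta * h) * x[i - 1] + increment

# Non-negative increments imply x[i+1]/x[i] >= exp(-theta*h), so the minimum ratio
# gives a simple estimate of theta.
ratio_min = np.min(x[1:] / x[:-1])
theta_hat = -np.log(ratio_min) / h
print(f"true theta = {theta}, ratio estimate = {theta_hat:.4f}")
```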
Item Open Access
Spatial models with applications in computer experiments (Colorado State University. Libraries, 2008) Wang, Ke, author; Davis, Richard A., advisor; Breidt, F. Jay, advisor
Often, a deterministic computer response is modeled as a realization from a stochastic process such as a Gaussian random field. Due to the limitation of the stationary Gaussian process (GP) in accommodating inhomogeneous smoothness, we consider modeling a deterministic computer response as a realization from a stochastic heteroskedastic process (SHP), a stationary non-Gaussian process. Conditional on a latent process, the SHP has a non-stationary covariance function and is a non-stationary GP. As such, the sample paths of this process exhibit greater variability and hence offer more modeling flexibility than those produced by a traditional GP model. We use maximum likelihood for inference in the SHP model, which is complicated by the high dimensionality of the latent process. Accordingly, we develop an importance sampling method for likelihood computation and use a low-rank kriging approximation to reconstruct the latent process. Responses at unobserved locations can be predicted using empirical best predictors or by empirical best linear unbiased predictors. In addition, prediction error variances are obtained. The SHP model can be used in an active learning context, adaptively selecting new locations that provide improved estimates of the response surface. Estimation, prediction, and adaptive sampling with the SHP model are illustrated with several examples. Our spatial model can be adapted to model the first partial derivative process. The derivative process provides additional information about the shape and smoothness of the underlying deterministic function and can assist in the prediction of responses at unobserved sites. The unconditional correlation function for the derivative process presents some interesting properties and can be used as a new class of spatial correlation functions. For parameter estimation, we propose to use a similar strategy to develop an importance sampling technique to compute the joint likelihood of responses and derivatives. The major difficulties of bringing in derivative information are the increase in the dimensionality of the latent process and the numerical problems of inverting the enlarged covariance matrix. Some possible ways to utilize this information more efficiently are proposed.

Item Open Access
Statistical modeling with COGARCH(p,q) processes (Colorado State University. Libraries, 2009) Chadraa, Erdenebaatar, author; Brockwell, Peter J., advisor
In this paper, a family of continuous time GARCH processes, generalizing the COGARCH(1, 1) process of Klüppelberg, et al. (2004), is introduced and studied. The resulting COGARCH(p,q) processes, q ≥ p ≥ 1, exhibit many of the characteristic features of observed financial time series, while their corresponding volatility and squared increment processes display a broader range of autocorrelation structures than those of the COGARCH(1, 1) process. We establish sufficient conditions for the existence of a strictly stationary non-negative solution of the equations for the volatility process and, under conditions which ensure the finiteness of the required moments, determine the autocorrelation functions of both the volatility and squared increment processes. The volatility process is found to have the autocorrelation function of a continuous-time ARMA process, while the squared increment process has the autocorrelation function of an ARMA process.

Item Open Access
Applications of generalized fiducial inference (Colorado State University. Libraries, 2009) E, Lidong, author; Iyer, Hariharan K., advisor
Hannig (2008) generalized Fisher's fiducial argument and obtained a fiducial recipe for interval estimation that is applicable in virtually any situation. In this dissertation research, we apply this fiducial recipe and fiducial generalized pivotal quantities to make inference in four practical problems. The list of problems we consider is: (a) confidence intervals for variance components in an unbalanced two-component normal mixed linear model; (b) confidence intervals for the median lethal dose (LD50) in bioassay experiments; (c) confidence intervals for the concordance correlation coefficient (CCC) in method comparison; (d) simultaneous confidence intervals for ratios of means of lognormal distributions. For all the fiducial generalized confidence intervals (a)-(d), we conducted a simulation study to evaluate their performance and compare them with other competing confidence interval procedures from the literature. We also proved that the intervals (a) and (d) have asymptotically exact frequentist coverage.
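To show the mechanics of a fiducial generalized pivotal quantity in the simplest possible case (this is a textbook normal-variance example, not one of problems (a)-(d) above), the sketch below draws from the fiducial distribution of a normal variance and reads off an interval from its quantiles.

```python
import numpy as np

rng = np.random.default_rng(4)

# Textbook GPQ for a normal variance: with observed sample variance s2 from n
# observations, the generalized pivotal quantity is R = (n - 1) * s2 / V, V ~ chi2(n - 1).
data = rng.normal(loc=10.0, scale=2.0, size=25)
n, s2 = len(data), np.var(data, ddof=1)

V = rng.chisquare(df=n - 1, size=100_000)
R = (n - 1) * s2 / V                                   # fiducial draws for sigma^2

lower, upper = np.quantile(R, [0.025, 0.975])
print(f"s2 = {s2:.3f}, 95% fiducial interval for sigma^2: ({lower:.3f}, {upper:.3f})")
```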
Item Open Access
Confidence regions for level curves and a limit theorem for the maxima of Gaussian random fields (Colorado State University. Libraries, 2009) French, Joshua, author; Davis, Richard A., advisor
One of the most common display tools used to represent spatial data is the contour plot. Informally, a contour plot is created by taking a "slice" of a three-dimensional surface at a certain level of the response variable and projecting the slice onto the two-dimensional coordinate plane. The "slice" at each level is known as a level curve.

Item Open Access
A fiducial approach to extremes and multiple comparisons (Colorado State University. Libraries, 2010) Wandler, Damian V., author; Hannig, Jan, advisor; Iyer, Hariharan K., advisor; Chong, Edwin Kah Pin, committee member; Wang, Haonan, committee member
Generalized fiducial inference is a powerful tool for many difficult problems. Based on an extension of R. A. Fisher's work, we used generalized fiducial inference for two extreme value problems and a multiple comparison procedure. The first extreme value problem deals with the generalized Pareto distribution. The generalized Pareto distribution is relevant to many situations when modeling extremes of random variables. We use a fiducial framework to perform inference on the parameters and the extreme quantiles of the generalized Pareto. This inference technique is demonstrated both when the threshold is a known parameter and when it is unknown. Simulation results suggest good empirical properties and compare favorably to similar Bayesian and frequentist methods. The second extreme value problem pertains to the largest mean of a multivariate normal distribution. Difficulties arise when two or more of the means are simultaneously the largest mean. Our solution uses a generalized fiducial distribution and allows for equal largest means to alleviate the overestimation that commonly occurs. Theoretical calculations, simulation results, and application suggest our solution possesses promising asymptotic and empirical properties. Our solution to the largest mean problem arose from our ability to identify the correct largest mean(s). This essentially became a model selection problem. As a result, we applied a similar model selection approach to the multiple comparison problem. We allowed for all possible groupings (of equality) of the means of k independent normal distributions. Our resulting fiducial probability for the groupings of the means demonstrates the effectiveness of our method by selecting the correct grouping at a high rate.
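For comparison with the fiducial treatment of generalized Pareto extremes described above, the sketch below runs the standard maximum likelihood peaks-over-threshold calculation: fit a generalized Pareto to exceedances over a threshold and convert the fit into an extreme quantile of the original variable. The data, the threshold choice, and the target quantile are purely illustrative, and the fiducial machinery itself is not reproduced.

```python
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(7)

x = rng.standard_t(df=4, size=5000)          # heavy-tailed sample
u = np.quantile(x, 0.95)                     # threshold (treated as known here)
exc = x[x > u] - u

c, loc, scale = genpareto.fit(exc, floc=0)   # fix location at zero for threshold excesses
zeta = exc.size / x.size                     # proportion of points above the threshold

p = 0.999                                    # target quantile of the original variable
q_exceed = 1.0 - (1.0 - p) / zeta            # corresponding quantile of the excess distribution
x_p = u + genpareto.ppf(q_exceed, c, loc=0, scale=scale)
print(f"estimated {p} quantile: {x_p:.3f} (empirical: {np.quantile(x, p):.3f})")
```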
Item Open Access
Nonparametric function smoothing: fiducial inference of free knot splines and ecological applications (Colorado State University. Libraries, 2010) Sonderegger, Derek Lee, author; Wang, Haonan, advisor; Hannig, Jan, advisor; Noon, Barry R. (Barry Richard), 1949-, committee member; Iyer, Hariharan K., committee member
Nonparametric function estimation has proven to be a useful tool for applied statisticians. Classic techniques such as locally weighted regression and smoothing splines are being used in a variety of circumstances to address questions at the forefront of ecological theory. We first examine an ecological threshold problem, defining a threshold as the point where the derivative of the estimated function changes state (negative, possibly zero, or positive), and present a graphical method that examines these state changes across a wide interval of smoothing levels. We apply this method to macroinvertebrate data from the Arkansas River. Next we investigate a measurement error model and a generalization of the commonly used regression calibration method whereby a nonparametric function is used instead of a linear function. We present a simulation study to assess the effectiveness of the method and apply the method to a water quality monitoring data set. The possibility of defining thresholds as knot point locations in smoothing splines led to the investigation of the fiducial distribution of free-knot splines. After introducing the theory behind fiducial inference, we derive conditions sufficient for asymptotic normality of the multivariate fiducial density. We then derive the fiducial density for an arbitrary degree spline with an arbitrary number of knot points and show that free-knot splines of degree 3 or greater satisfy the asymptotic normality conditions. Finally, we conduct a simulation study to assess the quality of the fiducial solution compared to three other commonly used methods.
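The graphical threshold idea, tracking where the derivative of a smoothed fit changes sign as the smoothing level varies, can be roughed out with an off-the-shelf smoothing spline as below. This uses synthetic data and scipy's UnivariateSpline rather than the free-knot and fiducial machinery of the dissertation, so it is only a sketch of the exploratory step.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(5)

# Toy data with a change in trend around x = 5 (purely illustrative).
x = np.linspace(0, 10, 200)
y = np.where(x < 5, 0.0, 0.8 * (x - 5)) + rng.normal(0, 0.3, x.size)

# Fit smoothing splines at several smoothing levels and record where the estimated
# derivative changes sign, one simple way to flag candidate thresholds.
for s in (5.0, 10.0, 20.0):
    spline = UnivariateSpline(x, y, k=3, s=s)
    deriv = spline.derivative()(x)
    sign_change = x[1:][np.diff(np.sign(deriv)) != 0]
    print(f"smoothing s = {s}: derivative sign changes near {np.round(sign_change, 2)}")
```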
Item Open Access
Saddlepoint approximation to functional equations in queueing theory and insurance mathematics (Colorado State University. Libraries, 2010) Chung, Sunghoon, author; Butler, Ronald W., advisor; Scharf, Louis L., committee member; Chapman, Phillip L., committee member; Hoeting, Jennifer A. (Jennifer Ann), 1966-, committee member
We study the application of saddlepoint approximations to statistical inference when the moment generating function (MGF) of the distribution of interest is an explicit or an implicit function of the MGF of another random variable which is assumed to be observed. In other words, let W(s) be the MGF of the random variable W of interest. We study the case when W(s) = h{G(s); λ}, where G(s) is the MGF of G for which a random sample can be obtained, and h is a smooth function. If Ĝ(s) estimates G(s), then Ŵ(s) = h{Ĝ(s); λ̂} estimates W(s). Generally, it can be shown that Ŵ(s) converges to W(s) by the strong law of large numbers, which implies that F̂(t), the cumulative distribution function (CDF) corresponding to Ŵ(s), converges to F(t), the CDF of W, almost surely. If we set Ŵ*(s) = h{Ĝ*(s); λ̂*}, where Ĝ*(s) and λ̂* are the empirical MGF and the estimator of λ from bootstrapping, the corresponding CDF F̂*(t) can be used to construct a confidence band for F(t). In this dissertation, we show that the saddlepoint inversion of Ŵ(s) is not only fast, reliable, stable, and accurate enough for general statistical inference, but also easy to use without deep knowledge of the probability theory regarding the stochastic process of interest. For the first part, we consider nonparametric estimation of the density and the CDF of the stationary waiting times W and Wq of an M/G/1 queue. These estimates are computed using saddlepoint inversion of Ŵ(s) determined from the Pollaczek-Khinchin formula. Our saddlepoint estimation is compared with estimators based on other approximations, including the Cramér-Lundberg approximation. For the second part, we consider the saddlepoint approximation for the busy period distribution FB(t) in an M/G/1 queue. The busy period B is the first passage time for the queueing system to pass from an initial arrival (1 in the system) to 0 in the system. If B(s) is the MGF of B, then B(s) is an implicitly defined function of G(s) and λ, the inter-arrival rate, through the well-known Kendall-Takács functional equation. As in the first part, we show that the saddlepoint approximation can be used to obtain F̂B(t), the CDF corresponding to B̂(s), and simulation results show that confidence bands for FB(t) based on bootstrapping perform well.
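The saddlepoint inversion described above rests on the Lugannani-Rice formula. The sketch below implements that formula for a distribution with a known closed-form cumulant generating function as a check of the machinery; the dissertation's actual applications plug in MGFs obtained from the Pollaczek-Khinchin and Kendall-Takács equations, which are not reproduced here.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm, gamma

def saddlepoint_cdf(t, K, K1, K2, s_lo, s_hi):
    """Lugannani-Rice saddlepoint approximation to P(X <= t) from the cumulant
    generating function K and its first two derivatives K1, K2.
    (A separate limit is needed when t equals the mean; omitted here.)"""
    s_hat = brentq(lambda s: K1(s) - t, s_lo, s_hi)          # saddlepoint: K'(s) = t
    w = np.sign(s_hat) * np.sqrt(2.0 * (s_hat * t - K(s_hat)))
    u = s_hat * np.sqrt(K2(s_hat))
    return norm.cdf(w) + norm.pdf(w) * (1.0 / w - 1.0 / u)

# Check the machinery on a Gamma(shape=3, rate=1) variable, whose CGF is known.
a = 3.0
K = lambda s: -a * np.log(1.0 - s)
K1 = lambda s: a / (1.0 - s)
K2 = lambda s: a / (1.0 - s) ** 2

for t in (1.0, 2.0, 6.0):
    approx = saddlepoint_cdf(t, K, K1, K2, s_lo=-50.0, s_hi=1.0 - 1e-9)
    print(f"t = {t}: saddlepoint {approx:.4f}, exact {gamma.cdf(t, a):.4f}")
```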
Item Open Access
Improved estimation for complex surveys using modern regression techniques (Colorado State University. Libraries, 2011) McConville, Kelly, author; Breidt, F. Jay, advisor; Lee, Thomas C. M., advisor; Opsomer, Jean, committee member; Lee, Myung-Hee, committee member; Doherty, Paul F., committee member
In the field of survey statistics, finite population quantities are often estimated based on complex survey data. In this thesis, estimation of the finite population total of a study variable is considered. The study variable is available for the sample and is supplemented by auxiliary information, which is available for every element in the finite population. Following a model-assisted framework, estimators are constructed that exploit the relationship which may exist between the study variable and ancillary data. These estimators have good design properties regardless of model accuracy. Nonparametric survey regression estimation is applicable in natural resource surveys where the relationship between the auxiliary information and the study variable is complex and of an unknown form. Breidt, Claeskens, and Opsomer (2005) proposed a penalized spline survey regression estimator and studied its properties when the number of knots is fixed. To build on their work, the asymptotic properties of the penalized spline regression estimator are considered when the number of knots goes to infinity and the locations of the knots are allowed to change. The estimator is shown to be design consistent and asymptotically design unbiased. In the course of the proof, a result is established on the uniform convergence in probability of the survey-weighted quantile estimators. This result is obtained by deriving a survey-weighted Hoeffding inequality for bounded random variables. A variance estimator is proposed and shown to be design consistent for the asymptotic mean squared error. Simulation results demonstrate the usefulness of the asymptotic approximations. Also in natural resource surveys, a substantial amount of auxiliary information, typically derived from remotely-sensed imagery and organized in the form of spatial layers in a geographic information system (GIS), is available. Some of this ancillary data may be extraneous, and a sparse model would be appropriate. Model selection methods are therefore warranted. The 'least absolute shrinkage and selection operator' (lasso), presented by Tibshirani (1996), conducts model selection and parameter estimation simultaneously by penalizing the sum of the absolute values of the model coefficients. A survey-weighted lasso criterion, which accounts for the sampling design, is derived, and a survey-weighted lasso estimator is presented. The root-n design consistency of the estimator and a central limit theorem result are proved. Several variants of the survey-weighted lasso estimator are constructed. In particular, a calibration estimator and a ridge regression approximation estimator are constructed to produce lasso weights that can be applied to several study variables. Simulation studies show the lasso estimators are more efficient than the regression estimator when the true model is sparse. The lasso estimators are used to estimate the proportion of tree canopy cover for a region of Utah. Under a joint design-model framework, the survey-weighted lasso coefficients are shown to be root-N consistent for the parameters of the superpopulation model, and a central limit theorem result is found. The methodology is applied to estimate the risk factors for the Zika virus from an epidemiological survey on the island of Yap. A logistic survey-weighted lasso regression model is fit to the data and important covariates are identified.

Item Open Access
Habitat estimation through synthesis of species presence/absence information and environmental covariate data (Colorado State University. Libraries, 2011) Dornan, Grant J., author; Givens, Geof H., advisor; Hoeting, Jennifer A., committee member; Chapman, Phillip L., committee member; Myrick, Christopher A., committee member
This paper investigates the statistical model developed by Foster, et al. (2011) to estimate marine habitat maps based on environmental covariate data and species presence/absence information while treating habitat definition probabilistically. The model assumes that two sites belonging to the same habitat have approximately the same species presence probabilities, and thus both environmental data and species presence observations can help to distinguish habitats at locations across a study region. I develop a computational method to estimate the model parameters by maximum likelihood using a blocked non-linear Gauss-Seidel algorithm. The main part of my work is developing and conducting simulation studies to evaluate estimation performance and to study related questions, including the impacts of sample size, model bias, and model misspecification. Seven testing scenarios are developed, including between 3 and 9 habitats, 15 and 40 species, and 150 and 400 sampling sites. Estimation performance is primarily evaluated through fitted habitat maps and is shown to be excellent for the seven example scenarios examined. Rates of successful habitat classification ranged from 0.92 to 0.98. I show that there is a roughly balanced tradeoff between increasing the number of sites and increasing the number of species for improving estimation performance. Standard model selection techniques are shown to work for selection of covariates, but selection of the number of habitats benefits from supplementing quantitative techniques with qualitative expert judgement. Although estimation of habitat boundaries is extremely good, the rate of probabilistic transition between habitats is shown to be difficult to estimate accurately. Future research should address this issue. An appendix to this thesis includes a comprehensive and annotated collection of R code developed during this project.
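Returning to the survey-weighted lasso described in the McConville abstract above: the sketch below minimizes a design-weighted lasso criterion (squared residuals weighted by inverse inclusion probabilities plus an L1 penalty) by coordinate descent with soft-thresholding. The criterion form, weights, and data are assumed for illustration; the dissertation's calibration and ridge-approximation variants are not attempted.

```python
import numpy as np

def survey_weighted_lasso(X, y, w, lam, n_iter=200):
    """Coordinate descent for a design-weighted lasso criterion
        sum_i w_i (y_i - b0 - x_i' b)^2 / 2 + lam * sum_j |b_j|,
    with an unpenalized intercept handled by weighted centering."""
    xbar = np.average(X, axis=0, weights=w)
    ybar = np.average(y, weights=w)
    Xc, yc = X - xbar, y - ybar
    beta = np.zeros(X.shape[1])
    denom = Xc.T ** 2 @ w                                  # sum_i w_i x_ij^2 for each j
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            r_j = yc - Xc @ beta + Xc[:, j] * beta[j]      # partial residual excluding j
            rho = Xc[:, j] @ (w * r_j)
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / denom[j]  # soft threshold
    b0 = ybar - xbar @ beta
    return b0, beta

rng = np.random.default_rng(8)
n, p = 200, 10
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, 0.0, -1.5, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0])   # sparse truth
y = 1.0 + X @ beta_true + rng.normal(0, 0.5, n)
w = 1.0 / rng.uniform(0.2, 0.8, n)                         # hypothetical inverse inclusion probs
print(survey_weighted_lasso(X, y, w, lam=50.0))
```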
Item Open Access
Bayesian shape-restricted regression splines (Colorado State University. Libraries, 2011) Hackstadt, Amber J., author; Hoeting, Jennifer, advisor; Meyer, Mary, advisor; Opsomer, Jean, committee member; Huyvaert, Kate, committee member
Semi-parametric and non-parametric function estimation are useful tools to model the relationship between design variables and response variables as well as to make predictions without requiring the assumption of a parametric form for the regression function. Additionally, Bayesian methods have become increasingly popular in statistical analysis since they provide a flexible framework for the construction of complex models and produce a joint posterior distribution for the coefficients that allows for inference through various sampling methods. We use non-parametric function estimation and a Bayesian framework to estimate regression functions with shape restrictions. Shape-restricted functions include functions that are monotonically increasing, monotonically decreasing, convex, concave, and combinations of these restrictions such as increasing and convex. Shape restrictions allow researchers to incorporate knowledge about the relationship between variables into the estimation process. We propose Bayesian semi-parametric models for regression analysis under shape restrictions that use a linear combination of shape-restricted regression splines such as I-splines or C-splines. We find function estimates using Markov chain Monte Carlo (MCMC) algorithms. The Bayesian framework along with MCMC allows us to perform model selection and produce uncertainty estimates much more easily than in the frequentist paradigm. Indeed, some of the work proposed in this dissertation has not been developed in parallel in the frequentist paradigm. We begin by proposing a semi-parametric generalized linear model for regression analysis under shape restrictions. We provide Bayesian shape-restricted regression spline (Bayes SRRS) models and MCMC estimation algorithms for the normal errors, Bernoulli, and Poisson models. We propose several types of inference that can be performed for the normal errors model as well as examine the asymptotic behavior of the estimates for the normal errors model under the monotone shape restriction. We also examine the small sample behavior of the proposed Bayes SRRS model estimates via simulation studies. We then extend the semi-parametric Bayesian shape-restricted regression splines to generalized linear mixed models. We provide an MCMC algorithm to estimate functions for the random intercept model with normal errors under the monotone shape restriction. We then further extend the semi-parametric Bayesian shape-restricted regression splines to allow the number and location of the knot points for the regression splines to be random, and propose a reversible jump Markov chain Monte Carlo (RJMCMC) algorithm for regression function estimation under the monotone shape restriction. Lastly, we propose a Bayesian shape-restricted regression spline change-point model where the regression function is shape-restricted except at the change-points. We provide RJMCMC algorithms to estimate functions with change-points where the number and location of interior knot points for the regression splines are random. We provide an RJMCMC algorithm to estimate the location of an unknown change-point as well as an RJMCMC algorithm to decide between a model with no change-points and a model with a change-point.
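The monotone shape restriction in this abstract comes from placing nonnegativity constraints on spline coefficients. The sketch below shows that idea in its simplest frequentist form: a least squares fit with nonnegative coefficients on an increasing ramp basis, which is a crude stand-in for the I-spline construction and does not attempt the Bayes SRRS or RJMCMC machinery. The basis, knots, and data are illustrative.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(6)

def monotone_basis(x, knots):
    """Intercept, x, and ramp functions (x - k)_+; nonnegative slope coefficients give a
    monotonically increasing piecewise-linear fit (a crude stand-in for I-splines)."""
    return np.column_stack([np.ones_like(x), x] + [np.clip(x - k, 0.0, None) for k in knots])

x = np.linspace(0, 1, 100)
y = np.log1p(5 * x) + rng.normal(0, 0.1, x.size)        # increasing signal plus noise

knots = np.linspace(0.1, 0.9, 8)
B = monotone_basis(x, knots)

# Least squares with all coefficients constrained nonnegative. The intercept should be
# unconstrained, so shift the response by its minimum as a cheap workaround.
shift = y.min()
coef, _ = nnls(B, y - shift)
fit = B @ coef + shift
print("fit is nondecreasing:", bool(np.all(np.diff(fit) >= -1e-10)))
```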