# Theses and Dissertations

## Permanent URI for this collection

## Browse

### Browsing Theses and Dissertations by Title

Now showing 1 - 20 of 70


#### A fiducial approach to extremes and multiple comparisons

Open Access (Colorado State University. Libraries, 2010)
Wandler, Damian V., author; Hannig, Jan, advisor; Iyer, Hariharan K., advisor; Chong, Edwin Kah Pin, committee member; Wang, Haonan, committee member

Generalized fiducial inference is a powerful tool for many difficult problems. Building on an extension of R. A. Fisher's work, we apply generalized fiducial inference to two extreme value problems and a multiple comparison procedure. The first extreme value problem concerns the generalized Pareto distribution, which is relevant to many situations when modeling extremes of random variables. We use a fiducial framework to perform inference on the parameters and the extreme quantiles of the generalized Pareto distribution. This inference technique is demonstrated both when the threshold is a known parameter and when it is unknown. Simulation results suggest good empirical properties, and the method compares favorably to similar Bayesian and frequentist methods. The second extreme value problem pertains to the largest mean of a multivariate normal distribution. Difficulties arise when two or more of the means are simultaneously the largest. Our solution uses a generalized fiducial distribution and allows for equal largest means to alleviate the overestimation that commonly occurs. Theoretical calculations, simulation results, and an application suggest our solution possesses promising asymptotic and empirical properties. Our solution to the largest mean problem arose from our ability to identify the correct largest mean(s), which is essentially a model selection problem. As a result, we applied a similar model selection approach to the multiple comparison problem, allowing for all possible groupings (of equality) of the means of k independent normal distributions.
Our resulting fiducial probability for the groupings of the means demonstrates the effectiveness of our method by selecting the correct grouping at a high rate.

#### A penalized estimation procedure for varying coefficient models

Open Access (Colorado State University. Libraries, 2015)
Tu, Yan, author; Wang, Haonan, advisor; Breidt, F. Jay, committee member; Chapman, Phillip, committee member; Luo, J. Rockey, committee member

Varying coefficient models are widely used for analyzing longitudinal data. Various methods for estimating coefficient functions have been developed over the years. We revisit the problem under the theme of functional sparsity. The problem of sparsity, including global sparsity and local sparsity, is a recurrent topic in nonparametric function estimation. A function has global sparsity if it is zero over the entire domain, which indicates that the corresponding covariate is irrelevant to the response variable. A function has local sparsity if it is nonzero overall but remains zero on a set of intervals, which identifies inactive periods of the corresponding covariate. Each type of sparsity has been addressed in the literature using the idea of regularization to improve estimation as well as interpretability. In this dissertation, a penalized estimation procedure is developed to achieve functional sparsity, that is, to address both types of sparsity simultaneously in a unified framework. We exploit the properties of B-spline approximation and group bridge penalization. Our method is illustrated in a simulation study and a real data analysis, and it outperforms existing methods in identifying both local and global sparsity. Asymptotic properties of estimation consistency and sparsistency of the proposed method are established.
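The B-spline approximation underlying this kind of penalized procedure can be illustrated with the Cox-de Boor recursion. The sketch below is a generic pure-Python illustration, not the dissertation's implementation; the clamped knot vector is a hypothetical example.

```python
def bspline_basis(x, knots, degree, i):
    """Evaluate the i-th B-spline basis function of the given degree at x
    via the Cox-de Boor recursion (half-open support intervals)."""
    if degree == 0:
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    value = 0.0
    left_den = knots[i + degree] - knots[i]
    if left_den > 0:
        value += (x - knots[i]) / left_den * bspline_basis(x, knots, degree - 1, i)
    right_den = knots[i + degree + 1] - knots[i + 1]
    if right_den > 0:
        value += (knots[i + degree + 1] - x) / right_den * bspline_basis(x, knots, degree - 1, i + 1)
    return value

# A coefficient function is approximated as sum_i c[i] * bspline_basis(t, knots, degree, i);
# sparsity penalties then act on the coefficient vector c.
# Hypothetical clamped knot vector for quadratic (degree-2) splines on [0, 3]:
knots = [0.0, 0.0, 0.0, 1.0, 2.0, 3.0, 3.0, 3.0]
```

Because the basis functions are nonnegative and sum to one at any interior point, zeroing a group of adjacent coefficients forces the fitted function to zero on an interval, which is exactly the local-sparsity structure described above.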
The term sparsistency refers to the property that the functional sparsity can be consistently detected.

#### Adjusting for capture, recapture, and identity uncertainty when estimating detection probability from capture-recapture surveys

Open Access (Colorado State University. Libraries, 2015)
Edmondson, Stacy L., author; Givens, Geof, advisor; Opsomer, Jean, committee member; Kokoszka, Piotr, committee member; Noon, Barry, committee member

When applying capture-recapture analysis methods, estimates of detection probability, and hence abundance, can be biased if individuals of a population are not correctly identified (Creel et al., 2003). My research, motivated by the 2010 and 2011 surveys of Western Arctic bowhead whales conducted off the shores of Barrow, Alaska, offers two methods for addressing the complex scenario in which one individual may be mistaken for another from the same population, thus creating erroneous recaptures. The first method uses a likelihood-weighted capture-recapture method to account for three sources of uncertainty in the matching process. I illustrate this approach with a detailed application to the whale data. The second method develops an explicit model for match errors and uses MCMC methods to estimate model parameters. Implementation of this approach must overcome significant hurdles arising from the enormous number and complexity of potential catch history configurations when matches are uncertain. The performance of this approach is evaluated using a large set of Monte Carlo simulation tests. Results of these tests range from good to weak performance, depending on factors including detection probability, number of sightings, and error rates.
Finally, this model is applied to a portion of the bowhead survey data and found to produce plausible and scientifically informative results as long as the MCMC algorithm is started at a reasonable point in the space of possible catch history configurations.

#### Advances in statistical analysis and modeling of extreme values motivated by atmospheric models and data products

Open Access (Colorado State University. Libraries, 2018)
Fix, Miranda J., author; Cooley, Daniel, advisor; Hoeting, Jennifer, committee member; Wilson, Ander, committee member; Barnes, Elizabeth, committee member

This dissertation presents applied and methodological advances in the statistical analysis and modeling of extreme values. We detail three studies motivated by the types of data found in the atmospheric sciences, such as deterministic model output and observational products. The first two investigations represent novel applications and extensions of extremes methodology to climate and atmospheric studies. The third proposes a new model for areal extremes and develops methods for estimation and inference under that model. We first detail a study that leverages two initial-condition ensembles of a global climate model to compare future precipitation extremes under two climate change scenarios. We fit non-stationary generalized extreme value (GEV) models to annual maximum daily precipitation output and compare impacts under the RCP8.5 and RCP4.5 scenarios. A methodological contribution of this work is to demonstrate the potential of a "pattern scaling" approach for extremes, in which we produce predictive GEV distributions of annual precipitation maxima under RCP4.5 given only global mean temperatures for that scenario. We compare results from this less computationally intensive method to those obtained from our GEV model fitted directly to the RCP4.5 output and find that pattern scaling produces reasonable projections.
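For context on the GEV fits described above: once the location, scale, and shape parameters of an annual-maximum GEV distribution are estimated, return levels follow from a closed-form quantile. The pure-Python sketch below is a generic illustration, not the dissertation's code, and the parameter values are hypothetical.

```python
import math

def gev_cdf(x, mu, sigma, xi):
    """GEV distribution function for shape xi != 0."""
    t = 1.0 + xi * (x - mu) / sigma
    if t <= 0:
        return 0.0 if xi > 0 else 1.0
    return math.exp(-t ** (-1.0 / xi))

def gev_return_level(p, mu, sigma, xi):
    """Level exceeded in a given year with probability p (xi != 0):
    the (1 - p) quantile of the annual-maximum distribution."""
    yp = -math.log(1.0 - p)
    return mu - (sigma / xi) * (1.0 - yp ** (-xi))

# Hypothetical annual-maximum GEV parameters (mu=40, sigma=8, xi=0.1):
z20 = gev_return_level(1 / 20, 40.0, 8.0, 0.1)    # 20-year return level
z100 = gev_return_level(1 / 100, 40.0, 8.0, 0.1)  # 100-year return level
```

Comparing such return levels across scenarios (e.g., RCP8.5 vs. RCP4.5 fits) is one common way impacts on precipitation extremes are summarized.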
The second study examines, for the first time, the capability of an atmospheric chemistry model to reproduce observed meteorological sensitivities of high and extreme surface ozone (O3). This work develops a novel framework in which we make three types of comparisons between simulated and observational data, comparing (1) tails of the O3 response variable, (2) distributions of meteorological predictor variables, and (3) sensitivities of high and extreme O3 to meteorological predictors. The last comparison is made using quantile regression and a recent tail dependence optimization approach. Across all three study locations, we find substantial differences between simulations and observational data in both meteorology and the meteorological sensitivities of high and extreme O3. The final study is motivated by the prevalence of large gridded data products in the atmospheric sciences and presents methodological advances in the (finite-dimensional) spatial setting. Existing models for spatial extremes, such as max-stable process models, tend to be geostatistical in nature as well as very computationally intensive. Instead, we propose a new model for extremes of areal data, with a common-scale extension, inspired by the simultaneous autoregressive (SAR) model of classical spatial statistics. The proposed model extends recent work on transformed-linear operations applied to regularly varying random vectors and is unique among extremes models in being directly analogous to a classical linear model. We specify a sufficient condition on the spatial dependence parameter such that our extreme SAR model has desirable properties. We also describe the limiting angular measure, which is discrete, and the corresponding tail pairwise dependence matrix (TPDM) for the model. After examining model properties, we investigate two approaches to estimation and inference for the common-scale extreme SAR model.
First, we consider a censored likelihood approach, implemented using Bayesian MCMC with a data augmentation step, but find that this approach is not robust to model misspecification. As an alternative, we develop a novel estimation method that minimizes the discrepancy between the TPDM of the fitted model and the estimated TPDM, and find that it produces reasonable estimates of extremal dependence even under model misspecification.

#### Analysis of structured data and big data with application to neuroscience

Open Access (Colorado State University. Libraries, 2015)
Sienkiewicz, Ela, author; Wang, Haonan, advisor; Meyer, Mary, committee member; Breidt, F. Jay, committee member; Hayne, Stephen, committee member

Neuroscience research leads to a remarkable set of statistical challenges, many of them due to the complexity of the brain: its intricate structure and its dynamical, non-linear, often non-stationary behavior. The challenge of modeling brain functions is magnified by the quantity and inhomogeneity of data produced by scientific studies. Here we show how to take advantage of advances in distributed and parallel computing to mitigate memory and processor constraints and to develop models of neural components and neural dynamics. First we consider the problem of function estimation and selection in time-series functional dynamical models. Our motivating application is the point-process spiking activity recorded from the brain, which poses major computational challenges for modeling even moderately complex brain functionality. We present a big data approach to the identification of sparse nonlinear dynamical systems using generalized Volterra kernels and their approximation by B-spline basis functions. The performance of the proposed method is demonstrated in experimental studies. We also consider a set of unlabeled tree objects with topological and geometric properties.
For each data object, two curve representations are developed to characterize its topological and geometric aspects. We further define notions of topological and geometric medians, as well as quantiles, based on both representations. In addition, we take a novel approach to defining Pareto medians and quantiles through a multi-objective optimization problem. In particular, we study two different objective functions, which measure topological variation and geometric variation respectively. Analytical solutions are provided for topological and geometric medians and quantiles; for Pareto medians and quantiles in general, a genetic algorithm is implemented. The proposed methods are applied to analyze a data set of pyramidal neurons.

#### Application of statistical and deep learning methods to power grids

Open Access (Colorado State University. Libraries, 2023)
Rimkus, Mantautas, author; Kokoszka, Piotr, advisor; Wang, Haonan, advisor; Nielsen, Aaron, committee member; Cooley, Dan, committee member; Chen, Haonan, committee member

The structure of power flows in transmission grids is evolving and is likely to change significantly in the coming years due to the rapid growth of renewable energy generation, which introduces randomness and bidirectional power flows. Another transformative aspect is the increasing penetration of various smart-meter technologies. Inexpensive measurement devices can be placed at practically any component of the grid. As a result, traditional fault detection methods may no longer be sufficient. Consequently, there is growing interest in developing new methods to detect power grid faults. Using model data, we first propose a two-stage procedure for detecting a fault in a regional power grid. In the first stage, a fault is detected in real time. In the second stage, the faulted line is identified with a negligible delay. The approach uses only the voltage modulus measured at buses (nodes of the grid) as the input.
Our method does not require prior knowledge of the fault type. We further explore fault detection based on the high-frequency data streams that are becoming available in modern power grids. Our approach can be treated as an online (sequential) change point monitoring methodology. However, due to the mostly unexplored and very nonstandard structure of high-frequency power grid streaming data, substantial new statistical development is required to make this methodology practically applicable. The work includes the development of scalar detectors based on multichannel data streams, the determination of data-driven alarm thresholds, and an investigation of the performance and robustness of the new tools. Thanks to a reasonably large database of faults, we can calculate the frequencies of false and correct fault signals and recommend implementations that optimize these empirical success rates. Next, we extend our proposed method for fault localization in a regional grid to scenarios where partial observability limits the available data. While classification methods have been proposed for fault localization, their effectiveness depends on the availability of labeled data, which is often impractical in real-life situations. Our approach bridges the gap between partial and full observability of the power grid. We develop efficient fault localization methods that can operate effectively even when only a subset of power grid bus data is available. This work contributes to the research area of fault diagnosis in scenarios where the number of available phasor measurement unit devices is smaller than the number of buses in the grid. We propose using graph neural networks in combination with statistical fault localization methods to localize faults in a regional power grid with minimal available data.
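The online change point monitoring described above can be illustrated, in highly simplified form, by a one-sided CUSUM detector on a single stream. This is a generic textbook sketch, not the multichannel detectors developed in the dissertation, and the slack and threshold values are hypothetical.

```python
def cusum_alarm(stream, target=0.0, slack=0.5, threshold=3.0):
    """One-sided CUSUM: accumulate evidence of an upward mean shift and
    return the index of the first alarm, or None if no alarm is raised."""
    s = 0.0
    for i, x in enumerate(stream):
        # Reset to zero whenever the evidence drops below zero.
        s = max(0.0, s + (x - target - slack))
        if s > threshold:
            return i
    return None

# In-control samples followed by a mean shift at index 10:
stream = [0.0] * 10 + [2.0] * 10
```

In practice the alarm threshold would be data-driven (calibrated to a target false-alarm rate), which is one of the contributions the abstract describes.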
Our contribution to the field of fault localization aims to enable the adoption of effective fault localization methods for future power grids.

#### Bayesian methods for environmental exposures: mixtures and missing data

Open Access (Colorado State University. Libraries, 2022)
Hoskovec, Lauren, author; Wilson, Ander, advisor; Magzamen, Sheryl, committee member; Hoeting, Jennifer, committee member; Cooley, Dan, committee member

Air pollution exposure has been linked to increased morbidity and mortality. Estimating the association between air pollution exposure and health outcomes is complicated by simultaneous exposure to multiple pollutants, referred to as a multipollutant mixture. In a multipollutant mixture, exposures may have both independent and interactive effects on health. In addition, observational studies of air pollution exposure often involve missing data. In this dissertation, we address challenges related to model choice and missing data when studying exposure to a mixture of environmental pollutants. First, we conduct a formal simulation study of recently developed methods for estimating the association between a health outcome and exposure to a multipollutant mixture. We evaluate the methods on their performance in estimating the exposure-response function, identifying mixture components associated with the outcome, and identifying interaction effects. Other studies have reviewed the literature or compared performance on a single data set; none, however, have formally compared such a broad range of new methods in a simulation study. Second, we propose a statistical method to analyze multiple asynchronous multivariate time series with missing data, for use in personal exposure assessments. We develop an infinite hidden Markov model for multiple time series to impute missing data and identify shared time-activity patterns in exposures.
We estimate hidden states that represent latent environments, each presenting a unique distribution of a mixture of environmental exposures. Through our multiple imputation algorithm, we impute missing exposure data conditional on the hidden states. Finally, we conduct an individual-level study of the association between long-term exposure to air pollution and COVID-19 severity in a Denver, Colorado, USA cohort. We develop a Bayesian multinomial logistic regression model for data with partially missing categorical outcomes. Our model uses Polya-gamma data augmentation, and we propose a visualization approach for inference on the odds ratio. We conduct one of the first individual-level studies of air pollution exposure and COVID-19 health outcomes using detailed clinical data and individual-level air pollution exposure data.

#### Bayesian methods for spatio-temporal ecological processes using imagery data

Open Access (Colorado State University. Libraries, 2021)
Lu, Xinyi, author; Hooten, Mevin, advisor; Kaplan, Andee, committee member; Fosdick, Bailey, committee member; Koons, David, committee member

In this dissertation, I present novel Bayesian hierarchical models to statistically characterize spatio-temporal ecological processes. I am motivated by the volatility of Alaskan ecosystems in the face of global climate change, and I demonstrate methods for emerging imagery data as survey technologies advance. For the nearshore marine ecosystem, I developed a model that combines ecological diffusion and logistic growth to quantify the colonization dynamics of a population that establishes long-term equilibrium over a heterogeneous environment. I also unified modeling concepts from entity resolution and capture-recapture to identify unique individuals of the population from overlapping images and infer total abundance.
For the terrestrial ecosystem, I developed a stochastic state-space model to quantify the impact of climate change on the structural transformation of land cover types. The methods presented in this dissertation provide interpretable inference and employ statistical computing strategies to achieve scalability.

#### Bayesian models and streaming samplers for complex data with application to network regression and record linkage

Open Access (Colorado State University. Libraries, 2023)
Taylor, Ian M., author; Kaplan, Andee, advisor; Fosdick, Bailey K., advisor; Keller, Kayleigh P., committee member; Koslovsky, Matthew D., committee member; van Leeuwen, Peter Jan, committee member

Real-world statistical problems often feature complex data, due either to the structure of the data itself or to the methods used to collect the data. In this dissertation, we present three methods for the analysis of such complex data: Restricted Network Regression, Streaming Record Linkage, and Generative Filtering. Network data contain observations about the relationships between entities. Applying mixed models to network data can be problematic when the primary interest is estimating unconditional regression coefficients and some covariates are exactly or nearly in the vector space of node-level effects. We introduce the Restricted Network Regression model, which removes the collinearity between fixed and random effects in network regression by orthogonalizing the random effects against the covariates. We discuss the change in interpretation of the regression coefficients under Restricted Network Regression and analytically characterize its effect on the regression coefficients for continuous response data. We show through simulation on continuous and binary data that Restricted Network Regression mitigates, but does not alleviate, network confounding.
We apply the Restricted Network Regression model in an analysis of 2015 Eurovision Song Contest voting data and show how the choice of regression model affects inference. Data collected from multiple noisy sources pose challenges to analysis due to potential errors and duplicates. Record linkage is the task of combining records from multiple files that refer to overlapping sets of entities when there is no unique identifying field. In streaming record linkage, files arrive sequentially in time and estimates of links are updated after the arrival of each file. We approach streaming record linkage from a Bayesian perspective, with estimates calculated from posterior samples of parameters, and present methods for updating link estimates after the arrival of a new file that are faster than fitting a joint model with each new data file. We generalize a two-file Bayesian Fellegi-Sunter model to the multi-file case and propose two methods to perform streaming updates. We examine the effect of the prior distribution on the resulting linkage accuracy, as well as the computational trade-offs between the methods when compared to a Gibbs sampler, through simulated and real-world survey panel data. We achieve near-equivalent posterior inference at a small fraction of the compute time. Motivated by the streaming data setting and streaming record linkage, we propose a more general sampling method for Bayesian models of streaming data. In the streaming data setting, Bayesian models can employ recursive updates, incorporating each new batch of data into the model parameters' posterior distribution. Filtering methods are currently used to perform these updates efficiently; however, they suffer from eventual degradation as the number of unique values within the filtered samples decreases. We propose Generative Filtering, a method for efficiently performing recursive Bayesian updates in the streaming setting.
Generative Filtering retains the speed of a filtering method while using parallel updates to avoid degenerate distributions after repeated applications. We derive rates of convergence for Generative Filtering and conditions for the use of sufficient statistics instead of storing all past data. We investigate the properties of Generative Filtering through simulation and through ecological species count data.

#### Bayesian shape-restricted regression splines

Open Access (Colorado State University. Libraries, 2011)
Hackstadt, Amber J., author; Hoeting, Jennifer, advisor; Meyer, Mary, advisor; Opsomer, Jean, committee member; Huyvaert, Kate, committee member

Semi-parametric and non-parametric function estimation are useful tools for modeling the relationship between design variables and response variables, and for making predictions, without requiring the assumption of a parametric form for the regression function. Additionally, Bayesian methods have become increasingly popular in statistical analysis because they provide a flexible framework for the construction of complex models and produce a joint posterior distribution for the coefficients that allows for inference through various sampling methods. We use non-parametric function estimation and a Bayesian framework to estimate regression functions with shape restrictions. Shape-restricted functions include functions that are monotonically increasing, monotonically decreasing, convex, or concave, as well as combinations of these restrictions such as increasing and convex. Shape restrictions allow researchers to incorporate knowledge about the relationship between variables into the estimation process. We propose Bayesian semi-parametric models for regression analysis under shape restrictions that use a linear combination of shape-restricted regression splines such as I-splines or C-splines. We find function estimates using Markov chain Monte Carlo (MCMC) algorithms.
The Bayesian framework, along with MCMC, allows us to perform model selection and produce uncertainty estimates much more easily than in the frequentist paradigm; indeed, some of the work proposed in this dissertation has no parallel development in the frequentist paradigm. We begin by proposing a semi-parametric generalized linear model for regression analysis under shape restrictions. We provide Bayesian shape-restricted regression spline (Bayes SRRS) models and MCMC estimation algorithms for the normal errors, Bernoulli, and Poisson models. We propose several types of inference that can be performed for the normal errors model and examine the asymptotic behavior of its estimates under the monotone shape restriction. We also examine the small-sample behavior of the proposed Bayes SRRS model estimates via simulation studies. We then extend the semi-parametric Bayesian shape-restricted regression splines to generalized linear mixed models, providing an MCMC algorithm to estimate functions for the random intercept model with normal errors under the monotone shape restriction. We further extend the approach to allow the number and location of the knot points for the regression splines to be random, and propose a reversible jump Markov chain Monte Carlo (RJMCMC) algorithm for regression function estimation under the monotone shape restriction. Lastly, we propose a Bayesian shape-restricted regression spline change-point model in which the regression function is shape-restricted except at the change-points. We provide RJMCMC algorithms to estimate functions with change-points where the number and location of interior knot points for the regression splines are random.
We provide an RJMCMC algorithm to estimate the location of an unknown change-point, as well as an RJMCMC algorithm to decide between a model with no change-points and a model with one change-point.

#### Bayesian treed distributed lag models

Open Access (Colorado State University. Libraries, 2021)
Mork, Daniel S., author; Wilson, Ander, advisor; Sharp, Julia, committee member; Keller, Josh, committee member; Neophytou, Andreas, committee member

In many applications there is interest in regressing an outcome on exposures observed over a preceding time window. This frequently arises in environmental epidemiology, where either a health outcome on one day is regressed on environmental exposures (e.g., temperature or air pollution) observed on that day and several preceding days, or a birth or children's health outcome is regressed on exposures observed daily or weekly throughout pregnancy. The distributed lag model (DLM) is a statistical method commonly implemented to estimate an exposure-time-response function by regressing the outcome on repeated measures of a single exposure over a preceding time period, for example, mean exposure during each week of pregnancy. Inferential goals include estimating the exposure-time-response function and identifying critical windows during which exposures can alter a health endpoint. In this dissertation, we develop novel formulations of Bayesian additive regression trees that allow for estimating a DLM. First, we propose treed distributed lag nonlinear models to estimate the association between weekly maternal exposure to air pollution and a birth outcome when the exposure-response relation is nonlinear. We introduce a regression tree-based model that accommodates a multivariate predictor along with parametric control for fixed effects. Second, we propose a tree-based method for estimating the association between repeated measures of a mixture of multiple pollutants and a health outcome.
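A distributed lag model as described above regresses the outcome at time t on the vector of current and lagged exposures, and building that lagged design matrix is the mechanical first step of any DLM fit. The sketch below is a generic illustration, not the treed DLM proposed in the dissertation, and the exposure values are hypothetical.

```python
def lagged_design(exposure, n_lags):
    """Build DLM design rows: row for time t holds (x_t, x_{t-1}, ..., x_{t-n_lags}).
    Rows are only formed once a full lag window is available."""
    rows = []
    for t in range(n_lags, len(exposure)):
        rows.append([exposure[t - l] for l in range(n_lags + 1)])
    return rows

# Hypothetical weekly exposures; each row pairs with the outcome at time t,
# and the fitted lag coefficients trace out the exposure-time-response function.
X = lagged_design([1.0, 2.0, 3.0, 4.0, 5.0], n_lags=2)
```

Critical-window identification then amounts to asking which lag coefficients are distinguishable from zero, which the treed formulations above estimate with structured smoothness across adjacent lags.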
The proposed approach introduces regression tree pairs, which allow for estimation of marginal effects of exposures along with structured interactions that account for the temporal ordering of the exposure data. Finally, we present a framework to estimate a heterogeneous DLM in the presence of a potentially high-dimensional set of modifying variables. We present simulation studies to validate the models, and we apply these methods to estimate the association between ambient pollution exposures and birth weight for a Colorado, USA birth cohort.

#### Causality and clustering in complex settings

Open Access (Colorado State University. Libraries, 2023)
Gibbs, Connor P., author; Keller, Kayleigh, advisor; Fosdick, Bailey, advisor; Koslovsky, Matthew, committee member; Kaplan, Andee, committee member; Anderson, Brooke, committee member

Causality and clustering are at the forefront of many problems in statistics. In this dissertation, we present new methods and approaches for drawing causal inference with temporally dependent units and for clustering nodes in heterogeneous networks. To begin, we investigate the causal effect of a timeout at stopping an opposing team's run in the National Basketball Association (NBA). After formalizing the notion of a run in the NBA, and in light of the temporal dependence among runs, we define the units under study with careful consideration of the stable unit-treatment-value assumption pertinent to the Rubin causal model. After introducing a novel, interpretable outcome based on the score difference, we conclude that while comebacks frequently occur after a run, it is slightly disadvantageous to call a timeout during a run by the opposing team. Further, we demonstrate that the magnitude of this effect varies by franchise, lending clarity to an oft-debated topic among sports fans.
Next, we represent the known relationships among and between genetic variants and phenotypic abnormalities as a heterogeneous network and introduce a novel analytic pipeline to identify clusters containing undiscovered gene-to-phenotype relations (ICCUR) from the network. ICCUR identifies, scores, and ranks small heterogeneous clusters according to their potential for future discovery in a large temporal biological network. We train an ensemble model of boosted regression trees to predict clusters' potential for future discovery using observable cluster features, and show that the resulting clusters contain significantly more undiscovered gene-to-phenotype relations than expected by chance. To demonstrate its use as a diagnostic aid, we apply the results of the ICCUR pipeline to real, undiagnosed patients with rare diseases, identifying clusters containing patients' co-occurring yet otherwise unconnected genotypic and phenotypic information, some connections of which have since been validated by human curation. Motivated by ICCUR and its application, we introduce a novel method called ECoHeN (pronounced "eco-hen") to extract communities from heterogeneous networks in a statistically meaningful way. Using a heterogeneous configuration model as a reference distribution, ECoHeN identifies communities that are significantly more densely connected than expected given the node types and connectivity of their membership, without imposing constraints on the type composition of the extracted communities. The ECoHeN algorithm identifies communities one at a time through a dynamic set of iterative updating rules and is guaranteed to converge. To our knowledge this is the first discovery method that distinguishes and identifies both homogeneous and heterogeneous, possibly overlapping, community structure in a network.
We demonstrate the performance of ECoHeN through simulation and in application to a political blogs network, identifying collections of blogs that reference one another more than expected given the ideology of their members. Along with small partisan communities, we demonstrate ECoHeN's ability to identify a large, bipartisan community that is undetectable by canonical community detection methods and denser than those found by modern, competing methods.

Item Open Access Change-Point estimation using shape-restricted regression splines (Colorado State University. Libraries, 2016) Liao, Xiyue, author; Meyer, Mary C., advisor; Breidt, F. Jay, committee member; Homrighausen, Darren, committee member; Belfiori, Elisa, committee member

Change-point estimation is needed in fields such as climate change, signal processing, economics, and dose-response analysis, but it has not yet been fully explored. We consider estimating a regression function ƒm and a change-point m, where m is a mode, an inflection point, or a jump point. Linear inequality constraints are used with spline regression functions to estimate m and ƒm simultaneously using profile methods. For a given m, the maximum-likelihood estimate of ƒm is found using constrained regression methods; then the set of possible change-points is searched to find the m̂ that maximizes the likelihood. Convergence rates are obtained for each type of change-point estimator, and we show an oracle property: the convergence rate of the regression function estimator is as if m were known. Parametrically modeled covariates are easily incorporated in the model. Simulations show that for small and moderate sample sizes, these methods compare well to existing methods. The scenario in which the random error comes from a stationary autoregressive process is also presented.
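The profile method described above can be illustrated in miniature for a jump point: fit the constrained regression for each candidate change-point, then keep the candidate with the highest likelihood. The sketch below uses two constant segments as a deliberately simplified stand-in for the dissertation's constrained spline fits; under Gaussian errors, maximizing the likelihood is equivalent to minimizing the residual sum of squares.

```python
def rss(segment):
    """Residual sum of squares of a segment around its own mean."""
    mu = sum(segment) / len(segment)
    return sum((y - mu) ** 2 for y in segment)

def profile_jump_point(y):
    """Return the split index m minimizing the two-segment RSS.

    For each candidate m, the 'fit' is the pair of segment means of
    y[:m] and y[m:]; the profile step keeps the m with the smallest
    total RSS, i.e. the largest Gaussian likelihood.
    """
    best_m, best_rss = None, float("inf")
    for m in range(1, len(y)):          # both segments must be nonempty
        total = rss(y[:m]) + rss(y[m:])
        if total < best_rss:
            best_m, best_rss = m, total
    return best_m
```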
Under such a scenario, the change-point and the parameters of the stationary autoregressive process, such as the autoregressive coefficients and the model variance, are estimated together via Cochrane-Orcutt-type iterations. Simulations are conducted, and it is shown that the change-point estimator performs well in terms of choosing the right order of the autoregressive process. Penalized spline-based regression is also discussed as an extension. Given a large number of knots and a penalty parameter that controls the effective degrees of freedom of a shape-restricted model, penalized methods give smoother fits while balancing under- and over-fitting. A bootstrap confidence interval for a change-point is established. By generating random change-points from a curve on the unit interval, we compute the coverage rate of the bootstrap confidence interval using penalized estimators, which shows advantages, such as robustness, over competitors. The methods are available in the R package ShapeChange on the Comprehensive R Archive Network (CRAN). Moreover, we discuss the shape selection problem, in which more than one shape is possible for a given data set. A project with Forest Inventory & Analysis (FIA) scientists is included as an example. In this project, we apply shape-restricted spline-based estimators, among which the one-jump and double-jump estimators are emphasized, to time-series Landsat imagery for the purpose of modeling, mapping, and monitoring annual forest disturbance dynamics. For each pixel and spectral band or index of choice in temporal Landsat data, our method delivers a smoothed rendition of the trajectory constrained to behave in an ecologically sensible manner, reflecting one of seven possible "shapes". Routines to realize the methodology are built into the R package ShapeSelectForest on CRAN, and techniques in this package are being applied for forest disturbance and attribute mapping across the conterminous U.S.
The Landsat community will implement techniques in this package on the Google Earth Engine in 2016. Finally, we consider change-point estimation with generalized linear models. Such work can be applied to dose-response analysis, where the effect of a drug increases as the dose increases to a saturation point, after which the effect starts decreasing.

Item Open Access Constrained spline regression and hypothesis tests in the presence of correlation (Colorado State University. Libraries, 2013) Wang, Huan, author; Meyer, Mary C., advisor; Opsomer, Jean D., advisor; Breidt, F. Jay, committee member; Reich, Robin M., committee member

Extracting the trend from a pattern of observations is always difficult, especially when the trend is obscured by correlated errors. Often, prior knowledge of the trend does not include a parametric family; instead, the valid assumptions are vague, such as "smooth" or "monotone increasing." Incorrectly specifying the trend as some simple parametric form can lead to overestimation of the correlation, and conversely, misspecifying or ignoring the correlation leads to erroneous inference for the trend. In this dissertation, we explore spline regression with shape constraints, such as monotonicity or convexity, for estimation and inference in the presence of stationary AR(p) errors. Standard criteria for selection of the penalty parameter, such as the Akaike information criterion (AIC), cross-validation, and generalized cross-validation, have been shown to behave badly when the errors are correlated, even in the absence of shape constraints. In this dissertation, the correlation structure and penalty parameter are selected simultaneously using a correlation-adjusted AIC. The asymptotic properties of unpenalized spline regression in the presence of correlation are investigated.
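As a minimal, self-contained example of the kind of shape-constrained estimation discussed here, the pool-adjacent-violators algorithm (PAVA) computes a least-squares fit under a monotone-increasing constraint; the dissertation's shape-restricted spline estimators generalize this idea to smooth fits and other shapes.

```python
def pava(y):
    """Least-squares monotone (non-decreasing) fit to y via PAVA.

    Each block holds a (sum, count) pair; whenever two adjacent blocks
    violate the monotonicity constraint, they are pooled and replaced
    by their common mean.
    """
    blocks = []
    for v in y:
        blocks.append([v, 1])
        # Merge backwards while the previous block's mean exceeds this one's.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fit = []
    for s, c in blocks:
        fit.extend([s / c] * c)
    return fit
```

For instance, the sequence [1, 3, 2, 4] violates monotonicity at its middle pair, which PAVA pools to a common mean of 2.5.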
It is proved that even if the estimation of the correlation is inconsistent, the corresponding projection estimator of the regression function can still be consistent and have the optimal asymptotic rate, under appropriate conditions. The constrained spline fit attains the convergence rate of the unconstrained spline fit in the presence of AR(p) errors. Simulation results show that the constrained estimator typically behaves better than the unconstrained version if the true trend satisfies the constraints. Traditional statistical tests for the significance of a trend rely on restrictive assumptions on the functional form of the relationship, e.g. linearity. In this dissertation, we develop testing procedures that incorporate shape restrictions on the trend and can account for correlated errors. These tests can be used to check whether the trend is constant versus monotone, linear versus convex/concave, and any combinations thereof, such as constant versus increasing and convex. The proposed likelihood ratio test statistics have an exact null distribution if the covariance matrix of the errors is known. Theorems are developed for the asymptotic distributions of the test statistics when the covariance matrix is unknown but the test statistics incorporate a consistent estimator of the correlation. The proposed test is compared with the F-test based on the unconstrained alternative fit and with the one-sided t-test based on the simple regression alternative fit through intensive simulations. Both the size and the power of the proposed test are favorable: in general, it has smaller size and greater power than the F-test and the t-test.

Item Open Access Estimation and linear prediction for regression, autoregression and ARMA with infinite variance data (Colorado State University. Libraries, 1983) Cline, Daren B.
H., author; Resnick, Sidney I., advisor; Brockwell, Peter J., advisor; Locker, John, committee member; Davis, Richard A., committee member; Boes, Duane C., committee member

This dissertation is divided into four parts, each of which considers random variables from distributions with regularly varying tails and/or in a stable domain of attraction. Part I considers the existence of infinite series of an independent sequence of such random variables and the relationship of the probability of large values of the series to the probability of large values of the first component. Part II applies Part I in order to provide a linear predictor for ARMA time series (again with regularly varying tails). This predictor is designed to minimize the probability of large prediction errors relative to the tails of the noise distribution. Part III investigates the products of independent random variables, where one has a distribution in a stable domain of attraction, and gives conditions under which the product distribution is in a stable domain of attraction. Part IV considers estimation of the regression parameter in a model where the independent variables are in a stable domain of attraction. Consistency for certain M-estimators is proved. Utilizing portions of Part III, this final part gives necessary and sufficient conditions for consistency of least squares estimators and provides the asymptotic distribution of the least squares estimators.

Item Open Access Habitat estimation through synthesis of species presence/absence information and environmental covariate data (Colorado State University. Libraries, 2011) Dornan, Grant J., author; Givens, Geof H., advisor; Hoeting, Jennifer A., committee member; Chapman, Phillip L., committee member; Myrick, Christopher A., committee member

This paper investigates the statistical model developed by Foster et al.
(2011) to estimate marine habitat maps based on environmental covariate data and species presence/absence information while treating habitat definition probabilistically. The model assumes that two sites belonging to the same habitat have approximately the same species presence probabilities, and thus both environmental data and species presence observations can help to distinguish habitats at locations across a study region. I develop a computational method to estimate the model parameters by maximum likelihood using a blocked non-linear Gauss-Seidel algorithm. The main part of my work is developing and conducting simulation studies to evaluate estimation performance and to study related questions, including the impacts of sample size, model bias, and model misspecification. Seven testing scenarios are developed, including between 3 and 9 habitats, 15 and 40 species, and 150 and 400 sampling sites. Estimation performance is primarily evaluated through fitted habitat maps and is shown to be excellent for the seven example scenarios examined. Rates of successful habitat classification ranged from 0.92 to 0.98. I show that there is a roughly balanced tradeoff between increasing the number of sites and increasing the number of species for improving estimation performance. Standard model selection techniques are shown to work for selection of covariates, but selection of the number of habitats benefits from supplementing quantitative techniques with qualitative expert judgement. Although estimation of habitat boundaries is extremely good, the rate of probabilistic transition between habitats is shown to be difficult to estimate accurately. Future research should address this issue. An appendix to this thesis includes a comprehensive and annotated collection of R code developed during this project.

Item Open Access Heavy tail analysis for functional and internet anomaly data (Colorado State University.
Libraries, 2021) Kim, Mihyun, author; Kokoszka, Piotr, advisor; Cooley, Daniel, committee member; Meyer, Mary, committee member; Pinaud, Olivier, committee member

This dissertation is concerned with the asymptotic theory of statistical tools used in extreme value analysis of functional data and internet anomaly data. More specifically, we study four problems associated with analyzing the tail behavior of functional principal component scores in functional data and interarrival times of internet traffic anomalies, which are available only with a round-off error. The first problem we consider is the estimation of the tail index of scores in functional data. We employ the Hill estimator for the tail index estimation and derive conditions under which the Hill estimator computed from the sample scores is consistent for the tail index of the unobservable population scores. The second problem studies the dependence between extremal values of functional scores using the extremal dependence measure (EDM). After extending the EDM, defined for positive bivariate observations, to multivariate observations, we study conditions guaranteeing that a suitable estimator of the EDM based on these scores converges to the population EDM and is asymptotically normal. The third and fourth problems investigate the asymptotic and finite sample behavior of the Hill estimator applied to heavy-tailed data contaminated by errors. For the third, we show that for time series models often used in practice, whose non-contaminated marginal distributions are regularly varying, the Hill estimator is consistent. For the fourth, we formulate conditions on the errors under which the Hill and Harmonic Moment estimators applied to i.i.d. data continue to be asymptotically normal.
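For reference, the Hill estimator studied throughout this dissertation has the textbook form H_{k,n} = (1/k) Σ_{i=1}^{k} log(X_{(n-i+1)} / X_{(n-k)}), an estimate of 1/α based on the k largest order statistics of a positive sample. A direct sketch (the dissertation's contribution is the asymptotic theory for scores and contaminated data, not this computation itself):

```python
import math

def hill_estimator(x, k):
    """Hill estimate of 1/alpha from the k upper order statistics of x.

    Requires positive observations and 1 <= k < len(x).
    """
    xs = sorted(x, reverse=True)   # xs[0] is the sample maximum X_(n)
    threshold = xs[k]              # the (k+1)-th largest value, X_(n-k)
    return sum(math.log(xs[i] / threshold) for i in range(k)) / k
```

In practice the estimate is computed over a range of k values (a "Hill plot") and a stable region is sought.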
The results of the large-sample and finite-sample investigations are applied to internet anomaly data.

Item Open Access Impact of actual and self-perceived body type on visual perception of distances (Colorado State University. Libraries, 2015) Branan, Matthew, author; Turk, Phil, advisor; Witt, Jessica, committee member; Hess, Ann, committee member

We investigate several questions regarding the proposition that physical body size and one's self-perceived body type affect the ability to make accurate judgements of distances. Data collected include subjects' guesses of the distances of four cones set 10, 15, 20, and 25 meters away, along with the weight, BMI, and self-perception of body image for each of 67 subjects. Interest lies in determining the covariates that are most important in explaining one's ability to accurately judge distances, and whether weight or BMI is the better explainer among the physical body size predictors. We utilize linear mixed models to account for correlation among each subject's own distance guesses and to allow for flexible modeling of subject-specific effects. Flexibility is further promoted through the use of model averaging techniques to account for the model selection uncertainty inherent in typical approaches, in which an analyst selects only one model from which inferences are made. A generalization of the coefficient of determination from ordinary linear models is made to the linear mixed model setting (R²LMM) in order to provide an additional goodness-of-fit measure for the fixed effects as a group and for individual fixed effects. Baseline differences among subjects' abilities to accurately judge distances are so vast that extracting the importance of the fixed effects becomes difficult. It is found that body size is a significant predictor of subjects' ability to accurately judge distances, but body image is not, at the 0.05 significance level.
We recommend choosing weight over BMI as a predictor of guessing behavior based on information criteria, model averaging, and the generalized R²LMM. Specifically, heavier individuals tend to guess more accurately.

Item Open Access Improved estimation and prediction for computationally expensive ecological and paleoclimate models (Colorado State University. Libraries, 2016) Tipton, John, author; Hooten, Mevin, advisor; Opsomer, Jean, advisor; Hoeting, Jennifer, committee member; Aldridge, Cameron, committee member

In this dissertation, we present statistical methods to evaluate estimation and prediction performance for applied ecological problems. We explore a variety of applied problems and, within this context, we investigate how each method performs. We evaluate the empirical performance of a model-based estimator of mean percent canopy cover using a representative United States Forest Service Forest Inventory and Analysis dataset. For two paleoclimate reconstructions, we develop novel modeling methodologies and evaluate model performance using both resampling and simulation methods. In each application, we use proper scoring rules while leveraging parallel computing and computational techniques that allow fitting of complex models in a finite amount of time.

Item Open Access Improved estimation for complex surveys using modern regression techniques (Colorado State University. Libraries, 2011) McConville, Kelly, author; Breidt, F. Jay, advisor; Lee, Thomas C. M., advisor; Opsomer, Jean, committee member; Lee, Myung-Hee, committee member; Doherty, Paul F., committee member

In the field of survey statistics, finite population quantities are often estimated based on complex survey data. In this thesis, estimation of the finite population total of a study variable is considered. The study variable is available for the sample and is supplemented by auxiliary information, which is available for every element in the finite population.
Following a model-assisted framework, estimators are constructed that exploit the relationship that may exist between the study variable and the ancillary data. These estimators have good design properties regardless of model accuracy. Nonparametric survey regression estimation is applicable in natural resource surveys, where the relationship between the auxiliary information and the study variable is complex and of an unknown form. Breidt, Claeskens, and Opsomer (2005) proposed a penalized spline survey regression estimator and studied its properties when the number of knots is fixed. To build on their work, the asymptotic properties of the penalized spline regression estimator are considered when the number of knots goes to infinity and the locations of the knots are allowed to change. The estimator is shown to be design consistent and asymptotically design unbiased. In the course of the proof, a result is established on the uniform convergence in probability of the survey-weighted quantile estimators. This result is obtained by deriving a survey-weighted Hoeffding inequality for bounded random variables. A variance estimator is proposed and shown to be design consistent for the asymptotic mean squared error. Simulation results demonstrate the usefulness of the asymptotic approximations. Also in natural resource surveys, a substantial amount of auxiliary information, typically derived from remotely sensed imagery and organized in the form of spatial layers in a geographic information system (GIS), is available. Some of this ancillary data may be extraneous, and a sparse model would be appropriate. Model selection methods are therefore warranted. The 'least absolute shrinkage and selection operator' (lasso), presented by Tibshirani (1996), conducts model selection and parameter estimation simultaneously by penalizing the sum of the absolute values of the model coefficients.
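The model-assisted logic described above (predict the study variable for every population element from the auxiliary data, then correct with design-weighted sample residuals) can be sketched as a generic difference estimator. Names are illustrative assumptions; the dissertation's estimators use penalized spline and lasso working models in place of the generic predictions here.

```python
def difference_estimator(pop_predictions, sample_y, sample_predictions, inclusion_probs):
    """Model-assisted estimate of the finite population total of y.

    T_hat = sum over the population of m_hat(x_i)
          + sum over the sample of (y_i - m_hat(x_i)) / pi_i,
    where m_hat is any working model and pi_i the inclusion probability.
    The design-weighted residual term protects against model misspecification.
    """
    synthetic = sum(pop_predictions)
    correction = sum((y - m) / pi
                     for y, m, pi in zip(sample_y, sample_predictions, inclusion_probs))
    return synthetic + correction
```

If the working model predicts perfectly on the sample, the correction vanishes and the estimator reduces to the synthetic total; a poor model is still corrected toward design unbiasedness by the residual term.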
A survey-weighted lasso criterion, which accounts for the sampling design, is derived, and a survey-weighted lasso estimator is presented. The root-n design consistency of the estimator and a central limit theorem result are proved. Several variants of the survey-weighted lasso estimator are constructed. In particular, a calibration estimator and a ridge regression approximation estimator are constructed to produce lasso weights that can be applied to several study variables. Simulation studies show the lasso estimators are more efficient than the regression estimator when the true model is sparse. The lasso estimators are used to estimate the proportion of tree canopy cover for a region of Utah. Under a joint design-model framework, the survey-weighted lasso coefficients are shown to be root-N consistent for the parameters of the superpopulation model, and a central limit theorem result is established. The methodology is applied to estimate the risk factors for the Zika virus from an epidemiological survey on the island of Yap. A logistic survey-weighted lasso regression model is fit to the data, and important covariates are identified.
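A plausible sketch of the kind of survey-weighted lasso criterion described above (not the dissertation's exact derivation): weight each squared residual by the reciprocal of its inclusion probability, so the sample criterion estimates its population-level counterpart, and add an L1 penalty on the coefficients. All names are illustrative.

```python
def weighted_lasso_loss(beta, rows, y, weights, lam):
    """Design-weighted squared-error loss plus an L1 penalty on beta.

    rows is a list of covariate vectors, y the study variable on the
    sample, weights the design weights (1 / pi_i), and lam the lasso
    penalty parameter. Minimizing this over beta would give a
    survey-weighted lasso fit.
    """
    loss = 0.0
    for x_i, y_i, w_i in zip(rows, y, weights):
        resid = y_i - sum(b * x for b, x in zip(beta, x_i))
        loss += w_i * resid ** 2
    return loss + lam * sum(abs(b) for b in beta)
```

As lam grows, more coefficients are driven exactly to zero, which is what makes the criterion perform model selection and estimation simultaneously.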