Theses and Dissertations
Browsing Theses and Dissertations by Title
Now showing 1 - 20 of 89
Item (Open Access): A fiducial approach to extremes and multiple comparisons (Colorado State University. Libraries, 2010)
Wandler, Damian V., author; Hannig, Jan, advisor; Iyer, Hariharan K., advisor; Chong, Edwin Kah Pin, committee member; Wang, Haonan, committee member
Generalized fiducial inference is a powerful tool for many difficult problems. Based on an extension of R. A. Fisher's work, we used generalized fiducial inference for two extreme value problems and a multiple comparison procedure. The first extreme value problem deals with the generalized Pareto distribution, which is relevant to many situations when modeling extremes of random variables. We use a fiducial framework to perform inference on the parameters and the extreme quantiles of the generalized Pareto. This inference technique is demonstrated both when the threshold is a known parameter and when it is unknown. Simulation results suggest good empirical properties and compare favorably to similar Bayesian and frequentist methods. The second extreme value problem pertains to the largest mean of a multivariate normal distribution. Difficulties arise when two or more of the means are simultaneously the largest mean. Our solution uses a generalized fiducial distribution and allows for equal largest means to alleviate the overestimation that commonly occurs. Theoretical calculations, simulation results, and an application suggest our solution possesses promising asymptotic and empirical properties. Our solution to the largest mean problem arose from our ability to identify the correct largest mean(s). This essentially became a model selection problem. As a result, we applied a similar model selection approach to the multiple comparison problem. We allowed for all possible groupings (of equality) of the means of k independent normal distributions. Our resulting fiducial probability for the groupings of the means demonstrates the effectiveness of our method by selecting the correct grouping at a high rate.
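
As a rough, self-contained illustration of the threshold-exceedance setting described in this abstract, the sketch below fits a generalized Pareto distribution to simulated exceedances by maximum likelihood and reads off a high quantile. It uses plain scipy; it is not the fiducial procedure developed in the dissertation, and the data and threshold are placeholders.

```python
# Sketch: ML fit of a generalized Pareto to threshold exceedances (not the
# fiducial procedure described above; data and threshold are simulated).
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(0)
x = rng.standard_t(df=4, size=5000)      # heavy-tailed toy sample
u = np.quantile(x, 0.95)                 # fixed, known threshold
exceed = x[x > u] - u                    # exceedances over the threshold

# Fit GPD(shape, scale) to the exceedances with the location pinned at 0.
shape, loc, scale = genpareto.fit(exceed, floc=0)

# Extreme quantile of the original variable, e.g. the 0.999 quantile:
# P(X > u) is estimated by the empirical exceedance rate.
p_exceed = np.mean(x > u)
q999 = u + genpareto.ppf(1 - 0.001 / p_exceed, shape, loc=0, scale=scale)
print(shape, scale, q999)
```
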
Item (Open Access): A novel approach to statistical problems without identifiability (Colorado State University. Libraries, 2024)
Adams, Addison D., author; Wang, Haonan, advisor; Zhou, Tianjian, advisor; Kokoszka, Piotr, committee member; Shaby, Ben, committee member; Ray, Indrakshi, committee member
In this dissertation, we propose novel approaches to random coefficient regression (RCR) and the recovery of mixing distributions under nonidentifiable scenarios. The RCR model is an extension of the classical linear regression model that accounts for individual variation by treating the regression coefficients as random variables. A major interest lies in the estimation of the joint probability distribution of these random coefficients based on the observable samples of the outcome variable evaluated for different values of the explanatory variables. In Chapter 2, we consider fixed-design RCR models, under which the coefficient distribution is not identifiable. To tackle the challenges of nonidentifiability, we consider an equivalence class, in which each element is a plausible coefficient distribution that, for each value of the explanatory variables, yields the same distribution for the outcome variable. In particular, we formulate the approximations of the coefficient distributions as a collection of stochastic inverse problems, allowing for a more flexible nonparametric approach with minimal assumptions. An iterative approach is proposed to approximate the elements by incorporating an initial guess of a solution called the global ansatz. We further study its convergence and demonstrate its performance through simulation studies. The proposed approach is applied to a real data set from an acupuncture clinical trial. In Chapter 3, we consider the problem of recovering a mixing distribution, given a component distribution family and observations from a compound distribution. Most existing methods are restricted in scope in that they are developed for certain component distribution families or continuity structures of mixing distributions. We propose a new, flexible nonparametric approach with minimal assumptions. Our proposed method iteratively steps closer to the desired mixing distribution, starting from a user-specified distribution, and we further establish its convergence properties. Simulation studies are conducted to examine the performance of our proposed method. In addition, we demonstrate the utility of our proposed method through its application to two sets of real-world data, including prostate cancer data and Shakespeare's canon word count.

Item (Open Access): A penalized estimation procedure for varying coefficient models (Colorado State University. Libraries, 2015)
Tu, Yan, author; Wang, Haonan, advisor; Breidt, F. Jay, committee member; Chapman, Phillip, committee member; Luo, J. Rockey, committee member
Varying coefficient models are widely used for analyzing longitudinal data. Various methods for estimating coefficient functions have been developed over the years. We revisit the problem under the theme of functional sparsity. The problem of sparsity, including global sparsity and local sparsity, is a recurrent topic in nonparametric function estimation. A function has global sparsity if it is zero over the entire domain, and it indicates that the corresponding covariate is irrelevant to the response variable. A function has local sparsity if it is nonzero but remains zero for a set of intervals, and it identifies an inactive period of the corresponding covariate. Each type of sparsity has been addressed in the literature using the idea of regularization to improve estimation as well as interpretability. In this dissertation, a penalized estimation procedure has been developed to achieve functional sparsity, that is, simultaneously addressing both types of sparsity in a unified framework. We exploit the property of B-spline approximation and group bridge penalization. Our method is illustrated in a simulation study and a real data analysis, and outperforms the existing methods in identifying both local sparsity and global sparsity. Asymptotic properties of estimation consistency and sparsistency of the proposed method are established. The term sparsistency refers to the property that the functional sparsity can be consistently detected.
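
The varying coefficient abstract above rests on a B-spline approximation of each coefficient function. As a minimal sketch (one time-varying covariate, ordinary least squares, the group bridge penalty omitted), the code below builds a B-spline design matrix and recovers a coefficient function; all knot choices are illustrative.

```python
# Sketch: unpenalized B-spline estimate of one varying coefficient beta(t)
# in y(t) = x(t) * beta(t) + noise (the group bridge penalty is omitted).
import numpy as np
from scipy.interpolate import BSpline

def bspline_design(t, knots, degree=3):
    """Evaluate every B-spline basis function at the points t."""
    n_basis = len(knots) - degree - 1
    return np.column_stack(
        [BSpline(knots, np.eye(n_basis)[j], degree)(t) for j in range(n_basis)]
    )

rng = np.random.default_rng(1)
t = np.sort(rng.uniform(0, 1, 400))
x = rng.normal(size=t.size)
beta_true = np.sin(2 * np.pi * t)                  # true coefficient function
y = x * beta_true + rng.normal(scale=0.3, size=t.size)

interior = np.linspace(0, 1, 9)[1:-1]              # 7 interior knots
knots = np.concatenate((np.zeros(4), interior, np.ones(4)))
B = bspline_design(t, knots)                       # basis evaluated at t
X = x[:, None] * B                                 # covariate times basis
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_hat = B @ coef                                # estimated beta(t)
```
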
Item (Open Access): Adjusting for capture, recapture, and identity uncertainty when estimating detection probability from capture-recapture surveys (Colorado State University. Libraries, 2015)
Edmondson, Stacy L., author; Givens, Geof, advisor; Opsomer, Jean, committee member; Kokoszka, Piotr, committee member; Noon, Barry, committee member
When applying capture-recapture analysis methods, estimates of detection probability, and hence abundance estimates, can be biased if individuals of a population are not correctly identified (Creel et al., 2003). My research, motivated by the 2010 and 2011 surveys of Western Arctic bowhead whales conducted off the shores of Barrow, Alaska, offers two methods for addressing the complex scenario where an individual may be mistaken for another individual from that population, thus creating erroneous recaptures. The first method uses a likelihood-weighted capture-recapture method to account for three sources of uncertainty in the matching process. I illustrate this approach with a detailed application to the whale data. The second method develops an explicit model for match errors and uses MCMC methods to estimate model parameters. Implementation of this approach must overcome significant hurdles dealing with the enormous number and complexity of potential catch history configurations when matches are uncertain. The performance of this approach is evaluated using a large set of Monte Carlo simulation tests. Results of these tests vary from good performance to weak performance, depending on factors including detection probability, number of sightings, and error rates. Finally, this model is applied to a portion of the bowhead survey data and found to produce plausible and scientifically informative results as long as the MCMC algorithm is started at a reasonable point in the space of possible catch history configurations.
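
For orientation only, the quantities at stake in a capture-recapture survey (detection probability and abundance) can be illustrated with the classical two-occasion Lincoln-Petersen estimator. The sketch below assumes every individual is identified correctly, which is exactly the assumption the dissertation relaxes; the counts are made up.

```python
# Sketch: two-occasion Lincoln-Petersen abundance estimate, assuming every
# individual is identified correctly (the assumption relaxed above).
n1 = 120      # individuals captured (photographed) on occasion 1
n2 = 150      # individuals captured on occasion 2
m2 = 30       # individuals seen on both occasions (recaptures)

# Chapman's bias-corrected version of the Lincoln-Petersen estimator.
N_hat = (n1 + 1) * (n2 + 1) / (m2 + 1) - 1
p1_hat = m2 / n2          # crude estimate of occasion-1 detection probability
print(round(N_hat), round(p1_hat, 3))
```
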
Item (Open Access): Advances in Bayesian spatial statistics for ecology and environmental science (Colorado State University. Libraries, 2024)
Wright, Wilson J., author; Hooten, Mevin B., advisor; Cooley, Daniel S., advisor; Keller, Kayleigh P., committee member; Kaplan, Andee, committee member; Ross, Matthew R. V., committee member
In this dissertation, I develop new Bayesian methods for analyzing spatial data from applications in ecology and environmental science. In particular, I focus on methods for mechanistic spatial models and binary spatial processes. I first consider the distribution of heavy metal pollution from a mining road in Cape Krusenstern, Alaska, USA. I develop a mechanistic spatial model that uses the physical process of atmospheric dispersion to characterize the spatial structure in these data. This approach directly incorporates scientific knowledge about how pollutants spread and provides inferences about this process. To assess how the heavy metal pollution impacts the vegetation community in Cape Krusenstern, I also develop a new model that represents plant cover for multiple species using clipped Gaussian processes. This approach is applicable to multiscale and multivariate binary processes that are observed at point locations, including multispecies plant cover data collected using the point intercept method. By directly analyzing the point-level data, instead of aggregating observations to the plot level, this model allows for inferences about both large-scale and small-scale spatial dependence in plant cover. Additionally, it also incorporates dependence among different species at the small spatial scale. The third model I develop is motivated by ecological studies of wildlife occupancy. Similar to plant cover, species occurrence can be modeled as a binary spatial process. However, occupancy data are inherently measured at areal survey units. I develop a continuous-space occupancy model that accounts for the change of spatial support between the occurrence process and the observed data. All of these models are implemented using Bayesian methods and I present computationally efficient methods for fitting them. This includes a new surrogate data slice sampler for implementing models with latent nearest neighbor Gaussian processes.
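
A clipped Gaussian process represents a binary surface as the thresholded value of a latent Gaussian process. The following sketch simulates one species' presence/absence at point locations along a transect under an assumed squared-exponential covariance, purely to illustrate the construction; it is not the multivariate, multiscale model of the dissertation.

```python
# Sketch: simulate a clipped Gaussian process (binary cover indicator) at
# point locations along a 1-D transect; the covariance choice is illustrative.
import numpy as np

rng = np.random.default_rng(2)
s = np.linspace(0, 10, 200)[:, None]               # point locations
d = np.abs(s - s.T)                                # pairwise distances

mu = -0.2                                          # mean of latent process
cov = np.exp(-0.5 * (d / 1.5) ** 2)                # squared-exponential cov.
z = rng.multivariate_normal(np.full(s.size, mu), cov + 1e-8 * np.eye(s.size))

cover = (z > 0).astype(int)                        # clipped (thresholded) GP
print(cover[:20], cover.mean())
```
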
Item (Open Access): Advances in statistical analysis and modeling of extreme values motivated by atmospheric models and data products (Colorado State University. Libraries, 2018)
Fix, Miranda J., author; Cooley, Daniel, advisor; Hoeting, Jennifer, committee member; Wilson, Ander, committee member; Barnes, Elizabeth, committee member
This dissertation presents applied and methodological advances in the statistical analysis and modeling of extreme values. We detail three studies motivated by the types of data found in the atmospheric sciences, such as deterministic model output and observational products. The first two investigations represent novel applications and extensions of extremes methodology to climate and atmospheric studies. The third investigation proposes a new model for areal extremes and develops methods for estimation and inference from the proposed model. We first detail a study which leverages two initial condition ensembles of a global climate model to compare future precipitation extremes under two climate change scenarios. We fit non-stationary generalized extreme value (GEV) models to annual maximum daily precipitation output and compare impacts under the RCP8.5 and RCP4.5 scenarios. A methodological contribution of this work is to demonstrate the potential of a "pattern scaling" approach for extremes, in which we produce predictive GEV distributions of annual precipitation maxima under RCP4.5 given only global mean temperatures for this scenario. We compare results from this less computationally intensive method to those obtained from our GEV model fitted directly to the RCP4.5 output and find that pattern scaling produces reasonable projections. The second study examines, for the first time, the capability of an atmospheric chemistry model to reproduce observed meteorological sensitivities of high and extreme surface ozone (O3). This work develops a novel framework in which we make three types of comparisons between simulated and observational data, comparing (1) tails of the O3 response variable, (2) distributions of meteorological predictor variables, and (3) sensitivities of high and extreme O3 to meteorological predictors. This last comparison is made using quantile regression and a recent tail dependence optimization approach. Across all three study locations, we find substantial differences between simulations and observational data in both meteorology and meteorological sensitivities of high and extreme O3. The final study is motivated by the prevalence of large gridded data products in the atmospheric sciences, and presents methodological advances in the (finite-dimensional) spatial setting. Existing models for spatial extremes, such as max-stable process models, tend to be geostatistical in nature as well as very computationally intensive. Instead, we propose a new model for extremes of areal data, with a common-scale extension, that is inspired by the simultaneous autoregressive (SAR) model in classical spatial statistics. The proposed model extends recent work on transformed-linear operations applied to regularly varying random vectors, and is unique among extremes models in being directly analogous to a classical linear model. We specify a sufficient condition on the spatial dependence parameter such that our extreme SAR model has desirable properties. We also describe the limiting angular measure, which is discrete, and corresponding tail pairwise dependence matrix (TPDM) for the model. After examining model properties, we then investigate two approaches to estimation and inference for the common-scale extreme SAR model. First, we consider a censored likelihood approach, implemented using Bayesian MCMC with a data augmentation step, but find that this approach is not robust to model misspecification. As an alternative, we develop a novel estimation method that minimizes the discrepancy between the TPDM for the fitted model and the estimated TPDM, and find that it is able to produce reasonable estimates of extremal dependence even in the case of model misspecification.
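
As a minimal illustration of the GEV analysis of annual maxima summarized above (reduced here to a stationary fit), the sketch below fits a GEV distribution to simulated annual maxima of daily precipitation and computes a 20-year return level. The data are placeholders, and note that scipy's shape parameter has the opposite sign of the usual GEV shape xi.

```python
# Sketch: stationary GEV fit to annual precipitation maxima and a 20-year
# return level (the dissertation's models are non-stationary; data are fake).
import numpy as np
from scipy.stats import genextreme

rng = np.random.default_rng(3)
daily = rng.gamma(shape=0.8, scale=6.0, size=(50, 365))   # 50 years of "rain"
annual_max = daily.max(axis=1)

# scipy's shape parameter c corresponds to -xi in the usual GEV notation.
c, loc, scale = genextreme.fit(annual_max)

# 20-year return level: the level exceeded with probability 1/20 each year.
rl20 = genextreme.isf(1 / 20, c, loc=loc, scale=scale)
print(c, loc, scale, rl20)
```
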
Item (Open Access): Analysis of structured data and big data with application to neuroscience (Colorado State University. Libraries, 2015)
Sienkiewicz, Ela, author; Wang, Haonan, advisor; Meyer, Mary, committee member; Breidt, F. Jay, committee member; Hayne, Stephen, committee member
Neuroscience research leads to a remarkable set of statistical challenges, many of them due to the complexity of the brain, its intricate structure and dynamical, non-linear, often non-stationary behavior. The challenge of modeling brain functions is magnified by the quantity and inhomogeneity of data produced by scientific studies. Here we show how to take advantage of advances in distributed and parallel computing to mitigate memory and processor constraints and develop models of neural components and neural dynamics. First we consider the problem of function estimation and selection in time-series functional dynamical models. Our motivating application is on the point-process spiking activities recorded from the brain, which poses major computational challenges for modeling even moderately complex brain functionality. We present a big data approach to the identification of sparse nonlinear dynamical systems using generalized Volterra kernels and their approximation using B-spline basis functions. The performance of the proposed method is demonstrated in experimental studies. We also consider a set of unlabeled tree objects with topological and geometric properties. For each data object, two curve representations are developed to characterize its topological and geometric aspects. We further define the notions of topological and geometric medians as well as quantiles based on both representations. In addition, we take a novel approach to define the Pareto medians and quantiles through a multi-objective optimization problem. In particular, we study two different objective functions which measure the topological variation and geometric variation respectively. Analytical solutions are provided for topological and geometric medians and quantiles, and in general, for Pareto medians and quantiles the genetic algorithm is implemented. The proposed methods are applied to analyze a data set of pyramidal neurons.

Item (Open Access): Application of statistical and deep learning methods to power grids (Colorado State University. Libraries, 2023)
Rimkus, Mantautas, author; Kokoszka, Piotr, advisor; Wang, Haonan, advisor; Nielsen, Aaron, committee member; Cooley, Dan, committee member; Chen, Haonan, committee member
The structure of power flows in transmission grids is evolving and is likely to change significantly in the coming years due to the rapid growth of renewable energy generation that introduces randomness and bidirectional power flows. Another transformative aspect is the increasing penetration of various smart-meter technologies. Inexpensive measurement devices can be placed at practically any component of the grid. As a result, traditional fault detection methods may no longer be sufficient. Consequently, there is a growing interest in developing new methods to detect power grid faults. Using model data, we first propose a two-stage procedure for detecting a fault in a regional power grid. In the first stage, a fault is detected in real time. In the second stage, the faulted line is identified with a negligible delay. The approach uses only the voltage modulus measured at buses (nodes of the grid) as the input. Our method does not require prior knowledge of the fault type. We further explore fault detection based on high-frequency data streams that are becoming available in modern power grids. Our approach can be treated as an online (sequential) change point monitoring methodology. However, due to the mostly unexplored and very nonstandard structure of high-frequency power grid streaming data, substantial new statistical development is required to make this methodology practically applicable. The work includes development of scalar detectors based on multichannel data streams, determination of data-driven alarm thresholds and investigation of the performance and robustness of the new tools. Due to a reasonably large database of faults, we can calculate frequencies of false and correct fault signals, and recommend implementations that optimize these empirical success rates. Next, we extend our proposed method for fault localization in a regional grid for scenarios where partial observability limits the available data. While classification methods have been proposed for fault localization, their effectiveness depends on the availability of labeled data, which is often impractical in real-life situations. Our approach bridges the gap between partial and full observability of the power grid. We develop efficient fault localization methods that can operate effectively even when only a subset of power grid bus data is available. This work contributes to the research area of fault diagnosis in scenarios where the number of available phasor measurement unit devices is smaller than the number of buses in the grid. We propose using Graph Neural Networks in combination with statistical fault localization methods to localize faults in a regional power grid with minimal available data. Our contribution to the field of fault localization aims to enable the adoption of effective fault localization methods for future power grids.
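
The online change-point monitoring described above reduces multichannel streams to a scalar detector compared against a data-driven alarm threshold. A minimal sketch of that idea is given below: a per-channel CUSUM of standardized deviations, combined by taking the maximum, with the threshold calibrated on fault-free training data. The statistics and threshold rule are generic stand-ins, not the dissertation's detectors.

```python
# Sketch: a scalar detector for multichannel streaming data built from
# per-channel CUSUM statistics; threshold calibrated on fault-free data.
import numpy as np

def cusum_detector(stream, mean, sd, drift=0.5):
    """Return the max-over-channels CUSUM statistic at each time step."""
    pos = np.zeros(stream.shape[1])
    out = []
    for row in stream:                       # one time step per row
        z = (row - mean) / sd                # standardized deviations
        pos = np.maximum(0.0, pos + np.abs(z) - drift)
        out.append(pos.max())                # scalar detector
    return np.array(out)

rng = np.random.default_rng(4)
train = rng.normal(size=(2000, 8))           # fault-free, 8 channels
mean, sd = train.mean(axis=0), train.std(axis=0)

threshold = cusum_detector(train, mean, sd).max() * 1.1   # data-driven alarm
test = rng.normal(size=(500, 8))
test[300:, 2] += 1.5                         # a fault appears in channel 3
det = cusum_detector(test, mean, sd)
alarm_time = int(np.argmax(det > threshold)) if (det > threshold).any() else -1
print(alarm_time)
```
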
Item (Open Access): Applications of generalized fiducial inference (Colorado State University. Libraries, 2009)
E, Lidong, author; Iyer, Hariharan K., advisor
Hannig (2008) generalized Fisher's fiducial argument and obtained a fiducial recipe for interval estimation that is applicable in virtually any situation. In this dissertation research, we apply this fiducial recipe and fiducial generalized pivotal quantity to make inference in four practical problems. The list of problems we consider is (a) confidence intervals for variance components in an unbalanced two-component normal mixed linear model; (b) confidence intervals for median lethal dose (LD50) in bioassay experiments; (c) confidence intervals for the concordance correlation coefficient (CCC) in method comparison; (d) simultaneous confidence intervals for ratios of means of Lognormal distributions. For all the fiducial generalized confidence intervals (a)-(d), we conducted a simulation study to evaluate their performance and compare them with other competing confidence interval procedures from the literature. We also proved that the intervals (a) and (d) have asymptotically exact frequentist coverage.

Item (Open Access): Applications of least squares penalized spline density estimator (Colorado State University. Libraries, 2024)
Jing, Hanxiao, author; Meyer, Mary, advisor; Cooley, Daniel, committee member; Kokoszka, Piotr, committee member; Berger, Joshua, committee member
Spline-based methods are among the most common nonparametric approaches. The work in this dissertation explores three applications of the least squares penalized spline density estimator. Firstly, we present a novel hypothesis test for unimodality of density functions, based on unimodal and bimodal estimates of the density function using penalized splines. The test statistic is the difference in the least-squares criterion between these fits. The distribution of the test statistic under the null hypothesis is estimated via simulated data sets from the unimodal fit. Large sample theory is derived and simulation studies are conducted to compare its performance with other common methods across various scenarios, alongside a real-world application involving neurotransmission data from guinea pig brains. Secondly, we tackle the deconvolution density estimation problem, introducing the penalized splines deconvolution estimator. Building upon the results gained from piecewise constant splines, we achieve a cube-root convergence rate for piecewise quadratic splines and uniform errors. Moreover, we derive large sample theories for the penalized spline estimator and the constrained spline estimator. Simulation studies illustrate the competitive performance of our estimators compared to the kernel estimators across diverse scenarios. Lastly, drawing inspiration from the preceding applications, we develop a hypothesis test to discern whether the underlying density is unimodal or multimodal, given data with measurement error. Under the assumption of uniform errors, we introduce the test and derive the test statistic. Simulations are conducted to show the performance of the proposed test under different conditions.
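
The unimodality test above compares a unimodal and a bimodal fit and calibrates the difference in fit criterion by simulating from the unimodal fit. The skeleton below reproduces that logic with ordinary Gaussian mixture fits standing in for the penalized spline estimators (a deliberate simplification), using scikit-learn; the data are simulated.

```python
# Sketch: test unimodality by comparing a 1-component and a 2-component fit,
# with the null distribution simulated from the 1-component (unimodal) fit.
# Gaussian mixtures stand in for the penalized-spline density estimates.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_stat(x):
    """Improvement in log-likelihood of a 2-component over a 1-component fit."""
    X = x.reshape(-1, 1)
    g1 = GaussianMixture(n_components=1, random_state=0).fit(X)
    g2 = GaussianMixture(n_components=2, random_state=0).fit(X)
    return (g2.score(X) - g1.score(X)) * len(x), g1

rng = np.random.default_rng(5)
x = np.concatenate([rng.normal(-2, 1, 150), rng.normal(2, 1, 150)])  # bimodal

t_obs, g1 = fit_stat(x)
t_null = []
for _ in range(200):                     # parametric bootstrap under the null
    xb, _ = g1.sample(len(x))
    t_null.append(fit_stat(xb.ravel())[0])
p_value = np.mean(np.array(t_null) >= t_obs)
print(t_obs, p_value)
```
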
Item (Open Access): Bayesian methods for environmental exposures: mixtures and missing data (Colorado State University. Libraries, 2022)
Hoskovec, Lauren, author; Wilson, Ander, advisor; Magzamen, Sheryl, committee member; Hoeting, Jennifer, committee member; Cooley, Dan, committee member
Air pollution exposure has been linked to increased morbidity and mortality. Estimating the association between air pollution exposure and health outcomes is complicated by simultaneous exposure to multiple pollutants, referred to as a multipollutant mixture. In a multipollutant mixture, exposures may have both independent and interactive effects on health. In addition, observational studies of air pollution exposure often involve missing data. In this dissertation, we address challenges related to model choice and missing data when studying exposure to a mixture of environmental pollutants. First, we conduct a formal simulation study of recently developed methods for estimating the association between a health outcome and exposure to a multipollutant mixture. We evaluate methods on their performance in estimating the exposure-response function, identifying mixture components associated with the outcome, and identifying interaction effects. Other studies have reviewed the literature or compared performance on a single data set; however, none have formally compared such a broad range of new methods in a simulation study. Second, we propose a statistical method to analyze multiple asynchronous multivariate time series with missing data for use in personal exposure assessments. We develop an infinite hidden Markov model for multiple time series to impute missing data and identify shared time-activity patterns in exposures. We estimate hidden states that represent latent environments presenting a unique distribution of a mixture of environmental exposures. Through our multiple imputation algorithm, we impute missing exposure data conditional on the hidden states. Finally, we conduct an individual-level study of the association between long-term exposure to air pollution and COVID-19 severity in a Denver, Colorado, USA cohort. We develop a Bayesian multinomial logistic regression model for data with partially missing categorical outcomes. Our model uses Polya-gamma data augmentation, and we propose a visualization approach for inference on the odds ratio. We conduct one of the first individual-level studies of air pollution exposure and COVID-19 health outcomes using detailed clinical data and individual-level air pollution exposure data.

Item (Open Access): Bayesian methods for spatio-temporal ecological processes using imagery data (Colorado State University. Libraries, 2021)
Lu, Xinyi, author; Hooten, Mevin, advisor; Kaplan, Andee, committee member; Fosdick, Bailey, committee member; Koons, David, committee member
In this dissertation, I present novel Bayesian hierarchical models to statistically characterize spatio-temporal ecological processes. I am motivated by the volatility of Alaskan ecosystems in the face of global climate change and I demonstrate methods for emerging imagery data as survey technologies advance. For the nearshore marine ecosystem, I developed a model that combines ecological diffusion and logistic growth to quantify colonization dynamics of a population that establishes long-term equilibrium over a heterogeneous environment. I also unified modeling concepts from entity resolution and capture-recapture to identify unique individuals of the population from overlapping images and infer total abundance. For the terrestrial ecosystem, I developed a stochastic state-space model to quantify the impact of climate change on the structural transformation of land cover types. The methods presented in this dissertation provide interpretable inference and employ statistical computing strategies to achieve scalability.
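
The colonization model mentioned above couples ecological diffusion with logistic growth. As a conceptual sketch only (one spatial dimension, explicit finite differences, made-up rates and a crude boundary treatment), the following simulates such a process to show how a population spreads and saturates over a heterogeneous environment.

```python
# Sketch: 1-D reaction-diffusion with logistic growth,
# u_t = D u_xx + r u (1 - u / K), by explicit finite differences.
import numpy as np

nx, dx, dt, steps = 200, 1.0, 0.1, 2000
D = np.where(np.arange(nx) < 100, 2.0, 0.5)   # heterogeneous diffusion rate
r, K = 0.05, 1.0                              # growth rate, carrying capacity

u = np.zeros(nx)
u[95:105] = 0.1                               # small initial colony
for _ in range(steps):
    lap = np.roll(u, 1) - 2 * u + np.roll(u, -1)     # discrete Laplacian
    lap[0] = lap[-1] = 0.0                           # crude boundary handling
    u = u + dt * (D * lap / dx**2 + r * u * (1 - u / K))
print(u.round(2)[::20])
```
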
Item (Open Access): Bayesian models and streaming samplers for complex data with application to network regression and record linkage (Colorado State University. Libraries, 2023)
Taylor, Ian M., author; Kaplan, Andee, advisor; Fosdick, Bailey K., advisor; Keller, Kayleigh P., committee member; Koslovsky, Matthew D., committee member; van Leeuwen, Peter Jan, committee member
Real-world statistical problems often feature complex data due to either the structure of the data itself or the methods used to collect the data. In this dissertation, we present three methods for the analysis of specific complex data: Restricted Network Regression, Streaming Record Linkage, and Generative Filtering. Network data contain observations about the relationships between entities. Applying mixed models to network data can be problematic when the primary interest is estimating unconditional regression coefficients and some covariates are exactly or nearly in the vector space of node-level effects. We introduce the Restricted Network Regression model that removes the collinearity between fixed and random effects in network regression by orthogonalizing the random effects against the covariates. We discuss the change in the interpretation of the regression coefficients in Restricted Network Regression and analytically characterize the effect of Restricted Network Regression on the regression coefficients for continuous response data. We show through simulation on continuous and binary data that Restricted Network Regression mitigates, but does not alleviate, network confounding. We apply the Restricted Network Regression model in an analysis of 2015 Eurovision Song Contest voting data and show how the choice of regression model affects inference. Data that are collected from multiple noisy sources pose challenges to analysis due to potential errors and duplicates. Record linkage is the task of combining records from multiple files which refer to overlapping sets of entities when there is no unique identifying field. In streaming record linkage, files arrive sequentially in time and estimates of links are updated after the arrival of each file. We approach streaming record linkage from a Bayesian perspective with estimates calculated from posterior samples of parameters, and present methods for updating link estimates after the arrival of a new file that are faster than fitting a joint model with each new data file. We generalize a two-file Bayesian Fellegi-Sunter model to the multi-file case and propose two methods to perform streaming updates. We examine the effect of prior distribution on the resulting linkage accuracy as well as the computational trade-offs between the methods when compared to a Gibbs sampler through simulated and real-world survey panel data. We achieve near-equivalent posterior inference at a small fraction of the compute time. Motivated by the streaming data setting and streaming record linkage, we propose a more general sampling method for Bayesian models for streaming data. In the streaming data setting, Bayesian models can employ recursive updates, incorporating each new batch of data into the model parameters' posterior distribution. Filtering methods are currently used to perform these updates efficiently; however, they suffer from eventual degradation as the number of unique values within the filtered samples decreases. We propose Generative Filtering, a method for efficiently performing recursive Bayesian updates in the streaming setting. Generative Filtering retains the speed of a filtering method while using parallel updates to avoid degenerate distributions after repeated applications. We derive rates of convergence for Generative Filtering and conditions for the use of sufficient statistics instead of storing all past data. We investigate properties of Generative Filtering through simulation and ecological species count data.
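
The restriction in Restricted Network Regression orthogonalizes the random-effect design against the fixed-effect covariates. The sketch below shows that projection step in isolation with a made-up covariate matrix; the full Bayesian network regression model is not reproduced here.

```python
# Sketch: orthogonalize a random-effect design Z against covariates X by
# projecting Z onto the orthogonal complement of the column space of X.
import numpy as np

rng = np.random.default_rng(6)
n, p, q = 100, 3, 10
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # fixed effects
Z = rng.normal(size=(n, q))                                     # random effects

# Residual projection matrix P = I - X (X'X)^{-1} X'.
P = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)
Z_restricted = P @ Z                       # restricted random-effect design

# Columns of Z_restricted are now orthogonal to every column of X.
print(np.abs(X.T @ Z_restricted).max())
```
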
Item (Open Access): Bayesian shape-restricted regression splines (Colorado State University. Libraries, 2011)
Hackstadt, Amber J., author; Hoeting, Jennifer, advisor; Meyer, Mary, advisor; Opsomer, Jean, committee member; Huyvaert, Kate, committee member
Semi-parametric and non-parametric function estimation are useful tools to model the relationship between design variables and response variables as well as to make predictions without requiring the assumption of a parametric form for the regression function. Additionally, Bayesian methods have become increasingly popular in statistical analysis since they provide a flexible framework for the construction of complex models and produce a joint posterior distribution for the coefficients that allows for inference through various sampling methods. We use non-parametric function estimation and a Bayesian framework to estimate regression functions with shape restrictions. Shape-restricted functions include functions that are monotonically increasing, monotonically decreasing, convex, concave, and combinations of these restrictions such as increasing and convex. Shape restrictions allow researchers to incorporate knowledge about the relationship between variables into the estimation process. We propose Bayesian semi-parametric models for regression analysis under shape restrictions that use a linear combination of shape-restricted regression splines such as I-splines or C-splines. We find function estimates using Markov chain Monte Carlo (MCMC) algorithms. The Bayesian framework along with MCMC allows us to perform model selection and produce uncertainty estimates much more easily than in the frequentist paradigm. Indeed, some of the work proposed in this dissertation has not been developed in parallel in the frequentist paradigm. We begin by proposing a semi-parametric generalized linear model for regression analysis under shape-restrictions. We provide Bayesian shape-restricted regression spline (Bayes SRRS) models and MCMC estimation algorithms for the normal errors, Bernoulli, and Poisson models. We propose several types of inference that can be performed for the normal errors model as well as examine the asymptotic behavior of the estimates for the normal errors model under the monotone shape-restriction. We also examine the small sample behavior of the proposed Bayes SRRS model estimates via simulation studies. We then extend the semi-parametric Bayesian shape-restricted regression splines to generalized linear mixed models. We provide a MCMC algorithm to estimate functions for the random intercept model with normal errors under the monotone shape restriction. We then further extend the semi-parametric Bayesian shape-restricted regression splines to allow the number and location of the knot points for the regression splines to be random and propose a reversible jump Markov chain Monte Carlo (RJMCMC) algorithm for regression function estimation under the monotone shape restriction. Lastly, we propose a Bayesian shape-restricted regression spline change-point model where the regression function is shape-restricted except at the change-points. We provide RJMCMC algorithms to estimate functions with change-points where the number and location of interior knot points for the regression splines are random. We provide a RJMCMC algorithm to estimate the location of an unknown change-point as well as a RJMCMC algorithm to decide between a model with no change-points and model with a change-point.
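
For a flavor of shape-restricted spline regression, the sketch below fits a monotone increasing curve by writing it as an intercept plus a nonnegative combination of integrated B-splines (an I-spline-style basis) and solving a bound-constrained least squares problem. This is a frequentist stand-in for the Bayesian SRRS models described above, with simulated data and arbitrary knot choices.

```python
# Sketch: monotone increasing fit using integrated B-splines (I-spline-like
# basis) with nonnegative coefficients; a stand-in for the Bayesian version.
import numpy as np
from scipy.interpolate import BSpline
from scipy.optimize import lsq_linear

rng = np.random.default_rng(7)
x = np.sort(rng.uniform(0, 1, 200))
y = 2 / (1 + np.exp(-8 * (x - 0.5))) + rng.normal(scale=0.15, size=x.size)

deg = 3
interior = np.linspace(0, 1, 7)[1:-1]
knots = np.concatenate((np.zeros(deg + 1), interior, np.ones(deg + 1)))
n_basis = len(knots) - deg - 1

# Integrated B-spline basis functions are nondecreasing, so nonnegative
# coefficients give a monotone increasing fit.
I = np.column_stack([
    BSpline(knots, np.eye(n_basis)[j], deg).antiderivative()(x)
    for j in range(n_basis)
])
A = np.column_stack([np.ones(x.size), I])          # intercept + I-splines
lb = np.concatenate(([-np.inf], np.zeros(n_basis)))
fit = lsq_linear(A, y, bounds=(lb, np.full(A.shape[1], np.inf)))
y_hat = A @ fit.x                                  # monotone fitted values
```
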
Item (Embargo): Bayesian tree based methods for longitudinally assessed environmental mixtures (Colorado State University. Libraries, 2024)
Im, Seongwon, author; Wilson, Ander, advisor; Keller, Kayleigh, committee member; Koslovsky, Matt, committee member; Neophytou, Andreas, committee member
In various fields, there is interest in estimating the lagged association between an exposure and an outcome. This is particularly common in environmental health studies, where exposure to an environmental chemical is measured repeatedly during gestation for the assessment of its lagged effects on a birth outcome. The relationship between longitudinally assessed environmental mixtures and a health outcome is also of increasing interest. For a single exposure, a distributed lag model (DLM) is a widely used method that provides an appropriate temporal structure for estimating the time-varying effects. For mixture exposures, a distributed lag mixture model is used to address the main effect of each exposure and lagged interactions among exposures. The main inferential goals include estimating the lag-specific effects and identifying a window of susceptibility, during which a fetus is particularly vulnerable. In this dissertation, we propose novel statistical methods for estimating exposure effects of longitudinally assessed environmental mixtures in various scenarios. First, we propose a method that can estimate a linear exposure-time-response function between mixture exposures and a count outcome that may be zero-inflated and overdispersed. To achieve this, we employ Bayesian Pólya-Gamma data augmentation within a treed distributed lag mixture model framework. We apply the method to estimate the relationship between weekly average fine particulate matter (PM2.5) and temperature and pregnancy loss, using a live-birth-identified conception time-series design with administrative data from Colorado. Second, we propose a tree triplet structure to allow for heterogeneity in exposure effects in an environmental mixture exposure setting. Our method accommodates modifier and exposure selection, which allows for personalized and subgroup-specific effect estimation and windows of susceptibility identification. We apply the method to Colorado administrative birth data to examine the heterogeneous relationship between PM2.5 and temperature and birth weight. Finally, we introduce an R package dlmtree that integrates tree structured DLM methods into convenient software. We provide an overview of the embedded tree structured DLMs and use simulated data to demonstrate a model fitting process, statistical inference, and visualization.
Item (Open Access): Bayesian treed distributed lag models (Colorado State University. Libraries, 2021)
Mork, Daniel S., author; Wilson, Ander, advisor; Sharp, Julia, committee member; Keller, Josh, committee member; Neophytou, Andreas, committee member
In many applications there is interest in regressing an outcome on exposures observed over a previous time window. This frequently arises in environmental epidemiology where either a health outcome on one day is regressed on environmental exposures (e.g., temperature or air pollution) observed on that day and several preceding days, or when a birth or children's health outcome is regressed on exposures observed daily or weekly throughout pregnancy. The distributed lag model (DLM) is a statistical method commonly implemented to estimate an exposure-time-response function by regressing the outcome on repeated measures of a single exposure over a preceding time period, for example, mean exposure during each week of pregnancy. Inferential goals include estimating the exposure-time-response function and identifying critical windows during which exposures can alter a health endpoint. In this dissertation, we develop novel formulations of Bayesian additive regression trees that allow for estimating a DLM. First, we propose treed distributed lag nonlinear models to estimate the association between weekly maternal exposure to air pollution and a birth outcome when the exposure-response relation is nonlinear. We introduce a regression tree-based model that accommodates a multivariate predictor along with parametric control for fixed effects. Second, we propose a tree-based method for estimating the association between repeated measures of a mixture of multiple pollutants and a health outcome. The proposed approach introduces regression tree pairs, which allow for estimation of marginal effects of exposures along with structured interactions that account for the temporal ordering of the exposure data. Finally, we present a framework to estimate a heterogeneous DLM in the presence of a potentially high dimensional set of modifying variables. We present simulation studies to validate the models. We apply these methods to estimate the association between ambient pollution exposures and birth weight for a Colorado, USA birth cohort.
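
A basic (non-treed, linear) distributed lag model can be fit by constraining the lag coefficients to lie in a smooth basis: writing theta = B gamma, the regression of the outcome on the lagged exposure matrix X reduces to an ordinary regression of y on XB. The sketch below shows this standard construction on simulated weekly exposures; it is background for, not a reproduction of, the treed models described above.

```python
# Sketch: a classical distributed lag model with lag coefficients expanded in
# a low-dimensional polynomial basis, theta = B @ gamma; data are simulated.
import numpy as np

rng = np.random.default_rng(8)
n, L = 500, 37                                   # subjects, weeks of exposure
X = rng.normal(size=(n, L))                      # weekly exposures
weeks = np.arange(L)
theta_true = np.exp(-((weeks - 15) ** 2) / 40)   # a "critical window" shape
y = X @ theta_true + rng.normal(size=n)

# Constrain the lag curve to a low-dimensional polynomial basis.
B = np.vander(weeks / L, N=5, increasing=True)   # L x 5 basis matrix
gamma, *_ = np.linalg.lstsq(X @ B, y, rcond=None)
theta_hat = B @ gamma                            # estimated lag coefficients
print(np.round(theta_hat, 2))
```
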
Item (Open Access): Causality and clustering in complex settings (Colorado State University. Libraries, 2023)
Gibbs, Connor P., author; Keller, Kayleigh, advisor; Fosdick, Bailey, advisor; Koslovsky, Matthew, committee member; Kaplan, Andee, committee member; Anderson, Brooke, committee member
Causality and clustering are at the forefront of many problems in statistics. In this dissertation, we present new methods and approaches for drawing causal inference with temporally dependent units and clustering nodes in heterogeneous networks. To begin, we investigate the causal effect of a timeout on stopping an opposing team's run in the National Basketball Association (NBA). After formalizing the notion of a run in the NBA and in light of the temporal dependence among runs, we define the units under study with careful consideration of the stable unit-treatment-value assumption pertinent to the Rubin causal model. After introducing a novel, interpretable outcome based on the score difference, we conclude that while comebacks frequently occur after a run, it is slightly disadvantageous to call a timeout during a run by the opposing team. Further, we demonstrate that the magnitude of this effect varies by franchise, lending clarity to an oft-debated topic among sports fans. Following, we represent the known relationships among and between genetic variants and phenotypic abnormalities as a heterogeneous network and introduce a novel analytic pipeline to identify clusters containing undiscovered gene to phenotype relations (ICCUR) from the network. ICCUR identifies, scores, and ranks small heterogeneous clusters according to their potential for future discovery in a large temporal biological network. We train an ensemble model of boosted regression trees to predict clusters' potential for future discovery using observable cluster features, and show the resulting clusters contain significantly more undiscovered gene to phenotype relations than expected by chance. To demonstrate its use as a diagnostic aid, we apply the results of the ICCUR pipeline to real, undiagnosed patients with rare diseases, identifying clusters containing patients' co-occurring yet otherwise unconnected genotypic and phenotypic information, some of which have since been validated by human curation. Motivated by ICCUR and its application, we introduce a novel method called ECoHeN (pronounced "eco-hen") to extract communities from heterogeneous networks in a statistically meaningful way. Using a heterogeneous configuration model as a reference distribution, ECoHeN identifies communities that are significantly more densely connected than expected given the node types and connectivity of its membership without imposing constraints on the type composition of the extracted communities. The ECoHeN algorithm identifies communities one at a time through a dynamic set of iterative updating rules and is guaranteed to converge. To our knowledge, this is the first discovery method that distinguishes and identifies both homogeneous and heterogeneous, possibly overlapping, community structure in a network. We demonstrate the performance of ECoHeN through simulation and in application to a political blogs network to identify collections of blogs which reference one another more than expected considering the ideology of their members. Along with small partisan communities, we demonstrate ECoHeN's ability to identify a large, bipartisan community undetectable by canonical community detection methods and denser than modern, competing methods.
Item (Open Access): Change-Point estimation using shape-restricted regression splines (Colorado State University. Libraries, 2016)
Liao, Xiyue, author; Meyer, Mary C., advisor; Breidt, F. Jay, committee member; Homrighausen, Darren, committee member; Belfiori, Elisa, committee member
Change-point estimation is needed in fields such as climate change, signal processing, economics, and dose-response analysis, but it has not yet been fully explored. We consider estimating a regression function ƒm and a change-point m, where m is a mode, an inflection point, or a jump point. Linear inequality constraints are used with spline regression functions to estimate m and ƒm simultaneously using profile methods. For a given m, the maximum-likelihood estimate of ƒm is found using constrained regression methods, then the set of possible change-points is searched to find the estimate m̂ that maximizes the likelihood. Convergence rates are obtained for each type of change-point estimator, and we show an oracle property, that the convergence rate of the regression function estimator is as if m were known. Parametrically modeled covariates are easily incorporated in the model. Simulations show that for small and moderate sample sizes, these methods compare well to existing methods. The scenario when the random error is from a stationary autoregressive process is also presented. Under such a scenario, the change-point and parameters of the stationary autoregressive process, such as autoregressive coefficients and the model variance, are estimated together via Cochrane-Orcutt-type iterations. Simulations are conducted and it is shown that the change-point estimator performs well in terms of choosing the right order of the autoregressive process. Penalized spline-based regression is also discussed as an extension. Given a large number of knots and a penalty parameter which controls the effective degrees of freedom of a shape-restricted model, penalized methods give smoother fits while balancing under- and over-fitting. A bootstrap confidence interval for a change-point is established. By generating random change-points from a curve on the unit interval, we compute the coverage rate of the bootstrap confidence interval using penalized estimators, which shows advantages such as robustness over competitors. The methods are available in the R package ShapeChange on the Comprehensive R Archival Network (CRAN). Moreover, we discuss the shape selection problem when there is more than one possible shape for a given data set. A project with the Forest Inventory & Analysis (FIA) scientists is included as an example. In this project, we apply shape-restricted spline-based estimators, among which the one-jump and double-jump estimators are emphasized, to time-series Landsat imagery for the purpose of modeling, mapping, and monitoring annual forest disturbance dynamics. For each pixel and spectral band or index of choice in temporal Landsat data, our method delivers a smoothed rendition of the trajectory constrained to behave in an ecologically sensible manner, reflecting one of seven possible "shapes". Routines to realize the methodology are built in the R package ShapeSelectForest on CRAN, and techniques in this package are being applied for forest disturbance and attribute mapping across the conterminous U.S. The Landsat community will implement techniques in this package on the Google Earth Engine in 2016. Finally, we consider change-point estimation with generalized linear models. Such work can be applied to dose-response analysis, when the effect of a drug increases as the dose increases to a saturation point, after which the effect starts decreasing.
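
The profile idea described above (fit the regression for each candidate change-point, then keep the candidate with the best fit) can be sketched with an ordinary broken-stick least squares fit in place of the shape-restricted spline fit; the candidate grid and data below are simulated.

```python
# Sketch: profile search for a change-point m; for each candidate, fit a
# broken-stick regression and keep the m with the smallest residual sum of
# squares (a plain least-squares stand-in for the constrained spline fit).
import numpy as np

rng = np.random.default_rng(9)
x = np.linspace(0, 1, 300)
y = np.where(x < 0.6, 2 * x, 1.2 - 1.5 * (x - 0.6)) + rng.normal(0, 0.1, x.size)

def sse_given_m(m):
    A = np.column_stack([np.ones_like(x), x, np.maximum(x - m, 0.0)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sum((y - A @ coef) ** 2)

candidates = np.linspace(0.05, 0.95, 181)
m_hat = candidates[np.argmin([sse_given_m(m) for m in candidates])]
print(m_hat)
```
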
Item (Open Access): Confidence regions for level curves and a limit theorem for the maxima of Gaussian random fields (Colorado State University. Libraries, 2009)
French, Joshua, author; Davis, Richard A., advisor
One of the most common display tools used to represent spatial data is the contour plot. Informally, a contour plot is created by taking a "slice" of a three-dimensional surface at a certain level of the response variable and projecting the slice onto the two-dimensional coordinate-plane. The "slice" at each level is known as a level curve.

Item (Open Access): Constrained spline regression and hypothesis tests in the presence of correlation (Colorado State University. Libraries, 2013)
Wang, Huan, author; Meyer, Mary C., advisor; Opsomer, Jean D., advisor; Breidt, F. Jay, committee member; Reich, Robin M., committee member
Extracting the trend from the pattern of observations is always difficult, especially when the trend is obscured by correlated errors. Often, prior knowledge of the trend does not include a parametric family, and instead the valid assumptions are vague, such as "smooth" or "monotone increasing." Incorrectly specifying the trend as some simple parametric form can lead to overestimation of the correlation, and conversely, misspecifying or ignoring the correlation leads to erroneous inference for the trend. In this dissertation, we explore spline regression with shape constraints, such as monotonicity or convexity, for estimation and inference in the presence of stationary AR(p) errors. Standard criteria for selection of the penalty parameter, such as the Akaike information criterion (AIC), cross-validation, and generalized cross-validation, have been shown to behave badly when the errors are correlated, even in the absence of shape constraints. In this dissertation, the correlation structure and penalty parameter are selected simultaneously using a correlation-adjusted AIC. The asymptotic properties of unpenalized spline regression in the presence of correlation are investigated. It is proved that even if the estimation of the correlation is inconsistent, the corresponding projection estimation of the regression function can still be consistent and have the optimal asymptotic rate, under appropriate conditions. The constrained spline fit attains the convergence rate of the unconstrained spline fit in the presence of AR(p) errors. Simulation results show that the constrained estimator typically behaves better than the unconstrained version if the true trend satisfies the constraints. Traditional statistical tests for the significance of a trend rely on restrictive assumptions on the functional form of the relationship, e.g., linearity. In this dissertation, we develop testing procedures that incorporate shape restrictions on the trend and can account for correlated errors. These tests can be used in checking whether the trend is constant versus monotone, linear versus convex/concave, and combinations of these, such as constant versus increasing and convex. The proposed likelihood ratio test statistics have an exact null distribution if the covariance matrix of the errors is known. Theorems are developed for the asymptotic distributions of the test statistics if the covariance matrix is unknown but the test statistics incorporate a consistent estimator of the correlation. Comparisons of the proposed test with the F-test using the unconstrained alternative fit and the one-sided t-test using a simple regression alternative fit are conducted through intensive simulations. Both the size and the power of the proposed test are favorable, with smaller size and greater power in general, compared to the F-test and t-test.
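
To illustrate why correlated errors matter for trend estimation, the sketch below fits a simple polynomial trend, estimates an AR(1) coefficient from the residuals, and refits after a Cochrane-Orcutt-style prewhitening transform. Shape constraints and the correlation-adjusted AIC of the dissertation are not included, and the trend basis and AR(1) parameters are made up.

```python
# Sketch: trend estimation with AR(1) errors via one Cochrane-Orcutt-style
# iteration (no shape constraints, no correlation-adjusted AIC).
import numpy as np

rng = np.random.default_rng(10)
n = 300
t = np.linspace(0, 1, n)
e = np.zeros(n)
for i in range(1, n):                     # AR(1) errors with rho = 0.6
    e[i] = 0.6 * e[i - 1] + rng.normal(scale=0.2)
y = np.sin(np.pi * t) + e                 # smooth trend plus correlated noise

X = np.vander(t, N=4, increasing=True)    # cubic-polynomial trend basis
b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
r = y - X @ b_ols
rho_hat = np.sum(r[1:] * r[:-1]) / np.sum(r[:-1] ** 2)   # AR(1) estimate

# Prewhiten (quasi-difference) and refit: generalized least squares in effect.
y_star = y[1:] - rho_hat * y[:-1]
X_star = X[1:] - rho_hat * X[:-1]
b_gls, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)
print(rho_hat, b_gls)
```
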