Theses and Dissertations

Permanent URI for this collection: https://hdl.handle.net/10217/100519


Recent Submissions

Now showing 1 - 20 of 97
  • Item (Open Access)
    Statistical inference on reproducibility in high-throughput experiments
    (Colorado State University. Libraries, 2025) Ellingworth, Austin, author; Guan, Yawen, advisor; Zhou, Wen, advisor; Keller, Kayleigh, committee member; Kokoszka, Piotr, committee member; Mykles, Donald, committee member
    Results in high-throughput genomics are known to have large variability across independent replicate studies. For this reason, the formal assessment of the agreement of results for many hypotheses across replicate studies has been a burgeoning area of research in statistical genomics. Hypotheses with consistent results are called reproducible, while those without consistency are called irreproducible. The presence of reproducibility in experimental research is critical, as it ensures the validity of findings. In this dissertation, we devise three methods for assessing the reproducibility of results from high-throughput genomic studies, each with advantages under certain settings. First, we notice that many of the existing approaches to assessing the reproducibility of results from two replicate high-throughput genomics studies either depend on strict parametric assumptions on available summary statistics or fail to properly consider the consistency of reproducible signal across experiments in addition to its strength. Motivated by Philtron et al. (2018), we introduce a function based on the rankings of summary statistics from each experiment to define a notion for reproducibility and identify reproducible hypotheses. The proposed nonparametric statistic takes into account both the signal strength and consistency of results. By examining the geometry of the space of ranks of summary statistics and utilizing the negative association dependence structure of ranks, a novel procedure is introduced for recognizing reproducible findings while controlling the false discovery rate (FDR). This method controls FDR under relatively mild assumptions. The theoretical FDR findings are validated through simulations that also reveal the method to be more powerful than existing procedures. Finally, the procedure is applied to two large-scale TWAS datasets, uncovering reproducible features. Second, we notice that existing methods for assessing the reproducibility of high-throughput studies ignore the known group structures of genetic features, such as transcripts belonging to the same gene or genes belonging to the same pathway. Motivated by Li et al. (2011) and Liu et al. (2016), we present an empirical Bayesian framework for reproducibility that incorporates this group structure. Additionally, we introduce algorithms for testing reproducibility at the hypothesis and group levels that maintain control of posterior FDR. Next, a data-driven estimation procedure based on the EM algorithm is proposed to enable the application of these algorithms when the parameters they rely on are unknown. In simulation, we show that the inclusion of the group structure in the hypothesis-level procedure leads to superior performance in terms of power and FDR control compared to more naive methods, and that the group-level procedure outperforms methods that rely on aggregation prior to analysis. The proposed procedures enable researchers to integrate known group structure information into the reproducibility problem, yielding higher-quality results. Finally, while a substantial body of literature addresses reproducibility across two replicate studies, strikingly few methods consider cases with more than two studies, and those that exist generally assume the distributions of irreproducible summary statistics are known. 
    Leveraging Kendall's coefficient of concordance, we introduce a rank-based statistic that quantifies the agreement of results for a particular hypothesis without enforcing such strict assumptions. Noticing that real high-throughput genomic settings include many "housekeeping" genes that are unrelated to the disease of interest and thus can be considered as a control set, we utilize conformal inference and bootstrapping techniques to devise three procedures for calculating approximate p-values from a set of the proposed statistics that can be used to discover reproducible hypotheses at a nominal level of FDR. Simulation studies reveal that the three methods perform favorably compared to existing methods in terms of power and FDR control. Applying the methods to single-cell expression data from five COVID-19 studies, we show that the proposed statistic and its procedures can identify genes and gene pathways associated with COVID-19.
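    As a point of reference for the rank-agreement idea above, the sketch below computes the classical Kendall's coefficient of concordance W for m studies ranking the same n hypotheses; the dissertation's statistic is a per-hypothesis modification of this quantity, and the function and variable names here are illustrative assumptions rather than the author's code.

```python
import numpy as np

def kendalls_w(ranks):
    """Classical Kendall's coefficient of concordance (no ties).

    ranks: (m, n) array; row i holds study i's ranks of the n hypotheses.
    Returns W in [0, 1]; values near 1 indicate strong agreement across studies.
    """
    m, n = ranks.shape
    rank_sums = ranks.sum(axis=0)                    # total rank of each hypothesis
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()  # spread of the rank sums
    return 12.0 * s / (m ** 2 * (n ** 3 - n))

# Toy example: three studies ranking five hypotheses.
rng = np.random.default_rng(0)
ranks = np.vstack([rng.permutation(np.arange(1, 6)) for _ in range(3)])
print(kendalls_w(ranks))
```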
  • Item (Open Access)
    Bayesian approaches to extreme value modeling, with applications to wildfires
    (Colorado State University. Libraries, 2025) Lawler, Elizabeth S., author; Shaby, Benjamin, advisor; Cooley, Daniel, committee member; Zhou, Tianjian, committee member; Mahmoud, Hussam, committee member
    The growing frequency and size of wildfires across the US necessitate accurate quantitative assessment of evolving wildfire behavior to predict risk from future extreme wildfires. In Chapter 2, we build a joint model of wildfire counts and burned areas, regressing key model parameters on climate and demographic covariates. We use extended generalized Pareto distributions to model the full distribution of burned areas, capturing both moderate and extreme sizes, while leveraging extreme value theory to focus particularly on the right tail. We model wildfire counts using a zero-inflated negative binomial model and join the wildfire counts and burned areas sub-models via a temporally varying shared random effect. Our model successfully captures the trends of wildfire counts and burned areas. By investigating the predictive power of different sets of covariates, we find that fire indices are better predictors of wildfire burned area behavior than individual climate covariates, whereas climate covariates are influential drivers of wildfire occurrence behavior. Recent advances in multivariate extreme value modeling leverage a geometric perspective, using the shape of the multivariate point cloud and its connection to the Lebesgue joint density, to make inference on joint tail probabilities. While the original statistical framework was fully parametric, relying on a gauge function that uniquely defines the shape for a given density, newer methods have introduced semi- and non-parametric alternatives to increase flexibility. In Chapter 3, we propose a modeling approach that retains the simplicity of the parametric framework but adds flexibility by using Bayesian model averaging (BMA) to improve prediction of tail risk probabilities. In contrast to previous works that rely solely on a truncated radial likelihood, we propose using a censored likelihood, which we find consistently outperforms the truncated radial likelihood, particularly in small-sample settings. To generate predictions, we use a simple importance sampling scheme that matches the accuracy of more complex methods at a fraction of the computational cost. Finally, we apply our approach to two fire weather indices, which are designed to capture somewhat orthogonal aspects of fire risk, to illustrate the practical utility of our method in environmental applications.
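    For readers unfamiliar with the extended generalized Pareto family mentioned above, one simple member of that family, written here in generic notation as an illustration and not necessarily the exact parameterization used in the dissertation, applies a power transform to the generalized Pareto CDF:

\[
F(x) = \bigl\{ H_{\sigma,\xi}(x) \bigr\}^{\kappa},
\qquad
H_{\sigma,\xi}(x) = 1 - \Bigl(1 + \xi \frac{x}{\sigma}\Bigr)_{+}^{-1/\xi},
\qquad x > 0,
\]

    so that $\kappa > 0$ shapes the bulk and lower tail of burned areas while the upper tail retains the usual generalized Pareto behavior governed by $\sigma$ and $\xi$.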
  • Item (Open Access)
    New developments on linear regression with random design and high-dimensional mediation analysis
    (Colorado State University. Libraries, 2025) Zhang, Zifeng, author; Wang, Haonan, advisor; Zhou, Wen, advisor; Kokoszka, Piotr, committee member; Breidt, F. Jay, committee member; Luo, Jie, committee member
    Linear regression is arguably the most widely used statistical method. In this thesis, we study the robustness of the least squares estimator when regressors are random and the errors are correlated with unknown correlation structure. We further investigate the small-sample robustness of the least squares estimator and offer a new geometric perspective on the F-test. Motivated by the Baron-Kenny approach, we also apply linear models to high-dimensional mediation analysis with the treatment-by-mediator interaction. In linear regression with fixed regressors and correlated errors, the conventional wisdom is to modify the variance-covariance estimator to accommodate the known correlation structure of the errors. We depart from the literature by showing that with random regressors, linear regression inference is robust to correlated errors with unknown correlation structure. The existing theoretical analyses for linear regression are no longer valid because even the asymptotic normality of the least-squares coefficients breaks down in this regime. We first prove the asymptotic normality of the t statistics by establishing their Berry–Esseen bounds based on a novel probabilistic analysis of self-normalized statistics. We then study the local power of the corresponding t tests and show that, perhaps surprisingly, error correlation can even enhance power in the regime of weak signals. Overall, our results show that linear regression is applicable more broadly than the conventional theory suggests, and further demonstrate the value of randomization to ensure robustness of inference. Next, we explore the small-sample robustness of the least squares estimator through the F and t tests. The F distribution is one of the most widely applied statistical tools in small-sample inference, and it has been recognized that its definition does not necessarily require normality, but merely a spherical distribution. While existing literature has touched upon the relationship between the F and spherical distributions, these discussions remain either incomplete or not rigorously structured. We provide a geometric perspective that clearly delineates the relationship between F and spherical distributions, and introduce a novel definition of the F distribution. Perhaps surprisingly, based on this new definition, in the linear model, the validity of the ordinary least squares F-test and t-test is preserved under spherical symmetry of the design matrix, even if the error terms have non-zero means, heteroscedasticity, strong correlations, or heavy tails. Finally, we apply the linear model (the Baron-Kenny approach) to mediation analysis in a high-dimensional setting with interactions. Mediation analysis has been commonly applied in various fields, including economics, finance, and genomic and genetic research. A key challenge in this domain is the inference of natural direct and indirect effects in the presence of potential interactions between treatment and high-dimensional mediators. These interactions often give rise to moderator effects, which are further complicated by the intricate dependencies among the mediators. Here, we introduce a new inference procedure that addresses this challenge. By incorporating a non-convex penalty into the outcome model, our method effectively identifies important mediators while accounting for their interactions with the treatment, and enjoys a guaranteed oracle property. 
Leveraging the oracle property, we can exploit a projection onto the mediator model, guided by the estimated important direction in the mediator space. We establish the asymptotic normality of both natural indirect and direct effects for inference. Additionally, we develop an algorithm that utilizes the overlapping group SCAD penalty to promote heredity structure among the main effects and interactions, which comes with provable guarantees. Our extensive numerical studies, comparing our method with other existing approaches across various scenarios, demonstrate its effectiveness. To illustrate the practical application of our methods, we conduct a study investigating the impact of childhood trauma on cortisol stress reactivity. Using DNA methylation loci as mediators, we uncover several new loci that remain undetected when interactions are ignored.
  • Item (Open Access)
    Statistical modeling of high-dimensional categorical data with applications to mutation fitness and sparse text topic analysis
    (Colorado State University. Libraries, 2025) Dai, Bingying, author; Zhao, Yunpeng, advisor; Zhou, Wen, advisor; Cooley, Daniel, committee member; Zhou, Tianjian, committee member; Blanchard, Nathaniel, committee member
    The growing availability of large-scale categorical data has created a strong need for statistical methods capable of modeling high-dimensional discrete structures. Such data are common in fields like biological sequence analysis, natural language processing, and social network modeling, where observations often involve thousands of categorical or count-valued variables, exhibiting complex dependencies and high sparsity. Conventional statistical models, designed for continuous or low-dimensional settings, often fall short in capturing the latent structure and combinatorial complexity of such data. This dissertation introduces new statistical modeling frameworks and estimation techniques tailored for high-dimensional categorical data, supported by theoretical guarantees and validated through applications in protein sequence analysis and topic modeling. The first part of the dissertation focuses on modeling mutational fitness in proteins, where predicting the effects of amino acid mutations is challenging due to the vast combinations of sites and amino acid types. We propose a new framework for analyzing protein sequences using the Potts model with node-wise high-dimensional multinomial regression. Our method identifies key site interactions and important amino acids, quantifying mutation effects through evolutionary energy derived from model parameters. It encourages sparsity in both site-wise and amino acid-wise dependencies through element-wise and group sparsity. We have established, for the first time to our knowledge, the ℓ2 convergence rate for estimated parameters in the high-dimensional Potts model using sparse group Lasso, matching the existing minimax lower bound for high-dimensional linear models with a sparse group structure, up to a factor depending only on the multinomial nature of the Potts model. This theoretical guarantee enables accurate quantification of estimated energy changes. Additionally, we incorporate structural data into our model by applying penalty weights across site pairs. Our method outperforms others in predicting mutation fitness, as demonstrated by comparisons with high-throughput mutagenesis experiments across 12 protein families. The second part focuses on topic modeling, which is a fundamental technique for uncovering latent semantic structures in large text corpora. While traditional probabilistic models such as Latent Dirichlet Allocation and probabilistic Latent Semantic Indexing have been widely adopted, they often rely on assumptions that do not align well with the properties of real-world text data, particularly the pervasive presence of zero counts. These structural zeros, especially in short documents, often reflect more than random sampling variability and can indicate meaningful absence. To address these limitations, we propose a novel Zero-Inflated Poisson model that incorporates three essential components: a zero-inflation mechanism explicitly accounting for excess zeros that arise from structural rather than sampling sources; a functional link connecting the zero-inflation probability to the Poisson intensity to capture informative missingness related to topic prevalence; and document-level random effects accounting for unobserved heterogeneity across documents. An efficient alternating optimization algorithm is developed for intensity parameter estimation under a low-rank structure. We establish finite-sample error bounds for topic-word matrix recovery via a vertex hunting procedure. 
Empirical studies on synthetic datasets show that the model outperforms existing methods in sparse and heterogeneous settings. Application to a real-world corpus of statistical publications further confirms the model's ability to recover meaningful topics and track their evolution over time.
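    As a rough sketch of the zero-inflated Poisson structure described above (generic notation assumed here; the dissertation's exact link and random-effect specification may differ), the count of word $w$ in document $d$ could be modeled as

\[
P(Y_{dw} = y) = \pi_{dw}\,\mathbf{1}\{y = 0\} + (1 - \pi_{dw})\,\frac{\lambda_{dw}^{\,y}\, e^{-\lambda_{dw}}}{y!},
\qquad
\pi_{dw} = g(\lambda_{dw}),
\]

    where the intensity matrix $(\lambda_{dw})$ carries the low-rank topic structure, $g$ is the functional link that ties the excess-zero probability to topic prevalence, and document-level random effects would enter through $\lambda_{dw}$.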
  • Item (Embargo)
    Methods for effect modification with multivariate environmental exposures
    (Colorado State University. Libraries, 2025) Demateis, Danielle, author; Wilson, Ander, advisor; Keller, Kayleigh, advisor; Cooley, Dan, committee member; Wang, Tianying, committee member; Magzamen, Sheryl, committee member
    Humans are exposed to a multitude of environmental insults every day. Exposures such as air pollution, heat and extreme weather, heavy metals, and environmental chemicals are known to be linked to adverse health outcomes. There is interest in understanding how multivariate exposures, including repeated measures of the same exposure over time for the same observation and measures of multiple exposures at a single time point, impact health. Several statistical approaches have been proposed for the analysis of multivariate exposure data. Two commonly used methods are distributed lag models (DLMs) for repeated measures of exposure and Bayesian kernel machine regression (BKMR) for multiple exposures observed at a single time point. These methods and their variants are widely used in environmental health studies. However, they lack flexibility to estimate effect modification in most settings. In this dissertation, I develop methods to include effect modification in both DLMs and BKMR. The first method is the distributed lag interaction model (DLIM), which extends the standard distributed lag framework to allow for modification of the exposure-time-response function by a single continuous variable. I use a cross-basis, or bi-dimensional function space, inspired by the distributed lag non-linear framework to simultaneously describe the temporal and modification structure. Next, I develop a distributed lag interaction model with index modification (DLIM-IM) that allows for modification of the exposure-time-response function by multiple modifiers via a data-derived modification index. I use a Bayesian hierarchical framework to simultaneously estimate the exposure-time-response function and a weighted modification index, and I allow for selection on the candidate modifiers. Finally, I propose and evaluate extensions of the BKMR framework to include effect modification by a categorical modifier. I propose a new version of BKMR with a separable covariance function that allows for increased flexibility to estimate effect modification, and I compare alternative ways to apply BKMR for assessing modification. I validate each of these methods through simulation and apply them to multiple data sets to demonstrate their application. I have made open-source software for the methods publicly available on CRAN and GitHub.
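    To make the distributed lag building block of the DLIM concrete, here is a minimal sketch of a basis-constrained distributed lag model fit by penalized least squares; it omits the modifier dimension (the cross-basis) entirely, and all names are illustrative assumptions rather than code from the DLIM or dlmtree software.

```python
import numpy as np

def lag_basis(n_lags, df):
    """Simple polynomial basis over the lags (a stand-in for a spline basis)."""
    t = np.linspace(0.0, 1.0, n_lags)
    return np.vander(t, df, increasing=True)            # (n_lags, df)

def fit_dlm(X, y, df=4, ridge=1e-6):
    """Fit y_i = sum_t x_i(t) * beta(t) + e_i with beta(t) = B(t) @ theta.

    X: (n, n_lags) exposures over the lags; y: (n,) outcomes.
    Returns the estimated lag-response curve beta(t).
    """
    B = lag_basis(X.shape[1], df)
    Z = X @ B                                            # reduced (n, df) design
    theta = np.linalg.solve(Z.T @ Z + ridge * np.eye(df), Z.T @ y)
    return B @ theta

rng = np.random.default_rng(1)
n, n_lags = 500, 30
beta_true = np.exp(-np.linspace(0.0, 4.0, n_lags))       # effect decaying over lags
X = rng.normal(size=(n, n_lags))
y = X @ beta_true + rng.normal(size=n)
print(np.round(fit_dlm(X, y)[:5], 2))
```

    A DLIM would additionally cross the lag basis with a basis in the modifier, so that the estimated curve varies smoothly with the modifying variable.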
  • Item (Embargo)
    Accounting for spatial confounding in large scale epidemiological studies
    (Colorado State University. Libraries, 2025) Rainey, Maddie J., author; Keller, Kayleigh, advisor; Wilson, Ander, committee member; Guan, Yawen, committee member; Anderson, Brooke, committee member
    Epidemiological analyses of environmental risk factors often include spatially-varying exposures and outcomes. Unmeasured, spatially-varying factors can lead to confounding bias in estimates of associations. In this dissertation, I present a comparison of existing and new methods that use thin plate regression splines to mitigate spatial confounding bias for both cross-sectional and longitudinal analyses. I also introduce a metric to quantify the spatial smoothing induced by thin plate regression splines in varying geographic domains. I first investigate cross-sectional data, directly comparing existing approaches based on information criteria and cross-validation metrics, and additionally introduce a hybrid selection method that combines features from multiple existing approaches. Based on a simulation study, I make recommendations for the best approach in different settings and demonstrate their use in a study of the effects of environmental exposures on birth weight in a Colorado cohort. Next, I develop an effective bandwidth metric that quantifies the relationship between spatial splines and the range of implied spatial smoothing. I present an R Shiny application, spconfShiny, that provides a user-friendly platform to compute the metric. spconfShiny can be accessed at https://g2aging.shinyapps.io/spconfShiny/. We illustrate the procedure to compute the effective bandwidth and demonstrate its use for different numbers of spatial splines across England, India, Ireland, Northern Ireland, and the United States. Finally, I extend two cross-sectional methods for spatial confounding adjustment to model longitudinal and time-to-event data. The additional temporal component in the data requires an additional choice of which coordinates to use to create the thin plate regression spline basis: the spatial coordinates, temporal coordinates, or both the spatial and temporal coordinates. I demonstrate these methods for mixed models, generalized estimating equation models, and a proportional hazards regression framework. I illustrate the application of these methods in a study of the effects of tropical cyclone wind exposures on preterm birth in a North Carolina cohort.
  • Item (Open Access)
    Tail dependence: application, exploration, and development of novel methods
    (Colorado State University. Libraries, 2025) Wixson, Troy P., author; Cooley, Daniel, advisor; Shaby, Benjamin, advisor; Huang, Dongzhou, committee member; Wang, Tianying, committee member; Barnes, Elizabeth, committee member
    The study of multivariate extreme events is largely concerned with modeling the dependence in the tail of the joint distribution. The understanding of extremal dependence and methodology for modeling that dependence have been an active research field over the past few decades, and we contribute to that literature with three projects that are detailed in this dissertation. In the first project we consider the challenge of assessing the changing risk of wildfires. Wildfire risk is greatest during high winds after sustained periods of dry and hot conditions. This chapter is a statistical extreme event risk attribution study that aims to answer whether extreme wildfire seasons are more likely now than under past climate. This requires modeling temporal dependence at extreme levels. We propose the use of transformed-linear time series models which are constructed similarly to traditional ARMA models while having a dependence structure that is tied to a widely used framework for extremes (regular variation). We fit the models to the extreme values of the seasonally adjusted Fire Weather Index (FWI) time series to capture the dependence in the upper tail for past and present climate. Ten thousand fire seasons are simulated from each fitted model and we compare the proportion of simulated high-risk fire seasons to quantify the increase in risk. Our method suggests that the risk of experiencing an extreme wildfire season in Grand Lake, Colorado under current climate has increased dramatically compared to the risk under the climate of the mid-20th century. Our method also finds some evidence of increased risk of extreme wildfire seasons in Quincy, California, but large uncertainties do not allow us to reject a null hypothesis of no change. In the second project we explore a fundamental characterization of tail dependence and develop a method to classify data into the two regimes. Classifying a data set as asymptotically dependent (AD) or asymptotically independent (AI) is a necessary early choice in the modeling of multivariate extremes. These two dependence regimes are defined asymptotically, which complicates inference because practitioners have finite samples. We perform a series of experiments to determine whether a finite sample has enough information for a convolutional neural network to reliably distinguish between these regimes in the bivariate case. Along the way we develop a new classification tool for practitioners which we call nnadic as it is a Neural Network for Asymptotic Dependence/Independence Classification. This tool accurately classifies 95% of test datasets and is robust to a wide range of sample sizes. The datasets which we are unable to correctly classify tend to either be nearly exactly independent or exhibit near perfect dependence, which are boundary cases for both the AD and AI models used for training. In the third project we consider the challenge of using likelihood methods for models developed for the tail of the distribution. Many multivariate extremes models have intractable likelihoods; thus practitioners must use alternative fitting methods, and likelihood-based methods for uncertainty quantification and model selection are unavailable. We develop a proxy-likelihood estimator for multivariate extremes models. Our method is based on the tail pairwise dependence (TPD), which is a summary measure of the dependence in the tail of any multivariate extremes model. The TPD parameter has a one-to-one relationship with the dependence parameter of the Hüsler-Reiss (HR) distribution. 
We use the HR distribution as a proxy for the likelihood in a composite likelihood approach. The method is demonstrated using the transformed linear extremes time series (TLETS) models of Mhatre & Cooley (2024).
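    The tail pairwise dependence summary used in this project is specific to the regular-variation framework, but the basic notion of tail dependence it measures can be illustrated with the familiar empirical coefficient $\chi(u) = P\{F_Y(Y) > u \mid F_X(X) > u\}$; the sketch below is that generic estimator (illustrative names and thresholds), not the TPD estimator of the dissertation.

```python
import numpy as np

def empirical_chi(x, y, u=0.95):
    """Empirical chi(u): chance both series are extreme, given one is."""
    n = len(x)
    rx = (np.argsort(np.argsort(x)) + 1) / (n + 1.0)   # pseudo-uniform scores
    ry = (np.argsort(np.argsort(y)) + 1) / (n + 1.0)
    joint = np.mean((rx > u) & (ry > u))               # P(both above u)
    return joint / (1.0 - u)                           # divide by P(one above u)

rng = np.random.default_rng(2)
z = rng.normal(size=5000)
x = z + rng.normal(size=5000)                          # a dependent pair
y = z + rng.normal(size=5000)
print(empirical_chi(x, y, u=0.95))
```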
  • Item (Open Access)
    Multi-channel factor analysis: properties, extensions, and applications
    (Colorado State University. Libraries, 2024) Stanton, Gray, author; Wang, Haonan, advisor; Scharf, Louis, advisor; Kokoszka, Piotr, committee member; Wang, Tianying, committee member; Luo, Jie, committee member
    Multi-channel Factor Analysis (MFA) extends factor analysis to the multi-channel or multi-view setting, where latent common factors influence all channels while distinct factors are specific to individual channels. The within- and across-channel covariance is determined by a low-rank matrix, a block-diagonal matrix with low-rank blocks, and a diagonal matrix, which provides a parsimonious model for both covariances. MFA and related multi-channel methods for data fusion are discussed in Chapter 1. Under conditions on the channel sizes and factor numbers, the results of Chapter 2 show that the generic global identifiability of the aforementioned covariance matrices can be guaranteed a priori, and the estimators obtained by maximizing a Gaussian likelihood are shown to be consistent and asymptotically normal even under misspecification. To handle temporal correlation in the latent factors, Chapter 3 introduces Multi-channel Factor Spectral Analysis (MFSA). Results for the identifiability and parameterization properties of the MFSA spectral density model are derived, and a Majorization-Minimization procedure to optimize the Whittle pseudo-likelihood is designed to estimate the MFSA parameters. A simulation study is conducted to explore how temporal correlations in the latent factors affect estimation, and it is demonstrated that MFSA significantly outperforms MFA when the factor series are highly autocorrelated. In Chapter 4, a locally stationary joint multivariate Gaussian process with MFA-type cross-sectional covariance is developed to model multi-vehicle trajectories in a highway environment. A dynamic model-based clustering procedure is designed to partition cohorts of nearby vehicles into pods based on the stability of the intra-pod relative vehicle configuration. The performance of this procedure is illustrated by its application to the Next GENeration SIMulation dataset of vehicle trajectories on U.S. Highway 101.
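    In generic notation (assumed here for illustration; the dissertation's symbols may differ), the MFA structure described above, with common factors $f$ and channel-specific factors $g_m$ for channels $m = 1, \ldots, M$, can be sketched as

\[
x_m = A_m f + B_m g_m + \varepsilon_m,
\qquad
\operatorname{Cov}\begin{pmatrix} x_1 \\ \vdots \\ x_M \end{pmatrix}
= A A^{\top} + \operatorname{blkdiag}\bigl(B_1 B_1^{\top}, \ldots, B_M B_M^{\top}\bigr) + D,
\]

    where $A = (A_1^{\top}, \ldots, A_M^{\top})^{\top}$ is low rank, each $B_m B_m^{\top}$ is a low-rank block, and $D$ is diagonal, matching the three-term covariance decomposition in the abstract.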
  • Item (Open Access)
    A novel approach to statistical problems without identifiability
    (Colorado State University. Libraries, 2024) Adams, Addison D., author; Wang, Haonan, advisor; Zhou, Tianjian, advisor; Kokoszka, Piotr, committee member; Shaby, Ben, committee member; Ray, Indrakshi, committee member
    In this dissertation, we propose novel approaches to random coefficient regression (RCR) and the recovery of mixing distributions under nonidentifiable scenarios. The RCR model is an extension of the classical linear regression model that accounts for individual variation by treating the regression coefficients as random variables. A major interest lies in the estimation of the joint probability distribution of these random coefficients based on the observable samples of the outcome variable evaluated for different values of the explanatory variables. In Chapter 2, we consider fixed-design RCR models, under which the coefficient distribution is not identifiable. To tackle the challenges of nonidentifiability, we consider an equivalence class, in which each element is a plausible coefficient distribution that, for each value of the explanatory variables, yields the same distribution for the outcome variable. In particular, we formulate the approximations of the coefficient distributions as a collection of stochastic inverse problems, allowing for a more flexible nonparametric approach with minimal assumptions. An iterative approach is proposed to approximate the elements by incorporating an initial guess of a solution called the global ansatz. We further study its convergence and demonstrate its performance through simulation studies. The proposed approach is applied to a real data set from an acupuncture clinical trial. In Chapter 3, we consider the problem of recovering a mixing distribution, given a component distribution family and observations from a compound distribution. Most existing methods are restricted in scope in that they are developed for certain component distribution families or continuity structures of mixing distributions. We propose a new, flexible nonparametric approach with minimal assumptions. Our proposed method iteratively steps closer to the desired mixing distribution, starting from a user-specified distribution, and we further establish its convergence properties. Simulation studies are conducted to examine the performance of our proposed method. In addition, we demonstrate the utility of our proposed method through its application to two sets of real-world data, including prostate cancer data and Shakespeare's canon word count.
  • Item (Open Access)
    Bayesian tree based methods for longitudinally assessed environmental mixtures
    (Colorado State University. Libraries, 2024) Im, Seongwon, author; Wilson, Ander, advisor; Keller, Kayleigh, committee member; Koslovsky, Matt, committee member; Neophytou, Andreas, committee member
    In various fields, there is interest in estimating the lagged association between an exposure and an outcome. This is particularly common in environmental health studies, where exposure to an environmental chemical is measured repeatedly during gestation for the assessment of its lagged effects on a birth outcome. The relationship between longitudinally assessed environmental mixtures and a health outcome is also of growing interest. For a single exposure, a distributed lag model (DLM) is a widely used method that provides an appropriate temporal structure for estimating the time-varying effects. For mixture exposures, a distributed lag mixture model is used to address the main effect of each exposure and lagged interactions among exposures. The main inferential goals include estimating the lag-specific effects and identifying a window of susceptibility, during which a fetus is particularly vulnerable. In this dissertation, we propose novel statistical methods for estimating exposure effects of longitudinally assessed environmental mixtures in various scenarios. First, we propose a method that can estimate a linear exposure-time-response function between mixture exposures and a count outcome that may be zero-inflated and overdispersed. To achieve this, we employ Bayesian Pólya-Gamma data augmentation within a treed distributed lag mixture model framework. We apply the method to estimate the relationship between weekly average fine particulate matter (PM2.5) and temperature and pregnancy loss, using a live-birth-identified conception time series design with administrative data from Colorado. Second, we propose a tree triplet structure to allow for heterogeneity in exposure effects in an environmental mixture exposure setting. Our method accommodates modifier and exposure selection, which allows for personalized and subgroup-specific effect estimation and identification of windows of susceptibility. We apply the method to Colorado administrative birth data to examine the heterogeneous relationship between PM2.5 and temperature and birth weight. Finally, we introduce an R package, dlmtree, that integrates tree-structured DLM methods into convenient software. We provide an overview of the embedded tree-structured DLMs and use simulated data to demonstrate a model fitting process, statistical inference, and visualization.
  • Item (Open Access)
    Advances in Bayesian spatial statistics for ecology and environmental science
    (Colorado State University. Libraries, 2024) Wright, Wilson J., author; Hooten, Mevin B., advisor; Cooley, Daniel S., advisor; Keller, Kayleigh P., committee member; Kaplan, Andee, committee member; Ross, Matthew R. V., committee member
    In this dissertation, I develop new Bayesian methods for analyzing spatial data from applications in ecology and environmental science. In particular, I focus on methods for mechanistic spatial models and binary spatial processes. I first consider the distribution of heavy metal pollution from a mining road in Cape Krusenstern, Alaska, USA. I develop a mechanistic spatial model that uses the physical process of atmospheric dispersion to characterize the spatial structure in these data. This approach directly incorporates scientific knowledge about how pollutants spread and provides inferences about this process. To assess how the heavy metal pollution impacts the vegetation community in Cape Krusenstern, I also develop a new model that represents plant cover for multiple species using clipped Gaussian processes. This approach is applicable to multiscale and multivariate binary processes that are observed at point locations — including multispecies plant cover data collected using the point intercept method. By directly analyzing the point-level data, instead of aggregating observations to the plot-level, this model allows for inferences about both large-scale and small-scale spatial dependence in plant cover. Additionally, it also incorporates dependence among different species at the small spatial scale. The third model I develop is motivated by ecological studies of wildlife occupancy. Similar to plant cover, species occurrence can be modeled as a binary spatial process. However, occupancy data are inherently measured at areal survey units. I develop a continuous-space occupancy model that accounts for the change of spatial support between the occurrence process and the observed data. All of these models are implemented using Bayesian methods and I present computationally efficient methods for fitting them. This includes a new surrogate data slice sampler for implementing models with latent nearest neighbor Gaussian processes.
  • Item (Open Access)
    Applications of least squares penalized spline density estimator
    (Colorado State University. Libraries, 2024) Jing, Hanxiao, author; Meyer, Mary, advisor; Cooley, Daniel, committee member; Kokoszka, Piotr, committee member; Berger, Joshua, committee member
    The spline-based method stands as one of the most common nonparametric approaches. The work in this dissertation explores three applications of the least squares penalized spline density estimator. Firstly, we present a novel hypothesis test against the unimodality of density functions, based on unimodal and bimodal estimates of the density function, using penalized splines. The test statistic is the difference in the least-squares criterion between these fits. The distribution of the test statistic under the null hypothesis is estimated via simulated data sets from the unimodal fit. Large sample theory is derived and simulation studies are conducted to compare its performance with other common methods across various scenarios, alongside a real-world application involving neurotransmission data from guinea pig brains. Secondly, we tackle the deconvolution density estimation problem, introducing the penalized splines deconvolution estimator. Building upon the results gained from piecewise constant splines, we achieve a cube-root convergence rate for piecewise quadratic splines and uniform errors. Moreover, we derive large sample theories for the penalized spline estimator and the constrained spline estimator. Simulation studies illustrate the competitive performance of our estimators compared to the kernel estimators across diverse scenarios. Lastly, drawing inspiration from the preceding applications, we develop a hypothesis test to discern whether the underlying density is unimodal or multimodal, given data with measurement error. Under the assumption of uniform errors, we introduce the test and derive the test statistic. Simulations are conducted to show the performance of the proposed test under different conditions.
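    In symbols (notation assumed here for illustration), the unimodality test described above compares constrained penalized least-squares fits of the density,

\[
T = Q\bigl(\hat{f}_{\mathrm{uni}}\bigr) - Q\bigl(\hat{f}_{\mathrm{bi}}\bigr) \ge 0,
\]

    where $Q(\cdot)$ is the penalized least-squares criterion and $\hat{f}_{\mathrm{uni}}$, $\hat{f}_{\mathrm{bi}}$ are the unimodal and bimodal penalized spline fits; large values of $T$ are evidence against unimodality, and the null distribution of $T$ is approximated by refitting on data simulated from $\hat{f}_{\mathrm{uni}}$.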
  • Item (Open Access)
    Population size estimation using the modified Horvitz-Thompson estimator with estimated sighting probability
    (Colorado State University. Libraries, 1996) Wong, Char-Ngan, author; Bowden, David C., advisor
    Wildlife aerial population surveys usually use a two-stage sampling technique. The first stage involves dividing the whole survey area into smaller land units, which we call the primary units, and then taking a sample from those. In the second stage, an aerial survey of the selected units is made in an attempt to observe (count) every animal. Some animals, usually occurring in groups, are not observed for a variety of reasons. Estimates from these surveys are plagued with two major sources of errors, namely, errors due to sampling variation in both stages. The first error may be controlled by choosing a suitable sampling plan for the first stage. The second error is also termed "visibility bias", which acknowledges that only a portion of the groups in a sampled land unit will be enumerated. The objective of our study is to provide improved variance estimators over those provided by Steinhorst and Samuel (1989) and to evaluate performances of various corresponding interval procedures for estimating population size. For this purpose, we have found an asymptotically unbiased estimator for the approximate variance of the population size estimator when sighting probabilities of groups are unknown and fitted with a logistic model. We have broken down the approximate variance term into three components, namely, error due to sampling of primary units, error due to sighting of groups in second-stage sampling, and error due to estimation of the sighting probabilities, and we consider the three components separately in order to get better insight into error control. Simplified versions of variance estimators are provided when all primary units are surveyed and for stratified random sampling of primary units. The third central moment of the population size estimator was also obtained. Simulation studies were conducted to evaluate performances of our asymptotically unbiased variance estimators and of confidence interval procedures such as the large sample procedure, with and without transformation, for constructing 90% and 95% confidence intervals for the population size. Confidence intervals for the population size were also constructed by assuming that log(T̂ - T) is normally distributed, where T̂ is the population size estimate and T is the number of animals seen in a sample obtained from a population survey. From our simulation results, we observed that the population size is estimated with negligible bias (according to Cochran's (1977) working rule) with a sample of at least 100 groups of elk obtained from a population survey when sighting probabilities are known. When sighting probabilities are unknown, one needs to conduct a sightability survey to obtain a sample, independent of the sample obtained from a population survey, for fitting a logistic model to estimate sighting probabilities of sighted groups in the sample obtained from the population survey. In this case, the population size is also estimated with negligible bias when the sample size of both samples is at least 100 groups of elk. We also observed that when sighting probabilities are known, we needed a sample of at least 348 groups of elk from a population survey to obtain reasonable coverage rates of the true population size. When sighting probabilities are unknown and estimated via logistic regression, the size of both samples needs to be at least 428 groups of elk to obtain reasonable coverage rates of the true population size. 
    Among all these confidence intervals, we found that the approximate confidence intervals constructed based on the assumption that log(T̂ - T) is normally distributed and using the delta method have better coverage rates and shorter estimated expected interval widths. Confidence intervals for the population size using bootstrapping were also evaluated. We were unable to find an existing bootstrapping procedure which could be directly applied to our problem. We have, therefore, proposed a couple of bootstrapping procedures for obtaining a sample to fit a logistic model and a couple of bootstrapping procedures for obtaining a sample to construct a population size estimate. With 1000 pairs of independent samples from a sightability survey and a population survey, each sample consisting of 107 groups of elk, and using 500 bootstrap iterations, we obtained reasonable coverage rates of the true population size. Our other problem is model selection of a logistic model for the unknown sighting probabilities. We evaluated the performance of the population size estimator and our variance estimator when we fit a simpler model. For this purpose, we have derived theoretical expressions for the bias of the population size estimator and the mean squared error. We found, from our simulation results of fitting a couple of models simpler than the full model, that the population size was still well estimated for the fitted model based only on group size but was severely overestimated for the fitted model based only on percent of vegetation cover. For both fitted models, our variance estimator overestimated the observed variance of 1000 simulated population size estimates. We also found that the approximate expression of the expected value of the population size estimator we derived for a fitted model simpler than the full model has negligible bias (by Cochran's (1977) working rule) relative to the average of those 1000 simulated population size estimates. The approximate expression of the variance of the population size estimator we derived for this case somewhat underestimated the observed variance of those 1000 simulated population size estimates. Both approximate expressions apparently give us an idea of the expected size of the population size estimate and its variance when the fitted model is not the full model.
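    For orientation, a Horvitz-Thompson-type estimator of the kind studied here (generic notation assumed; stratification and other details are suppressed) corrects each observed group for both the chance of selecting its primary unit and its estimated sighting probability:

\[
\hat{T} = \sum_{i \in s} \frac{1}{\pi_i} \sum_{j=1}^{m_i} \frac{y_{ij}}{\hat{p}_{ij}},
\]

    where $\pi_i$ is the inclusion probability of sampled primary unit $i$, $y_{ij}$ is the size of the $j$th sighted group in that unit, and $\hat{p}_{ij}$ is its sighting probability estimated from the logistic model; the three variance components discussed above correspond to the sampling of primary units, the sighting of groups, and the estimation of the $\hat{p}_{ij}$.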
  • Item (Embargo)
    Functional methods in outlier detection and concurrent regression
    (Colorado State University. Libraries, 2024) Creutzinger, Michael L., author; Cooley, Daniel, advisor; Sharp, Julia L., advisor; Koslovsky, Matt, committee member; Liebl, Dominik, committee member; Ortega, Francisco, committee member
    Functional data are data collected on a curve, or surface, over a continuum. The growing presence of high-resolution data has greatly increased the popularity of using and developing methods in functional data analysis (FDA). Functional data may be defined differently from other data structures, but similar ideas apply for these types of data including data exploration, modeling and inference, and post-hoc analyses. The methods presented in this dissertation provide a statistical framework that allows a researcher to carry out an analysis of functional data from "start to finish". Even with functional data, there is a need to identify outliers prior to conducting statistical analysis procedures. Existing functional data outlier detection methodology requires the use of a functional data depth measure, functional principal components, and/or an outlyingness measure like Stahel-Donoho. Although effective, these functional outlier detection methods may not be easily interpreted. In this dissertation, we propose two new functional outlier detection methods. The first method, Practical Outlier Detection (POD), makes use of ordinary summary statistics (e.g., minimum, maximum, mean, variance, etc.). In the second method, we developed a Prediction Band Outlier Detection (PBOD) method that makes use of parametric, simultaneous prediction bands that meet nominal coverage levels. The two new outlier detection methods were compared to three existing outlier detection methods: MS-Plot, Massive Unsupervised Outlier Detection, and Total Variation Depth. In the simulation results, POD performs as well as, or better than, its counterparts in terms of specificity, sensitivity, accuracy, and precision. Similar results were found for PBOD, except for noticeably smaller values of specificity and accuracy than all other methods. Following data exploration and outlier detection, researchers often model their data. In FDA, functional linear regression uses a functional response Y_i(t) and scalar and/or functional predictors, X_i(t). A functional concurrent regression model is estimated by regressing Y_i on X_i pointwise at each sampling point, t. After estimating a regression model (functional or non-functional), it is common to estimate confidence and prediction intervals for parameter(s), including the conditional mean. A common way to obtain confidence/prediction intervals for simultaneous inference across the sampling domain is to use resampling methods (e.g., bootstrapping or permutation). We propose a new method for estimating parametric, simultaneous confidence and prediction bands for a functional concurrent regression model, without the use of resampling. The method uses Kac-Rice formulas for estimation of a critical value function, which is used with a functional pivot to acquire the simultaneous band. In the results, the proposed method meets nominal coverage levels for both confidence and prediction bands. The method we propose is also substantially faster to compute than methods that require resampling techniques. In linear regression, researchers may also assess if there are influential observations that may impact the estimates and results from the fitted model. Studentized difference in fits (DFFITS), studentized difference in regression coefficient estimates (DFBETAS), and/or Cook's Distance (D) can all be used to identify influential observations. For functional concurrent regression, these measures can be easily computed pointwise for each observation. 
    However, the only current development is to use resampling techniques for estimating a null distribution of the average of each measure. Rather than using the average values and bootstrapping, we propose working with functional DFFITS (DFFITS(t)) directly. We show that if the functional errors are assumed to follow a Gaussian process, DFFITS(t) is distributed uniformly as a scaled Student's t process. Then, we propose using a multivariate Student's t distributional quantile for identifying influential functional observations with DFFITS(t). Our methodology ("Theoretical") is compared against a competing method that uses a parametric bootstrapping technique ("Bootstrapped") for estimating the null distribution of the mean absolute value of DFFITS(t). In the simulation and case study results, we find that the Theoretical method greatly reduces the computation time relative to the Bootstrapped method, without much loss in performance as measured by accuracy (ACC), precision (PPV), and Matthews Correlation Coefficient (MCC). Furthermore, the average sensitivity of the Theoretical method is higher in all scenarios than that of the Bootstrapped method.
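    As a reminder of the quantity being extended (scalar-regression notation assumed; the functional version applies it pointwise in $t$), DFFITS for observation $i$ is

\[
\mathrm{DFFITS}_i = \frac{\hat{y}_i - \hat{y}_{i(i)}}{s_{(i)} \sqrt{h_{ii}}},
\qquad\text{with the pointwise analogue}\qquad
\mathrm{DFFITS}_i(t) = \frac{\hat{Y}_i(t) - \hat{Y}_{i(i)}(t)}{s_{(i)}(t) \sqrt{h_{ii}(t)}},
\]

    where $\hat{y}_{i(i)}$ is the fitted value at $i$ with observation $i$ deleted, $s_{(i)}$ is the deletion estimate of the error scale, and $h_{ii}$ is the leverage.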
  • Item (Open Access)
    Statistical modeling and inferences on directed networks
    (Colorado State University. Libraries, 2024) Du, Wenqin, author; Zhou, Wen, advisor; Breidt, F. Jay, committee member; Meyer, Mary, committee member; Pezeshki, Ali, committee member
    Network data have received great attention for elucidating comprehensive insights into node interactions and underlying network dynamics. This dissertation contributes new modeling tools and inference procedures to the field of network analysis, incorporating the dependence structure inherently introduced by the network data. Our first direction centers on modeling directed edges with count measurements, an area that has received limited attention in the literature. Most existing methods either assume the count edges are derived from continuous random variables or model the edge dependence by parametric distributions. In this dissertation, we develop a latent multiplicative Poisson model for directed networks with count edges. Our approach directly models the edge dependence of count data by the pairwise dependence of latent errors, which are assumed to be weakly exchangeable. This assumption not only covers a variety of common network effects, but also leads to a concise representation of the error covariance. In addition, identification and inference of the mean structure, as well as the regression coefficients, depend on the errors only through their covariance, which provides substantial flexibility for our model. We propose a pseudo-likelihood based estimator for the regression coefficients that enjoys consistency and asymptotic normality. We evaluate our method by extensive numerical studies that corroborate the theory and apply our model to food sharing network data to reveal interesting network effects that are further verified in the literature. In the second project, we study the inference procedure of network dependence structures. While much research has targeted network-covariate associations and community detection, the inference of important network effects such as reciprocity and sender-receiver effects has been largely overlooked. Testing network effects for network data or weighted directed networks is challenging due to the intricate potential edge dependence. Most existing methods are model-based, carrying strong assumptions with restricted applicability. In contrast, we present a novel, fully nonparametric framework that requires only minimal regularity assumptions. While inspired by recent developments in the U-statistic literature, our work significantly broadens their scope. Specifically, we identify and carefully address the indeterminate degeneracy inherent in network effect estimators, a challenge that the aforementioned tools do not handle. We establish a Berry-Esseen type bound for the accuracy of type-I error rate control, as well as a novel analysis showing the minimax optimality of our test's power. Simulations highlight the superiority of our method in computation speed, accuracy, and numerical robustness relative to benchmarks. To showcase the practicality of our methods, we apply them to two real-world relationship networks, one in faculty hiring networks and the other in international trade networks. Finally, this dissertation introduces modeling strategies and corresponding methods for discerning the core-periphery (CP) structure in weighted directed networks. We adopt the signal-plus-noise model, categorizing uniform relational patterns as non-informative, by which we define the sender and receiver peripheries. Furthermore, instead of confining the core component to a specific structure, we consider it complementary to either the sender or receiver peripheries. 
    Based on our definitions of the sender and receiver peripheries, we propose spectral algorithms to identify the CP structure in weighted directed networks. Our algorithms come with statistical guarantees, ensuring the identification of sender and receiver peripheries with overwhelming probability. Additionally, our methods scale effectively for expansive directed networks. We evaluate the proposed methods in extensive simulation studies and apply them to faculty hiring network data, revealing captivating insights into the informative and non-informative sender/receiver behaviors.
  • Item (Open Access)
    Test of change point versus long-range dependence in functional time series
    (Colorado State University. Libraries, 2024) Meng, Xiangdong, author; Kokoszka, Piotr S., advisor; Cooley, Dan, committee member; Wang, Haonan, committee member; Miao, Hong, committee member
    In scalar time series analysis, a long-range dependent (LRD) series cannot be easily distinguished from certain non-stationary models, such as the change in mean model with short-range dependent (SRD) errors. To be specific, realizations of an LRD series usually exhibit a changing local mean if the time span taken into account is long enough, which resembles the behavior of change in mean models. Test procedures for distinguishing between these two types of model have been investigated extensively in the scalar case; see, e.g., Berkes et al. (2006), Baek and Pipiras (2012), and references therein. However, no analogous test for functional observations has been developed yet, partly because methods and theory for analyzing functional time series with long-range dependence are lacking. My dissertation establishes a procedure for testing change in mean models with SRD errors against LRD processes in the functional case, which is an extension of the method of Baek and Pipiras (2012). The test builds on the local Whittle (LW) (or Gaussian semiparametric) estimation of the self-similarity parameter, which is based on the estimated level 1 scores of a suitable functional residual process. Remarkably, unlike other parametric methods such as Whittle estimation, whose asymptotic properties heavily depend on the validity of the underlying spectral density on the full frequency range (−π, π], LW estimation imposes mild restrictions on the spectral density only near the origin and is thus more robust to model misspecification. We prove that the test statistic based on LW estimation is asymptotically normally distributed under the null hypothesis and that it diverges to infinity under the LRD alternative.
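    For reference, the local Whittle estimator underlying the test (standard notation from the scalar literature; the dissertation applies it to the estimated level 1 score series) chooses the memory parameter $d$ to minimize

\[
R(d) = \log\Biggl(\frac{1}{m}\sum_{j=1}^{m} \lambda_j^{\,2d}\, I(\lambda_j)\Biggr) - \frac{2d}{m}\sum_{j=1}^{m}\log \lambda_j,
\qquad \lambda_j = \frac{2\pi j}{n},
\]

    where $I(\lambda_j)$ is the periodogram of the score series at the Fourier frequencies and $m$ is a bandwidth that grows slowly with $n$; only the behavior of the spectral density near the origin enters, which is the robustness property noted above.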
  • Item (Open Access)
    Estimation for Lévy-driven CARMA processes
    (Colorado State University. Libraries, 2008) Yang, Yu, author; Brockwell, Peter J., advisor; Davis, Richard A., advisor
    This thesis explores parameter estimation for Lévy-driven continuous-time autoregressive moving average (CARMA) processes, using uniformly and closely spaced discrete-time observations. Specifically, we focus on developing estimation techniques and asymptotic properties of the estimators for three particular families of Lévy-driven CARMA processes. Estimation for the first family, Gaussian autoregressive processes, is developed by deriving exact conditional maximum likelihood estimators of the parameters under the assumption that the process is observed continuously. The resulting estimates are expressed in terms of stochastic integrals which are then approximated using the available closely-spaced discrete-time observations. We apply the results to both linear and non-linear autoregressive processes. For the second family, non-negative Lévy-driven Ornstein-Uhlenbeck processes, we take advantage of the non-negativity of the increments of the driving Lévy process to derive a highly efficient estimation procedure for the autoregressive coefficient when observations are available at uniformly spaced times. Asymptotic properties of the estimator are also studied and a procedure for obtaining estimates of the increments of the driving Lévy process is developed. These estimated increments are important for identifying the nature of the driving Lévy process and for estimating its parameters. For the third family, non-negative Lévy-driven CARMA processes, we estimate the coefficients by maximizing the Gaussian likelihood of the observations and discuss the asymptotic properties of the estimators. We again show how to estimate the increments of the background driving Lévy process and hence to estimate the parameters of the Lévy process itself. We assess the performance of our estimation procedures by simulations and use them to fit models to real data sets in order to determine how the theory applies in practice.
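    For readers new to the model class, a CARMA(p, q) process driven by a Lévy process L is defined, in standard notation (assumed here), as the stationary solution of the formal differential equation

\[
a(D)\,Y(t) = b(D)\,D L(t),
\qquad
a(z) = z^{p} + a_1 z^{p-1} + \cdots + a_p,
\qquad
b(z) = b_0 + b_1 z + \cdots + b_q z^{q},
\]

    with $D$ denoting differentiation with respect to $t$ and $q < p$; in practice the equation is interpreted through an equivalent state-space representation.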
  • Item (Open Access)
    Spatial models with applications in computer experiments
    (Colorado State University. Libraries, 2008) Wang, Ke, author; Davis, Richard A., advisor; Breidt, F. Jay, advisor
    Often, a deterministic computer response is modeled as a realization from a stochastic process such as a Gaussian random field. Due to the limitation of stationary Gaussian processes (GPs) in capturing inhomogeneous smoothness, we consider modeling a deterministic computer response as a realization from a stochastic heteroskedastic process (SHP), a stationary non-Gaussian process. Conditional on a latent process, the SHP has a non-stationary covariance function and is a non-stationary GP. As such, the sample paths of this process exhibit greater variability and hence offer more modeling flexibility than those produced by a traditional GP model. We use maximum likelihood for inference in the SHP model, which is complicated by the high dimensionality of the latent process. Accordingly, we develop an importance sampling method for likelihood computation and use a low-rank kriging approximation to reconstruct the latent process. Responses at unobserved locations can be predicted using empirical best predictors or by empirical best linear unbiased predictors. In addition, prediction error variances are obtained. The SHP model can be used in an active learning context, adaptively selecting new locations that provide improved estimates of the response surface. Estimation, prediction, and adaptive sampling with the SHP model are illustrated with several examples. Our spatial model can be adapted to model the first partial derivative process. The derivative process provides additional information about the shape and smoothness of the underlying deterministic function and can assist in the prediction of responses at unobserved sites. The unconditional correlation function for the derivative process presents some interesting properties, and can be used as a new class of spatial correlation functions. For parameter estimation, we propose to use a similar strategy to develop an importance sampling technique to compute the joint likelihood of responses and derivatives. The major difficulties of bringing in derivative information are the increase in the dimensionality of the latent process and the numerical problems of inverting the enlarged covariance matrix. Some possible ways to utilize this information more efficiently are proposed.
  • Item (Open Access)
    Data mining techniques for temporal point processes applied to insurance claims data
    (Colorado State University. Libraries, 2008) Iverson, Todd Ashley, author; Ben-Hur, Asa, advisor; Iyer, Hariharan K., advisor
    We explore data mining on databases consisting of insurance claims information. This dissertation focuses on two major topics we considered by way of data mining procedures. One is the development of a classification rule using kernels and support vector machines. The other is the discovery of association rules using the Apriori algorithm, its extensions, as well as a new association rules technique. With regard to the first topic, we address the question: can kernel methods using an SVM classifier be used to predict patients at risk of type 2 diabetes using three years of insurance claims data? We report the results of a study in which we tested the performance of new methods for data extracted from the MarketScan® database. We summarize the results of applying popular kernels, as well as new kernels constructed specifically for this task, for support vector machines on data derived from this database. We were able to predict patients at risk of type 2 diabetes with nearly 80% success when combining a number of specialized kernels. The specific form of the data, that of a timed sequence, led us to develop two new kernels inspired by dynamic time warping. The Global Time Warping (GTW) and Local Time Warping (LTW) kernels build on an existing time warping kernel by including the timing coefficients present in classical time warping, while providing a solution for the diagonal dominance present in most alignment methods. We show that the LTW kernel performs significantly better than the existing time warping kernel when the times contain relevant information. With regard to the second topic, we provide a new theorem on closed rules that could help substantially improve the time to find a specific type of rule. An insurance claims database contains codes indicating associated diagnoses and the resulting procedures for each claim. The rules that we consider are of the form "diagnoses imply procedures". In addition, we introduce a new class of interesting association rules in the context of medical claims databases and illustrate their potential uses by extracting example rules from the MarketScan® database.
  • Item (Open Access)
    Spatial processes with stochastic heteroscedasticity
    (Colorado State University. Libraries, 2008) Huang, Wenying, author; Breidt, F. Jay, advisor; Davis, Richard A., advisor
    Stationary Gaussian processes are widely used in spatial data modeling and analysis. Stationarity is a relatively restrictive assumption regarding spatial association. By introducing stochastic volatility into a Gaussian process, we propose a stochastic heteroscedastic process (SHP) with conditional nonstationarity. That is, conditional on a latent Gaussian process, the SHP is a Gaussian process with non-stationary covariance structure. Unconditionally, the SHP is a stationary non-Gaussian process. The realizations from the SHP are versatile and can represent spatial inhomogeneities. The unconditional correlation of the SHP offers a rich class of correlation functions which can also allow for a smoothed nugget effect. For maximum likelihood estimation, we propose to apply importance sampling in the likelihood calculation and latent process estimation. The importance density we construct is of the same dimensionality as the observations. When the sample size is large, the importance sampling scheme becomes infeasible and/or inaccurate. A low-dimensional approximation model is developed to solve the numerical difficulties. We develop two spatial prediction methods: PBP (plug-in best predictor) and PBLUP (plug-in best linear unbiased predictor). Empirical results with simulated and real data show improved out-of-sample prediction performance of SHP modeling over stationary Gaussian process modeling. We extend the single-realization model to an SHP model with replicates. The spatial replications are modeled as independent realizations from an SHP model conditional on a common latent process. A simulation study shows substantial improvements in parameter estimation and process prediction when replicates are available. In an example with real atmospheric deposition data, the SHP model with replicates outperforms the Gaussian process model in prediction by capturing the spatial volatilities.
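    One way to write a process with the properties described above (notation assumed for illustration; the dissertation's exact parameterization may differ) is

\[
Z(s) = \mu + \sigma\, \exp\bigl\{\alpha(s)/2\bigr\}\, \varepsilon(s),
\]

    where $\alpha(\cdot)$ is a latent stationary Gaussian process and $\varepsilon(\cdot)$ is an independent stationary Gaussian process: conditional on $\alpha$, $Z$ is Gaussian with non-stationary covariance, while unconditionally $Z$ is a stationary non-Gaussian process whose local variability is modulated by the latent volatility field.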