
Theses and Dissertations

Recent Submissions

  • Item (Open Access)
    Estimation for Lévy-driven CARMA processes
    (Colorado State University. Libraries, 2008) Yang, Yu, author; Brockwell, Peter J., advisor; Davis, Richard A., advisor
    This thesis explores parameter estimation for Lévy-driven continuous-time autoregressive moving average (CARMA) processes, using uniformly and closely spaced discrete-time observations. Specifically, we focus on developing estimation techniques and asymptotic properties of the estimators for three particular families of Lévy-driven CARMA processes. Estimation for the first family, Gaussian autoregressive processes, was developed by deriving exact conditional maximum likelihood estimators of the parameters under the assumption that the process is observed continuously. The resulting estimates are expressed in terms of stochastic integrals which are then approximated using the available closely spaced discrete-time observations. We apply the results to both linear and non-linear autoregressive processes. For the second family, non-negative Lévy-driven Ornstein-Uhlenbeck processes, we take advantage of the non-negativity of the increments of the driving Lévy process to derive a highly efficient estimation procedure for the autoregressive coefficient when observations are available at uniformly spaced times. Asymptotic properties of the estimator are also studied and a procedure for obtaining estimates of the increments of the driving Lévy process is developed. These estimated increments are important for identifying the nature of the driving Lévy process and for estimating its parameters. For the third family, non-negative Lévy-driven CARMA processes, we estimate the coefficients by maximizing the Gaussian likelihood of the observations and discuss the asymptotic properties of the estimators. We again show how to estimate the increments of the background driving Lévy process and hence to estimate the parameters of the Lévy process itself. We assess the performance of our estimation procedures by simulations and use them to fit models to real data sets in order to determine how the theory applies in practice.
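As a toy illustration of the idea behind the second family's estimator (a simplified sketch, not the thesis's exact procedure), the snippet below simulates a discretized non-negative Lévy-driven Ornstein-Uhlenbeck recursion and exploits the non-negativity of the driving increments: since X((n+1)h) ≥ e^(-ah) X(nh), the minimum successive ratio of observations estimates e^(-ah). All constants (rate, jump sizes, grid spacing) are invented.

```python
import numpy as np

rng = np.random.default_rng(42)
a_true, h, n = 1.0, 0.01, 20_000

# Discretized recursion for a non-negative Levy-driven OU process:
# between observations the process decays by exp(-a*h); jumps (here
# compound Poisson with exponential sizes) only add to it.
x = np.empty(n)
x[0] = 1.0
for i in range(1, n):
    jump = rng.exponential(0.5) if rng.random() < 1.0 * h else 0.0
    x[i] = np.exp(-a_true * h) * x[i - 1] + jump

# Non-negative increments imply x[i] >= exp(-a*h) * x[i-1], so the
# minimum successive ratio estimates exp(-a*h).
ratios = x[1:] / x[:-1]
a_hat = -np.log(ratios.min()) / h
```

On steps with no jump the ratio equals e^(-ah) exactly, so with many observations the estimator recovers the autoregressive coefficient essentially without error in this toy model.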
  • Item (Open Access)
    Spatial models with applications in computer experiments
    (Colorado State University. Libraries, 2008) Wang, Ke, author; Davis, Richard A., advisor; Breidt, F. Jay, advisor
    Often, a deterministic computer response is modeled as a realization from a stochastic process such as a Gaussian random field. Because a stationary Gaussian process (GP) cannot accommodate inhomogeneous smoothness, we consider modeling a deterministic computer response as a realization from a stochastic heteroskedastic process (SHP), a stationary non-Gaussian process. Conditional on a latent process, the SHP has a non-stationary covariance function and is a non-stationary GP. As such, the sample paths of this process exhibit greater variability and hence offer more modeling flexibility than those produced by a traditional GP model. We use maximum likelihood for inference in the SHP model, which is complicated by the high dimensionality of the latent process. Accordingly, we develop an importance sampling method for likelihood computation and use a low-rank kriging approximation to reconstruct the latent process. Responses at unobserved locations can be predicted using empirical best predictors or by empirical best linear unbiased predictors. In addition, prediction error variances are obtained. The SHP model can be used in an active learning context, adaptively selecting new locations that provide improved estimates of the response surface. Estimation, prediction, and adaptive sampling with the SHP model are illustrated with several examples. Our spatial model can be adapted to model the first partial derivative process. The derivative process provides additional information about the shape and smoothness of the underlying deterministic function and can assist in the prediction of responses at unobserved sites. The unconditional correlation function for the derivative process presents some interesting properties, and can be used as a new class of spatial correlation functions. For parameter estimation, we propose to use a similar strategy to develop an importance sampling technique to compute the joint likelihood of responses and derivatives.
The major difficulties of bringing in derivative information are the increase in the dimensionality of the latent process and the numerical problems of inverting the enlarged covariance matrix. Some possible ways to utilize this information more efficiently are proposed.
  • Item (Open Access)
    Data mining techniques for temporal point processes applied to insurance claims data
    (Colorado State University. Libraries, 2008) Iverson, Todd Ashley, author; Ben-Hur, Asa, advisor; Iyer, Hariharan K., advisor
    We explore data mining on databases consisting of insurance claims information. This dissertation focuses on two major topics we considered by way of data mining procedures. One is the development of a classification rule using kernels and support vector machines. The other is the discovery of association rules using the Apriori algorithm, its extensions, as well as a new association rules technique. With regard to the first topic, we address the question: can kernel methods using an SVM classifier be used to predict patients at risk of type 2 diabetes using three years of insurance claims data? We report the results of a study in which we tested the performance of new methods for data extracted from the MarketScan® database. We summarize the results of applying popular kernels, as well as new kernels constructed specifically for this task, for support vector machines on data derived from this database. We were able to predict patients at risk of type 2 diabetes with nearly 80% success when combining a number of specialized kernels. The specific form of the data, that of a timed sequence, led us to develop two new kernels inspired by dynamic time warping. The Global Time Warping (GTW) and Local Time Warping (LTW) kernels build on an existing time warping kernel by including the timing coefficients present in classical time warping, while providing a solution for the diagonal dominance present in most alignment methods. We show that the LTW kernel performs significantly better than the existing time warping kernel when the times contain relevant information. With regard to the second topic, we provide a new theorem on closed rules that could help substantially improve the time to find a specific type of rule. An insurance claims database contains codes indicating associated diagnoses and the resulting procedures for each claim. The rules that we consider are of the form diagnoses imply procedures.
In addition, we introduce a new class of interesting association rules in the context of medical claims databases and illustrate their potential uses by extracting example rules from the MarketScan® database.
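A minimal, self-contained sketch of the rule-mining setting described above — rules of the form diagnoses imply procedures — using a brute-force frequent-itemset pass in place of Apriori's candidate pruning. The claim records, codes (D* for diagnoses, P* for procedures), and support/confidence thresholds are all invented for illustration.

```python
from itertools import combinations

# Hypothetical claims: each claim is a set of diagnosis and procedure codes.
claims = [
    {"D1", "D2", "P1"},
    {"D1", "P1"},
    {"D2", "P2"},
    {"D1", "D2", "P1", "P2"},
    {"D1", "P1", "P2"},
]
min_support, min_conf = 0.4, 0.7

def support(itemset):
    """Fraction of claims containing every code in the itemset."""
    s = set(itemset)
    return sum(s <= c for c in claims) / len(claims)

# Frequent itemsets up to size 3 (exhaustive search stands in for Apriori).
items = sorted(set().union(*claims))
frequent = [fs for k in (1, 2, 3)
            for fs in combinations(items, k) if support(fs) >= min_support]

# Keep only rules of the form {diagnoses} -> {procedures}.
rules = []
for fs in frequent:
    lhs = tuple(i for i in fs if i.startswith("D"))
    rhs = tuple(i for i in fs if i.startswith("P"))
    if lhs and rhs:
        conf = support(fs) / support(lhs)
        if conf >= min_conf:
            rules.append((lhs, rhs, round(conf, 2)))
```

On this toy database the surviving rules are D1 → P1 and (D1, D2) → P1, each with confidence 1.0; real Apriori avoids the exhaustive enumeration by growing only supersets of frequent itemsets.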
  • Item (Open Access)
    Spatial processes with stochastic heteroscedasticity
    (Colorado State University. Libraries, 2008) Huang, Wenying, author; Breidt, F. Jay, advisor; Davis, Richard A., advisor
    Stationary Gaussian processes are widely used in spatial data modeling and analysis. Stationarity is a relatively restrictive assumption regarding spatial association. By introducing stochastic volatility into a Gaussian process, we propose a stochastic heteroscedastic process (SHP) with conditional nonstationarity. That is, conditional on a latent Gaussian process, the SHP is a Gaussian process with non-stationary covariance structure. Unconditionally, the SHP is a stationary non-Gaussian process. The realizations from SHP are versatile and can represent spatial inhomogeneities. The unconditional correlation of SHP offers a rich class of correlation functions which can also allow for a smoothed nugget effect. For maximum likelihood estimation, we propose to apply importance sampling in the likelihood calculation and latent process estimation. The importance density we constructed is of the same dimensionality as the observations. When the sample size is large, the importance sampling scheme becomes infeasible and/or inaccurate. A low-dimensional approximation model is developed to solve the numerical difficulties. We develop two spatial prediction methods: PBP (plug-in best predictor) and PBLUP (plug-in best linear unbiased predictor). Empirical results with simulated and real data show improved out-of-sample prediction performance of SHP modeling over stationary Gaussian process modeling. We extend the single-realization model to an SHP model with replicates. The spatial replications are modeled as independent realizations from an SHP model conditional on a common latent process. A simulation study shows substantial improvements in parameter estimation and process prediction when replicates are available. In an example with real atmospheric deposition data, the SHP model with replicates outperforms the Gaussian process model in prediction by capturing the spatial volatilities.
  • Item (Open Access)
    Estimation of structural breaks in nonstationary time series
    (Colorado State University. Libraries, 2008) Hancock, Stacey, author; Davis, Richard A., advisor; Iyer, Hari K., advisor
    Many time series exhibit structural breaks in a variety of ways, the most obvious being a mean level shift. In this case, the mean level of the process is constant over periods of time, jumping to different levels at times called change-points. These jumps may be due to outside influences such as changes in government policy or manufacturing regulations. Structural breaks may also be a result of changes in variability or changes in the spectrum of the process. The goal of this research is to estimate where these structural breaks occur and to provide a model for the data within each stationary segment. The program Auto-PARM (Automatic Piecewise AutoRegressive Modeling procedure), developed by Davis, Lee, and Rodriguez-Yam (2006), uses the minimum description length principle to estimate the number and locations of change-points in a time series by fitting autoregressive models to each segment. The research in this dissertation shows that when the true underlying model is segmented autoregressive, the estimates obtained by Auto-PARM are consistent. Under a more general time series model exhibiting structural breaks, Auto-PARM's estimates of the number and locations of change-points are again consistent, and the segmented autoregressive model provides a useful approximation to the true process. Weak consistency proofs are given, as well as simulation results when the true process is not autoregressive. An example of the application of Auto-PARM as well as a source of inspiration for this research is the analysis of National Park Service sound data. These data were collected by the National Park Service over four years in around twenty of the National Parks by setting recording devices at several sites throughout the parks. The goal of the project is to estimate the amount of manmade sound in the National Parks.
Though the project is in its initial stages, Auto-PARM provides a promising method for analyzing sound data by breaking the sound waves into pseudo-stationary pieces. Once the sound data have been broken into pieces, a classification technique can be applied to determine the type of sound in each segment.
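The minimum description length idea behind Auto-PARM can be sketched in a deliberately simplified form: a single change-point, least-squares AR(1) fits on each segment, and an exhaustive search in place of the genetic-algorithm optimization of Davis, Lee, and Rodriguez-Yam (2006). The simulation settings below are invented.

```python
import numpy as np

# Simulate a piecewise AR(1) series with one structural break at t = 300.
rng = np.random.default_rng(7)
n, tau_true = 600, 300
x = np.empty(n)
x[0] = 0.0
for t in range(1, n):
    phi = 0.9 if t < tau_true else -0.8
    x[t] = phi * x[t - 1] + rng.normal()

def ar1_rss(seg):
    """Residual sum of squares of a least-squares AR(1) fit."""
    y, z = seg[1:], seg[:-1]
    phi_hat = (z @ y) / (z @ z)
    return ((y - phi_hat * z) ** 2).sum()

def mdl(tau):
    """Two-part code length for a split at tau: a Gaussian AR(1) fit on
    each segment plus a fixed cost for the change-point and parameters."""
    cl = 0.0
    for seg in (x[:tau], x[tau:]):
        m = len(seg) - 1
        cl += 0.5 * m * np.log(ar1_rss(seg) / m)
    return cl + 3.0 * np.log(n)

tau_hat = min(range(30, n - 30), key=mdl)   # estimated change-point
```

Minimizing the code length over candidate splits recovers a change-point close to the true break; Auto-PARM generalizes this to an unknown number of segments with segment-specific AR orders.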
  • Item (Open Access)
    Confidence regions for level curves and a limit theorem for the maxima of Gaussian random fields
    (Colorado State University. Libraries, 2009) French, Joshua, author; Davis, Richard A., advisor
    One of the most common display tools used to represent spatial data is the contour plot. Informally, a contour plot is created by taking a "slice" of a three-dimensional surface at a certain level of the response variable and projecting the slice onto the two-dimensional coordinate plane. The "slice" at each level is known as a level curve.
  • Item (Open Access)
    Applications of generalized fiducial inference
    (Colorado State University. Libraries, 2009) E, Lidong, author; Iyer, Hariharan K., advisor
    Hannig (2008) generalized Fisher's fiducial argument and obtained a fiducial recipe for interval estimation that is applicable in virtually any situation. In this dissertation research, we apply this fiducial recipe and fiducial generalized pivotal quantity to make inference in four practical problems. The list of problems we consider is (a) confidence intervals for variance components in an unbalanced two-component normal mixed linear model; (b) confidence intervals for median lethal dose (LD50) in bioassay experiments; (c) confidence intervals for the concordance correlation coefficient (CCC) in method comparison; (d) simultaneous confidence intervals for ratios of means of Lognormal distributions. For all the fiducial generalized confidence intervals (a)-(d), we conducted a simulation study to evaluate their performance and compare them with other competing confidence interval procedures from the literature. We also proved that the intervals (a) and (d) have asymptotically exact frequentist coverage.
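The mechanics of a fiducial generalized pivotal quantity can be illustrated in the simplest textbook case, the normal mean — our own illustration, not one of problems (a)–(d) above. Here the fiducial distribution of μ is generated by R_μ = x̄ − T·s/√n with T ~ t(n−1), and its percentiles reproduce the classical t-interval; the sample itself is simulated.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(10.0, 2.0, size=25)       # invented sample
n, xbar, s = len(x), x.mean(), x.std(ddof=1)

# Fiducial recipe for mu in the N(mu, sigma^2) model: draw from the
# generalized pivotal quantity R_mu = xbar - T * s / sqrt(n), T ~ t_{n-1},
# and take percentiles of the draws as the interval.
t_draws = rng.standard_t(df=n - 1, size=200_000)
r_mu = xbar - t_draws * s / np.sqrt(n)
lo, hi = np.percentile(r_mu, [2.5, 97.5])
```

In this conjugate-like case the Monte Carlo fiducial interval matches x̄ ± t(0.975, n−1)·s/√n up to simulation error; the value of the recipe is that it applies just as mechanically in settings like variance components or the CCC, where no classical interval is available.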
  • Item (Open Access)
    Penalized estimation for sample surveys in the presence of auxiliary variables
    (Colorado State University. Libraries, 2008) Delorey, Mark J., author; Breidt, F. Jay, advisor
    In conducting sample surveys, time and financial resources can be limited but research questions are wide and varied. Thus, methods for analysis must make the best use of whatever data are available and produce results that address a variety of needs. Motivation for this research comes from surveys of aquatic resources, in which sample sizes are small to moderate, but auxiliary information is available to supplement measured survey responses. The problems of survey estimation are considered, tied together by their use of constrained/penalized estimation techniques for combining auxiliary information with the responses of interest. We study a small area problem with the goal of obtaining a good ensemble estimate, that is, a collection of estimates for individual small areas that collectively give a good estimate of the overall distribution function across small areas. Often, estimators that are good for one purpose may not be good for others. For example, estimation of the distribution function itself (as in Cordy and Thomas, 1997) can address questions of variability and extremes but does not provide individual estimators of the small areas, nor is it appropriate when auxiliary information can be put to use. Bayes estimators are good individual estimators in terms of mean squared error but are not variable enough to represent ensemble traits (Ghosh, 1992). An algorithm that extends the constrained Bayes (CB) methods of Louis (1984) and Ghosh (1992) for use in a model with a general covariance matrix is presented. This algorithm produces estimators with properties similar to those of CB, and we refer to this method as general constrained Bayes (GCB). The ensemble GCB estimator is asymptotically unbiased for the posterior mean of the empirical distribution function (edf).
The ensemble properties of transformed GCB estimates are investigated to determine if the desirable ensemble characteristics displayed by the GCB estimator are preserved under such transformations. The GCB algorithm is then applied to complex models such as conditional autoregressive spatial models and to penalized spline models. Illustrative examples include the estimation of lip cancer risk, mean water acidity, and rates of change in water acidity. We also study a moderate area problem in which the goal is to derive a set of survey weights that can be applied to each study variable with reasonable predictive results. Zheng and Little (2003) use penalized spline regression in a model-based approach for finite population estimation in a two-stage sample when predictor variables are available. Breidt et al. (2005) propose a class of model-assisted estimators based on penalized spline regression in single stage sampling. Because unbiasedness of the model-based estimator requires that the model be correctly specified, we look at extending model-assisted estimation to the two-stage case. By calibrating the degrees of freedom of the smooth to the most important study variables, a set of weights can be obtained that produce design consistent estimators for all study variables. The model-assisted estimator is compared to other estimators in a simulation study. Results from the simulation study show that the model-assisted estimator is comparable to other estimators when the model is correctly specified and generally superior when the model is incorrectly specified.
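The over-shrinkage problem that constrained Bayes methods address can be sketched in the classical independent-components case of Louis (1984) and Ghosh (1992) (the GCB extension to a general covariance matrix is not attempted here). Posterior means and variances below are made-up numbers.

```python
import numpy as np

# Hypothetical posterior summaries for k small areas: posterior means m_i
# and posterior variances v_i. Plain Bayes means are under-dispersed as an
# ensemble; constrained Bayes expands them about their average so the
# ensemble spread matches its posterior expectation.
m = np.array([1.2, 0.8, 1.9, 0.4, 1.6, 1.1])
v = np.array([0.30, 0.25, 0.40, 0.20, 0.35, 0.30])

mbar = m.mean()
spread = ((m - mbar) ** 2).mean()        # spread of the Bayes estimates
a = np.sqrt(1.0 + v.mean() / spread)     # expansion factor, always >= 1
cb = mbar + a * (m - mbar)               # constrained Bayes estimates
```

By construction the adjusted ensemble variance equals the Bayes spread plus the average posterior variance, restoring the variability that shrinkage removed while leaving the ensemble mean unchanged.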
  • Item (Open Access)
    State-space models for stream networks
    (Colorado State University. Libraries, 2007) Coar, William J., author; Breidt, F. Jay, advisor
    The natural branching that occurs in a stream network, in which two upstream reaches merge to create a new downstream reach, generates a tree structure. Furthermore, because of the natural flow of water in a stream network, characteristics of a downstream reach may depend on characteristics of upstream reaches. Since the flow of water from reach to reach provides a natural time-like ordering throughout the stream network, we propose a state-space model to describe the spatial dependence in this tree-like structure with ordering based on flow. Developing a state-space formulation permits the use of the well-known Kalman recursions. Variations of the Kalman Filter and Smoother are derived for the tree-structured state-space model, which allows recursive estimation of unobserved states and prediction of missing observations on the network, as well as computation of the Gaussian likelihood, even when the data are incomplete. To reduce the computational burden that may be associated with optimization of this exact likelihood, a version of the expectation-maximization (EM) algorithm is presented that uses the Kalman Smoother to fill in missing values in the E-step, and maximizes the Gaussian likelihood for the completed dataset in the M-step. Several forms of dependence for discrete processes on a stream network are considered, such as network analogues of the autoregressive-moving average model and stochastic trend models. Network parallels for first and second differences in time series are defined, which allow for definition of a spline smoother on a stream network through a special case of a local linear trend model. We have taken the approach of modeling a discrete process, which we see as a building block to more appropriate yet more complicated models. Adaptation of this state-space model and Kalman prediction equations to allow for more complicated forms of spatial and perhaps temporal dependence is a potential area of future research.
Other possible directions for future research are non-Gaussian and nonlinear error structures, model selection, and properties of estimators.
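The well-known Kalman recursions that the tree-structured filter generalizes can be shown in their simplest scalar, time-ordered form; on a stream network, "previous time" would be replaced by the upstream reaches feeding a confluence. All model constants below are invented for illustration.

```python
import numpy as np

# Scalar linear-Gaussian state-space model (illustrative numbers):
#   state:  x_t = phi * x_{t-1} + w_t,  w_t ~ N(0, q)
#   obs:    y_t = x_t + e_t,            e_t ~ N(0, r)
rng = np.random.default_rng(3)
phi, q, r, n = 0.8, 0.5, 2.0, 500

x = np.zeros(n)
y = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.normal(scale=np.sqrt(q))
    y[t] = x[t] + rng.normal(scale=np.sqrt(r))

xf = np.zeros(n)    # filtered state means
p = 1.0             # filtered state variance
for t in range(1, n):
    xp, pp = phi * xf[t - 1], phi ** 2 * p + q   # predict step
    k = pp / (pp + r)                            # Kalman gain
    xf[t] = xp + k * (y[t] - xp)                 # update step
    p = (1.0 - k) * pp

mse_filter = ((xf - x) ** 2).mean()
mse_obs = ((y - x) ** 2).mean()
```

The filtered estimates track the latent states with substantially smaller error than the raw observations; the dissertation's contribution is running analogous predict/update sweeps over the flow-ordered tree rather than a time line.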
  • Item (Open Access)
    Statistical modeling with COGARCH(p,q) processes
    (Colorado State University. Libraries, 2009) Chadraa, Erdenebaatar, author; Brockwell, Peter J., advisor
    In this paper, a family of continuous time GARCH processes, generalizing the COGARCH(1, 1) process of Klüppelberg et al. (2004), is introduced and studied. The resulting COGARCH(p,q) processes, q ≥ p ≥ 1, exhibit many of the characteristic features of observed financial time series, while their corresponding volatility and squared increment processes display a broader range of autocorrelation structures than those of the COGARCH(1, 1) process. We establish sufficient conditions for the existence of a strictly stationary non-negative solution of the equations for the volatility process and, under conditions which ensure the finiteness of the required moments, determine the autocorrelation functions of both the volatility and squared increment processes. The volatility process is found to have the autocorrelation function of a continuous-time ARMA process, while the squared increment process has the autocorrelation function of an ARMA process.
  • Item (Open Access)
    Model selection based on expected squared Hellinger distance
    (Colorado State University. Libraries, 2007) Cao, Xiaofan, author; Iyer, Hariharan K., advisor; Wang, Haonan, advisor
    This dissertation is motivated by a general model selection problem such that the true model is unknown and one or more approximating parametric families of models are given along with strategies for estimating the parameters using data. We develop model selection methods based on Hellinger distance that can be applied to a wide range of modeling problems without posing the typical assumptions for the true model to be within the approximating families or to come from a particular parametric family. We propose two estimators for the expected squared Hellinger distance as the model selection criteria.
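The selection criterion can be made concrete with a numerical version of the squared Hellinger distance, H²(f, g) = 1 − ∫√(f·g). This sketch compares two candidate fits against a density that is known here only for illustration (in the dissertation's setting the truth is unknown and H² must be estimated from data); the grid and candidate parameters are invented.

```python
import numpy as np

grid = np.linspace(-10.0, 15.0, 20_001)
dx = grid[1] - grid[0]

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def hellinger_sq(f, g):
    """Squared Hellinger distance 1 - integral sqrt(f*g), trapezoid rule."""
    y = np.sqrt(f * g)
    return 1.0 - (y.sum() - 0.5 * (y[0] + y[-1])) * dx

truth = normal_pdf(grid, 0.0, 1.0)       # stand-in for the unknown truth
model_a = normal_pdf(grid, 0.2, 1.0)     # candidate fit 1
model_b = normal_pdf(grid, 3.0, 1.0)     # candidate fit 2

h2_a = hellinger_sq(truth, model_a)
h2_b = hellinger_sq(truth, model_b)
```

For equal-variance normals the closed form H² = 1 − exp(−(μ₁−μ₂)²/8) checks the quadrature, and selecting the model with smaller estimated H² picks the nearby candidate, as intended.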
  • Item (Open Access)
    Bayesian models and streaming samplers for complex data with application to network regression and record linkage
    (Colorado State University. Libraries, 2023) Taylor, Ian M., author; Kaplan, Andee, advisor; Fosdick, Bailey K., advisor; Keller, Kayleigh P., committee member; Koslovsky, Matthew D., committee member; van Leeuwen, Peter Jan, committee member
    Real-world statistical problems often feature complex data due to either the structure of the data itself or the methods used to collect the data. In this dissertation, we present three methods for the analysis of specific complex data: Restricted Network Regression, Streaming Record Linkage, and Generative Filtering. Network data contain observations about the relationships between entities. Applying mixed models to network data can be problematic when the primary interest is estimating unconditional regression coefficients and some covariates are exactly or nearly in the vector space of node-level effects. We introduce the Restricted Network Regression model that removes the collinearity between fixed and random effects in network regression by orthogonalizing the random effects against the covariates. We discuss the change in the interpretation of the regression coefficients in Restricted Network Regression and analytically characterize the effect of Restricted Network Regression on the regression coefficients for continuous response data. We show through simulation on continuous and binary data that Restricted Network Regression mitigates, but does not eliminate, network confounding. We apply the Restricted Network Regression model in an analysis of 2015 Eurovision Song Contest voting data and show how the choice of regression model affects inference. Data that are collected from multiple noisy sources pose challenges to analysis due to potential errors and duplicates. Record linkage is the task of combining records from multiple files which refer to overlapping sets of entities when there is no unique identifying field. In streaming record linkage, files arrive sequentially in time and estimates of links are updated after the arrival of each file.
We approach streaming record linkage from a Bayesian perspective with estimates calculated from posterior samples of parameters, and present methods for updating link estimates after the arrival of a new file that are faster than fitting a joint model with each new data file. We generalize a two-file Bayesian Fellegi-Sunter model to the multi-file case and propose two methods to perform streaming updates. We examine the effect of the prior distribution on the resulting linkage accuracy as well as the computational trade-offs between the methods when compared to a Gibbs sampler through simulated and real-world survey panel data. We achieve near-equivalent posterior inference at a small fraction of the compute time. Motivated by the streaming data setting and streaming record linkage, we propose a more general sampling method for Bayesian models for streaming data. In the streaming data setting, Bayesian models can employ recursive updates, incorporating each new batch of data into the model parameters' posterior distribution. Filtering methods are currently used to perform these updates efficiently; however, they suffer from eventual degradation as the number of unique values within the filtered samples decreases. We propose Generative Filtering, a method for efficiently performing recursive Bayesian updates in the streaming setting. Generative Filtering retains the speed of a filtering method while using parallel updates to avoid degenerate distributions after repeated applications. We derive rates of convergence for Generative Filtering and conditions for the use of sufficient statistics instead of storing all past data. We investigate properties of Generative Filtering through simulation and ecological species count data.
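The recursive-update idea underlying streaming Bayesian inference can be shown in its simplest conjugate special case (Beta-Binomial), where the posterior after each batch becomes the prior for the next and exactly matches the joint analysis. This is only an illustration of the setting: Generative Filtering targets models where no such closed form exists and sampling-based filtering degrades. The data-generating probability and batch sizes are invented.

```python
import numpy as np

rng = np.random.default_rng(11)
batches = [rng.binomial(1, 0.3, size=50) for _ in range(10)]  # data stream

a, b = 1.0, 1.0                        # Beta(1, 1) prior
for batch in batches:                  # recursive (streaming) updates
    a += batch.sum()                   # successes update the first shape
    b += len(batch) - batch.sum()      # failures update the second shape

all_data = np.concatenate(batches)     # joint ("refit everything") analysis
a_joint = 1.0 + all_data.sum()
b_joint = 1.0 + len(all_data) - all_data.sum()
posterior_mean = a / (a + b)
```

The streamed and joint posteriors agree exactly here; the dissertation's contribution is recovering this agreement approximately, and without degeneracy, when updates must be carried by samples rather than sufficient statistics.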
  • Item (Open Access)
    Integrated statistical models in ecology
    (Colorado State University. Libraries, 2023) Van Ee, Justin, author; Hooten, Mevin, advisor; Koslovsky, Matthew, advisor; Keller, Kayleigh, committee member; Kaplan, Andee, committee member; Bailey, Larissa, committee member
    The number of endangered and vulnerable species continues to grow globally as a result of habitat destruction, overharvesting, invasive species, and climate change. Understanding the drivers of population decline is pivotal for informing species conservation. Many datasets collected are restricted to a limited portion of the species range, may not include observations of other organisms in the community, or lack temporal breadth. When analyzed independently, these datasets often overlook drivers of population decline, muddle community responses to ecological threats, and poorly predict population trajectories. Over the last decade, thanks to efforts like The Long Term Ecological Research Network and National Ecological Observatory Network, citizen science surveys, and technological advances, ecological datasets that provide insights about collections of organisms or multiple characteristics of the same organism have become prevalent. The conglomerate of datasets has the potential to provide novel insights, improve predictive performance, and disentangle the contributions of confounded factors, but specifying joint models that assimilate all the available data sources is both intellectually daunting and computationally prohibitive. I develop methodology for specifying computationally efficient integrated models. I discuss datasets frequently collected in ecology, objectives common to many analyses, and the methodological challenges associated with specifying joint models in these contexts. I introduce a suite of model building and computational techniques I used to facilitate inference in three applied analyses of ecological data. In a case study of the joint mammalian response to the bark beetle epidemic in Colorado, I describe a restricted regression approach to deconfounding the effects of environmental factors and community structure on species distributions. 
I highlight that fitting certain joint species distribution models in a restricted parameterization improves sampling efficiency. To improve abundance estimates for a federally protected species, I specify an integrated model for analyzing independent aerial and ground surveys. I use a Markov melding approach to facilitate posterior inference and construct the joint distribution implied by the prior information, assumptions, and data expressed across a chain of submodels. I extend the integrated model by assimilating additional demographic surveys of the species that allow abundance estimates to be linked to annual variability in population vital rates. To reduce computation time, both models are fit using a multi-stage Markov chain Monte Carlo algorithm with parallelization. In each applied analysis, I uncover associations that would have been overlooked had the datasets been analyzed independently and improve predictive performance relative to models fit to individual datasets.
  • Item (Open Access)
    Statistical models for COVID-19 infection fatality rates and diagnostic test data
    (Colorado State University. Libraries, 2023) Pugh, Sierra, author; Wilson, Ander, advisor; Fosdick, Bailey K., advisor; Keller, Kayleigh, committee member; Meyer, Mary, committee member; Gutilla, Molly, committee member
    The COVID-19 pandemic has had devastating impacts worldwide. Early in the pandemic, little was known about the emerging disease. It was essential to develop data science tools to inform public health policy and interventions. We developed methods to fill three gaps in the literature. A first key task for scientists at the start of the pandemic was to develop diagnostic tests to classify an individual's disease status as positive or negative and to estimate community prevalence. Researchers rapidly developed diagnostic tests, yet there was a lack of guidance on how to select a cutoff to classify positive and negative test results for COVID-19 antibody tests developed with limited numbers of controls with known disease status. We propose selecting a cutoff using extreme value theory and compare this method to existing methods through a data analysis and simulation study. Second, there was no cohesive method for estimating the infection fatality rate (IFR) of COVID-19 that fully accounted for uncertainty in the fatality data, seroprevalence study data, and antibody test characteristics. We developed a Bayesian model to jointly model these data to fully account for the many sources of uncertainty. A third challenge is providing information that can be used to compare seroprevalence and IFR across locations to best allocate resources and target public health interventions. It is particularly important to account for differences in age-distributions when comparing across locations as age is a well-established risk factor for COVID-19 mortality. There is a lack of methods for estimating the seroprevalence and IFR as continuous functions of age, while adequately accounting for uncertainty. We present a Bayesian hierarchical model that jointly estimates seroprevalence and IFR as continuous functions of age, sharing information across locations to improve identifiability.
We use this model to estimate seroprevalence and IFR in 26 developing country locations.
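The idea of choosing a positivity cutoff from a limited set of negative controls via an extreme-value tail model can be sketched as follows. This is a hedged toy version, not the dissertation's method: it fits an exponential tail (a generalized Pareto with shape zero) above a high threshold by maximum likelihood, and the assay values and target false-positive rate are invented.

```python
import numpy as np

rng = np.random.default_rng(5)
neg_controls = rng.lognormal(mean=0.0, sigma=0.5, size=300)  # assay signals
alpha = 0.005                          # target false-positive rate

u = np.quantile(neg_controls, 0.90)    # high threshold for the tail model
exceed = neg_controls[neg_controls > u] - u
scale = exceed.mean()                  # exponential-tail MLE of the scale
p_u = (neg_controls > u).mean()        # empirical P(X > u)

# Solve p_u * exp(-(cutoff - u) / scale) = alpha for the cutoff:
cutoff = u + scale * np.log(p_u / alpha)
```

Because the cutoff extrapolates beyond the observed controls using the fitted tail, it can target false-positive rates far smaller than 1/(number of controls) — the situation antibody test developers faced with few known negatives.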
  • Item (Open Access)
    Methodology in air pollution epidemiology for large-scale exposure prediction and environmental trials with non-compliance
    (Colorado State University. Libraries, 2023) Ryder, Nathan, author; Keller, Kayleigh, advisor; Wilson, Ander, committee member; Cooley, Daniel, committee member; Neophytou, Andreas, committee member
    Exposure to airborne pollutants, both long- and short-term, can lead to harmful respiratory, cardiovascular, and cardiometabolic outcomes. Multiple challenges arise in the study of relationships between ambient air pollution and health outcomes. For example, in large observational cohort studies, individual measurements are not feasible so researchers use small sets of pollutant concentration measurements to predict subject-level exposures. As a second example, inconsistent compliance of subjects to their assigned treatments can affect results from randomized controlled trials of environmental interventions. In this dissertation, we present methods to address these challenges. We develop a penalized regression model that can predict particulate matter exposures in space and time, including penalties to discourage overfitting and encourage smoothness in time. This model is more accurate than spatial-only and spatiotemporal universal kriging (UK) models when the exposures are missing in a regular (semi-daily) pattern. Our penalized regression model is also faster than both UK models, allowing the use of bootstrap methods to account for measurement error bias and monitor site selection in a two-stage health model. We introduce methods to estimate causal effects in a longitudinal setting by latent "at-the-time" principal strata. We implement an array of linear mixed models on data subsets, each with weights derived from principal scores. In addition, we estimate the same stratified causal effects with a Bayesian mixture model. The weighted linear mixed models outperform the Bayesian mixture model and an existing single-measure principal scores method in all simulation scenarios, and are the only method to produce a significant estimate for a causal effect of treatment assignment by strata when applied to a Honduran cookstove intervention study. 
Finally, we extend the "at-the-time" longitudinal principal stratification framework to a setting where continuous exposure measurements are the post-treatment variable by which the latent strata are defined. We categorize the continuous exposures into a binary variable in order to use our previous method of weighted linear mixed models. We also extend an existing Bayesian approach to the longitudinal setting, which does not require categorization of the exposures. The previous weighted linear mixed model and single-measure principal scores methods are negatively biased when applied to simulated samples, while the Bayesian approach produces the lowest RMSE and bias near zero. The Bayesian approach, when applied to the same Honduran cookstove intervention study as before, does not find a significant estimate for the causal effect of treatment assignment by strata.
  • ItemOpen Access
    Application of statistical and deep learning methods to power grids
    (Colorado State University. Libraries, 2023) Rimkus, Mantautas, author; Kokoszka, Piotr, advisor; Wang, Haonan, advisor; Nielsen, Aaron, committee member; Cooley, Dan, committee member; Chen, Haonan, committee member
    The structure of power flows in transmission grids is evolving and is likely to change significantly in the coming years due to the rapid growth of renewable energy generation that introduces randomness and bidirectional power flows. Another transformative aspect is the increasing penetration of various smart-meter technologies. Inexpensive measurement devices can be placed at practically any component of the grid. As a result, traditional fault detection methods may no longer be sufficient. Consequently, there is a growing interest in developing new methods to detect power grid faults. Using model data, we first propose a two-stage procedure for detecting a fault in a regional power grid. In the first stage, a fault is detected in real time. In the second stage, the faulted line is identified with a negligible delay. The approach uses only the voltage modulus measured at buses (nodes of the grid) as the input. Our method does not require prior knowledge of the fault type. We further explore fault detection based on high-frequency data streams that are becoming available in modern power grids. Our approach can be treated as an online (sequential) change point monitoring methodology. However, due to the mostly unexplored and very nonstandard structure of high-frequency power grid streaming data, substantial new statistical development is required to make this methodology practically applicable. The work includes development of scalar detectors based on multichannel data streams, determination of data-driven alarm thresholds, and investigation of the performance and robustness of the new tools. Because we have access to a reasonably large database of faults, we can calculate frequencies of false and correct fault signals, and recommend implementations that optimize these empirical success rates. Next, we extend our proposed method for fault localization in a regional grid for scenarios where partial observability limits the available data.
While classification methods have been proposed for fault localization, their effectiveness depends on the availability of labeled data, which is often impractical in real-life situations. Our approach bridges the gap between partial and full observability of the power grid. We develop efficient fault localization methods that can operate effectively even when only a subset of power grid bus data is available. This work contributes to the research area of fault diagnosis in scenarios where the number of available phasor measurement unit devices is smaller than the number of buses in the grid. We propose using Graph Neural Networks in combination with statistical fault localization methods to localize faults in a regional power grid with minimal available data. Our contribution to the field of fault localization aims to enable the adoption of effective fault localization methods for future power grids.
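The abstract does not spell out the scalar detectors; a standard building block for this kind of online (sequential) change point monitoring is a one-sided CUSUM run on each channel, reduced to a single scalar alarm rule. The sketch below is illustrative only (function names, the drift constant k, and the threshold h are assumptions, with h playing the role of the data-driven alarm threshold):

```python
def cusum_alarm(stream, mu0, k, h):
    """One-sided CUSUM: accumulate exceedances of the in-control mean mu0
    (minus a drift allowance k) and alarm when the sum crosses h.
    Returns the first alarm index, or -1 if no alarm is raised."""
    s = 0.0
    for t, x in enumerate(stream):
        s = max(0.0, s + (x - mu0 - k))
        if s > h:
            return t
    return -1

def multichannel_alarm(streams, mu0s, k, h):
    """Scalar detector over multiple channels: alarm at the earliest
    per-channel CUSUM alarm time."""
    alarms = [a for a in (cusum_alarm(s, m, k, h)
                          for s, m in zip(streams, mu0s)) if a >= 0]
    return min(alarms) if alarms else -1

# A level shift at t = 50 on channel 1; channel 2 stays in control
ch1 = [0.0] * 50 + [2.0] * 30
ch2 = [0.0] * 80
t_alarm = multichannel_alarm([ch1, ch2], [0.0, 0.0], k=0.5, h=5.0)
```

In practice h would be calibrated on fault-free streams so that the empirical false-alarm rate meets a target, which is the role the data-driven thresholds play above.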
  • ItemOpen Access
    Causality and clustering in complex settings
    (Colorado State University. Libraries, 2023) Gibbs, Connor P., author; Keller, Kayleigh, advisor; Fosdick, Bailey, advisor; Koslovsky, Matthew, committee member; Kaplan, Andee, committee member; Anderson, Brooke, committee member
    Causality and clustering are at the forefront of many problems in statistics. In this dissertation, we present new methods and approaches for drawing causal inference with temporally dependent units and clustering nodes in heterogeneous networks. To begin, we investigate the causal effect of a timeout on stopping an opposing team's run in the National Basketball Association (NBA). After formalizing the notion of a run in the NBA and in light of the temporal dependence among runs, we define the units under study with careful consideration of the stable unit-treatment-value assumption pertinent to the Rubin causal model. After introducing a novel, interpretable outcome based on the score difference, we conclude that while comebacks frequently occur after a run, it is slightly disadvantageous to call a timeout during a run by the opposing team. Further, we demonstrate that the magnitude of this effect varies by franchise, lending clarity to an oft-debated topic among sports fans. Next, we represent the known relationships among and between genetic variants and phenotypic abnormalities as a heterogeneous network and introduce a novel analytic pipeline to identify clusters containing undiscovered gene-to-phenotype relations (ICCUR) from the network. ICCUR identifies, scores, and ranks small heterogeneous clusters according to their potential for future discovery in a large temporal biological network. We train an ensemble model of boosted regression trees to predict clusters' potential for future discovery using observable cluster features, and show the resulting clusters contain significantly more undiscovered gene-to-phenotype relations than expected by chance.
To demonstrate its use as a diagnostic aid, we apply the results of the ICCUR pipeline to real, undiagnosed patients with rare diseases, identifying clusters containing patients' co-occurring yet otherwise unconnected genotypic and phenotypic information, some of which have since been validated by human curation. Motivated by ICCUR and its application, we introduce a novel method called ECoHeN (pronounced "eco-hen") to extract communities from heterogeneous networks in a statistically meaningful way. Using a heterogeneous configuration model as a reference distribution, ECoHeN identifies communities that are significantly more densely connected than expected given the node types and connectivity of its membership without imposing constraints on the type composition of the extracted communities. The ECoHeN algorithm identifies communities one at a time through a dynamic set of iterative updating rules and is guaranteed to converge. To our knowledge, this is the first discovery method that distinguishes and identifies both homogeneous and heterogeneous, possibly overlapping, community structure in a network. We demonstrate the performance of ECoHeN through simulation and in application to a political blogs network to identify collections of blogs that reference one another more than expected given the ideology of their members. Along with small partisan communities, we demonstrate ECoHeN's ability to identify a large, bipartisan community undetectable by canonical community detection methods and denser than those identified by modern, competing methods.
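ECoHeN's extraction procedure is more involved than can be shown here, but its core test, whether a candidate community is more densely connected than a configuration-model reference predicts, can be sketched in the homogeneous (single node type) case. Everything below is a simplified illustration, not the ECoHeN algorithm:

```python
import numpy as np

def internal_density_zscore(adj, members):
    """Z-score of a candidate community's internal edge count against a
    configuration-model reference: each pair (i, j) is connected with
    approximate probability k_i * k_j / (2m)."""
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    two_m = deg.sum()                        # 2 * number of edges
    S = list(members)
    observed = adj[np.ix_(S, S)].sum() / 2.0
    expected, var = 0.0, 0.0
    for a in range(len(S)):
        for b in range(a + 1, len(S)):
            p = min(deg[S[a]] * deg[S[b]] / two_m, 1.0)
            expected += p
            var += p * (1.0 - p)             # Bernoulli variance approximation
    return (observed - expected) / np.sqrt(var) if var > 0 else 0.0

# Two 4-cliques joined by one bridge edge: a clique scores as unusually dense
A = np.zeros((8, 8))
for grp in ([0, 1, 2, 3], [4, 5, 6, 7]):
    for i in grp:
        for j in grp:
            if i != j:
                A[i, j] = 1.0
A[3, 4] = A[4, 3] = 1.0
z = internal_density_zscore(A, [0, 1, 2, 3])
```

A set straddling the bridge scores near or below zero, which is the signal the iterative updating rules exploit when deciding which node to add or remove next.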
  • ItemOpen Access
    Randomization tests for experiments embedded in complex surveys
    (Colorado State University. Libraries, 2022) Brown, David A., author; Breidt, F. Jay, advisor; Sharp, Julia, committee member; Zhou, Tianjian, committee member; Ogle, Stephen, committee member
    Embedding experiments in complex surveys has become increasingly important. For scientific questions, such embedding allows researchers to take advantage of both the internal validity of controlled experiments and the external validity of probability-based samples of a population. Within survey statistics, declining response rates have led to the development of new methods, known as adaptive and responsive survey designs, that try to increase or maintain response rates without negatively impacting survey quality. Such methodologies are assessed experimentally. Examples include a series of embedded experiments in the 2019 Triennial Community Health Survey (TCHS), conducted by the Health District of Northern Larimer County in collaboration with the Department of Statistics at Colorado State University, to determine the effects of monetary incentives, targeted mailing of reminders, and double-stuffed envelopes (including both English and Spanish versions of the survey) on response rates, cost, and representativeness of the sample. This dissertation develops methodology and theory of randomization-based tests embedded in complex surveys, assesses the methodology via simulation, and applies the methods to data from the 2019 TCHS. An important consideration in experiments to increase response rates is the overall balance of the sample, because higher overall response might still underrepresent important groups. There have been advances in recent years on methods to assess the representativeness of samples, including application of the dissimilarity index (DI) to help evaluate the representativeness of a sample under the different conditions in an incentive experiment (Biemer et al., 2018). We develop theory and methodology for design-based inference for the DI when used in a complex survey.
Simulation studies show that the linearization method performs well, with good confidence interval coverage even when the true DI is close to zero, though point estimates may be biased. We then develop a class of randomization tests for evaluating experiments embedded in complex surveys. We consider a general parametric contrast, estimated using the design-weighted Narain-Horvitz-Thompson (NHT) approach, in either a completely randomized design or a randomized complete block design embedded in a complex survey. We derive asymptotic normal approximations for the randomization distribution of a general contrast, from which critical values can be derived for testing the null hypothesis that the contrast is zero. The asymptotic results are conditioned on the complex sample, but we include results showing that, under mild conditions, the inference extends to the finite population. Further, we develop asymptotic power properties of the tests under moderate conditions. Through simulation, we illustrate asymptotic properties of the randomization tests and compare the normal approximations of the randomization tests with corresponding Monte Carlo tests, with a design-based test developed by van den Brakel, and with classical Fisher-Pitman-Welch and Neyman randomization tests. The randomization approach generalizes broadly to other kinds of embedded experimental designs and null hypothesis testing problems, for very general survey designs. The randomization approach is then extended from NHT estimators to generalized regression estimators that incorporate auxiliary information, and from linear contrasts to comparisons of nonlinear functions.
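The dissertation derives normal approximations to the randomization distribution; the Monte Carlo counterpart of such a test, for a design-weighted (NHT-style) two-arm contrast in a completely randomized design, is easy to sketch. The function names and the simple weighted mean-difference contrast below are illustrative assumptions:

```python
import numpy as np

def nht_contrast(y, w, z):
    """Design-weighted mean difference between treatment (z True) and control."""
    z = np.asarray(z, dtype=bool)
    return (np.sum(w[z] * y[z]) / np.sum(w[z])
            - np.sum(w[~z] * y[~z]) / np.sum(w[~z]))

def randomization_test(y, w, z, n_rand=1000, seed=0):
    """Monte Carlo randomization test of H0: contrast = 0, re-randomizing
    treatment labels conditional on the drawn sample."""
    rng = np.random.default_rng(seed)
    obs = nht_contrast(y, w, z)
    n, n_treat = len(y), int(np.sum(z))
    hits = 0
    for _ in range(n_rand):
        perm = np.zeros(n, dtype=bool)
        perm[rng.choice(n, n_treat, replace=False)] = True
        if abs(nht_contrast(y, w, perm)) >= abs(obs):
            hits += 1
    return obs, (hits + 1) / (n_rand + 1)   # add-one p-value

# A strong treatment effect yields a small p-value
y = np.array([10.0] * 10 + [0.0] * 10)
w = np.ones(20)
z = np.array([True] * 10 + [False] * 10)
obs, p = randomization_test(y, w, z)
```

The normal approximation developed in the dissertation replaces the Monte Carlo loop with critical values from an asymptotic distribution, which is what makes the procedure cheap at survey scale.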
  • ItemOpen Access
    The pooling of prior distributions via logarithmic and supra-Bayesian methods with application to Bayesian inference in deterministic simulation models
    (Colorado State University. Libraries, 1998) Roback, Paul J., author; Givens, Geof, advisor; Hoeting, Jennifer, committee member; Howe, Adele, committee member; Tweedie, Richard, committee member
    We consider Bayesian inference when priors and likelihoods are both available for inputs and outputs of a deterministic simulation model. Deterministic simulation models are used frequently by scientists to describe natural systems, and the Bayesian framework provides a natural vehicle for incorporating uncertainty in a deterministic model. The problem of making inference about parameters in deterministic simulation models is fundamentally related to the issue of aggregating (i.e., pooling) expert opinion. Alternative strategies for aggregation are surveyed and four approaches are discussed in detail: logarithmic pooling, linear pooling, French-Lindley supra-Bayesian pooling, and Lindley-Winkler supra-Bayesian pooling. The four pooling approaches are compared with respect to three suitability factors: theoretical properties, performance in examples, and the selection and sensitivity of hyperparameters or weightings incorporated in each method. The logarithmic pool is found to be the most appropriate pooling approach when combining expert opinions in the context of deterministic simulation models. We develop an adaptive algorithm for estimating log pooled priors for parameters in deterministic simulation models. Our adaptive estimation approach relies on importance sampling methods, density estimation techniques for which we numerically approximate the Jacobian, and nearest neighbor approximations in cases in which the model is noninvertible. This adaptive approach is compared to a nonadaptive approach over several examples ranging from a relatively simple R1 → R1 example with normally distributed priors and a linear deterministic model, to a relatively complex R2 → R2 example based on the bowhead whale population model. In each case, our adaptive approach leads to better and more efficient estimates of the log pooled prior than the nonadaptive estimation algorithm.
Finally, we extend our inferential ideas to a higher-dimensional, realistic model for AIDS transmission. Several unique contributions to the statistical discipline are contained in this dissertation, including: (1) the application of logarithmic pooling to inference in deterministic simulation models; (2) the algorithm for estimating log pooled priors using an adaptive strategy; (3) the Jacobian-based approach to density estimation in this context, especially in higher dimensions; (4) the extension of the French-Lindley supra-Bayesian methodology to continuous parameters; (5) the extension of the Lindley-Winkler supra-Bayesian methodology to multivariate parameters; and (6) the proofs and illustrations of the failure of Relative Propensity Consistency under the French-Lindley supra-Bayesian approach.
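Logarithmic pooling combines expert densities as a weighted geometric mean, p(θ) ∝ ∏ p_i(θ)^{w_i}. A minimal grid-based sketch follows; the normalization-on-a-grid approach here is an illustration, not the dissertation's adaptive importance-sampling algorithm:

```python
import numpy as np

def log_pool(densities, weights, dx):
    """Logarithmic (geometric) pool on a grid: prod_i p_i^{w_i}, renormalized."""
    logs = sum(w * np.log(np.maximum(p, 1e-300))
               for p, w in zip(densities, weights))
    pooled = np.exp(logs - logs.max())      # stabilize before exponentiating
    return pooled / (pooled.sum() * dx)

def normal_pdf(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

# Pooling two normal priors: the log pool of normals is again normal,
# with precision 0.5/1**2 + 0.5/2**2 = 0.625 and mean -0.4
x = np.linspace(-10.0, 10.0, 2001)
dx = x[1] - x[0]
pooled = log_pool([normal_pdf(x, -1.0, 1.0), normal_pdf(x, 2.0, 2.0)],
                  [0.5, 0.5], dx)
pooled_mean = np.sum(x * pooled) * dx
```

The closed-form normal result is what makes this a useful sanity check: grid pooling should recover the known pooled mean and integrate to one.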
  • ItemOpen Access
    Transformed-linear models for time series extremes
    (Colorado State University. Libraries, 2022) Mhatre, Nehali, author; Cooley, Daniel, advisor; Kokoszka, Piotr, committee member; Shaby, Benjamin, committee member; Wang, Tianyang, committee member
    In order to capture the dependence in the upper tail of a time series, we develop nonnegative regularly-varying time series models that are constructed similarly to classical non-extreme ARMA models. Rather than fully characterizing tail dependence of the time series, we define the concept of weak tail stationarity which allows us to describe a regularly-varying time series through the tail pairwise dependence function (TPDF), a measure of pairwise extremal dependence. We state consistency requirements among the finite-dimensional collections of the elements of a regularly-varying time series and show that the TPDF's value does not depend on the dimension being considered. So that our models take nonnegative values, we use transformed-linear operations. We show existence and stationarity of these models, and develop their properties, such as their model TPDFs. Motivated by investigating conditions conducive to the spread of wildfires, we fit models to hourly windspeed data using a preliminary estimation method and find that the fitted transformed-linear models produce better estimates of upper tail quantities than either traditional ARMA models or classical linear regularly-varying models. The innovations algorithm is a classical recursive algorithm used in time series analysis. We develop an analogous transformed-linear innovations algorithm for our time series models that allows us to perform prediction, which is fundamental to any time series analysis. The transformed-linear innovations algorithm also enables us to estimate parameters of the transformed-linear regularly-varying moving average models, thus providing a tool for modeling. We construct an inner product space of transformed-linear combinations of nonnegative regularly-varying random variables and prove its link to a Hilbert space which allows us to employ the projection theorem. We develop the transformed-linear innovations algorithm using the properties of the projection theorem.
Turning our attention to the class of MA(∞) models, we discuss estimation and show that this class of models is dense in the class of possible TPDFs. We also develop an extremes analogue of the classical Wold decomposition. A simulation study shows that our class of models provides adequate approximations to the GARCH model and to another model outside our class. The transformed-linear innovations algorithm yields best linear predictions, and we also develop prediction intervals based on the geometry of regular variation. A simulation study shows that these intervals attain good coverage rates. We perform modeling and prediction for the hourly windspeed data by applying the innovations algorithm to the estimated TPDF.
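Transformed-linear operations in this line of work are built from a transform t mapping the real line onto the positive half-line, commonly the softplus t(y) = log(1 + e^y); addition and scalar multiplication are performed on the transformed scale so that results stay nonnegative. A minimal sketch (treating softplus as the transform is an assumption of this illustration):

```python
import numpy as np

def t(y):
    """Softplus: maps the real line onto (0, inf)."""
    return np.log1p(np.exp(y))

def t_inv(x):
    """Inverse softplus."""
    return np.log(np.expm1(x))

def tadd(x1, x2):
    """Transformed addition: add on the transformed scale, map back."""
    return t(t_inv(x1) + t_inv(x2))

def tscale(a, x):
    """Transformed scalar multiplication."""
    return t(a * t_inv(x))

# For large arguments t(y) ~ y, so transformed operations agree with the
# ordinary ones in the upper tail -- the regime that matters for extremes --
# while small inputs still combine to a strictly positive result.
big = tadd(10.0, 10.0)      # approximately 20
small = tadd(0.1, 0.1)      # positive, but well below 0.2
```

This tail-agreement property is what lets the transformed-linear models inherit ARMA-like structure while remaining nonnegative and regularly varying.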