Browsing by Author "Fosdick, Bailey K., advisor"
Item Open Access Bayesian models and streaming samplers for complex data with application to network regression and record linkage (Colorado State University. Libraries, 2023) Taylor, Ian M., author; Kaplan, Andee, advisor; Fosdick, Bailey K., advisor; Keller, Kayleigh P., committee member; Koslovsky, Matthew D., committee member; van Leeuwen, Peter Jan, committee member

Real-world statistical problems often feature complex data due to either the structure of the data itself or the methods used to collect the data. In this dissertation, we present three methods for the analysis of specific complex data: Restricted Network Regression, Streaming Record Linkage, and Generative Filtering. Network data contain observations about the relationships between entities. Applying mixed models to network data can be problematic when the primary interest is estimating unconditional regression coefficients and some covariates are exactly or nearly in the vector space of node-level effects. We introduce the Restricted Network Regression model that removes the collinearity between fixed and random effects in network regression by orthogonalizing the random effects against the covariates. We discuss the change in the interpretation of the regression coefficients in Restricted Network Regression and analytically characterize the effect of Restricted Network Regression on the regression coefficients for continuous response data. We show through simulation on continuous and binary data that Restricted Network Regression mitigates, but does not eliminate, network confounding. We apply the Restricted Network Regression model in an analysis of 2015 Eurovision Song Contest voting data and show how the choice of regression model affects inference. Data that are collected from multiple noisy sources pose challenges to analysis due to potential errors and duplicates.
Record linkage is the task of combining records from multiple files which refer to overlapping sets of entities when there is no unique identifying field. In streaming record linkage, files arrive sequentially in time and estimates of links are updated after the arrival of each file. We approach streaming record linkage from a Bayesian perspective with estimates calculated from posterior samples of parameters, and present methods for updating link estimates after the arrival of a new file that are faster than fitting a joint model with each new data file. We generalize a two-file Bayesian Fellegi-Sunter model to the multi-file case and propose two methods to perform streaming updates. We examine the effect of prior distribution on the resulting linkage accuracy as well as the computational trade-offs between the methods when compared to a Gibbs sampler through simulated and real-world survey panel data. We achieve near-equivalent posterior inference at a small fraction of the compute time. Motivated by the streaming data setting and streaming record linkage, we propose a more general sampling method for Bayesian models for streaming data. In the streaming data setting, Bayesian models can employ recursive updates, incorporating each new batch of data into the model parameters' posterior distribution. Filtering methods are currently used to perform these updates efficiently; however, they suffer from eventual degradation as the number of unique values within the filtered samples decreases. We propose Generative Filtering, a method for efficiently performing recursive Bayesian updates in the streaming setting. Generative Filtering retains the speed of a filtering method while using parallel updates to avoid degenerate distributions after repeated applications. We derive rates of convergence for Generative Filtering and conditions for the use of sufficient statistics instead of storing all past data.
We investigate properties of Generative Filtering through simulation and ecological species count data.

Item Open Access Regression of network data: dealing with dependence (Colorado State University. Libraries, 2019) Marrs, Frank W., author; Fosdick, Bailey K., advisor; Breidt, F. Jay, committee member; Zhou, Wen, committee member; Wilson, James B., committee member

Network data, which consist of measured relations between pairs of actors, characterize some of the most pressing problems of our time, from environmental treaty legislation to human migration flows. A canonical problem in analyzing network data is to estimate the effects of exogenous covariates on a response that forms a network. Unlike typical regression scenarios, network data often naturally engender excess statistical dependence -- beyond that represented by covariates -- due to relations that share an actor. For analyzing bipartite network data observed over time, we propose a new model that accounts for excess network dependence directly, as this dependence is of scientific interest. In an example of international state interactions, we are able to infer the networks of influence among the states, such as which states' military actions are likely to incite other states' military actions. In the remainder of the dissertation, we focus on situations where inference on effects of exogenous covariates on the network is the primary goal of the analysis, and thus, the excess network dependence is a nuisance effect. In this setting, we leverage an exchangeability assumption to propose novel parsimonious estimators of regression coefficients for both binary and continuous network data, and new estimators for coefficient standard errors for continuous network data. The exchangeability assumption we rely upon is pervasive in network and array models in the statistics literature, but has not previously been considered when adjusting for dependence in a regression of network data.
Although the estimators we propose are aligned with many network models in the literature, our estimators are derived from the assumption of exchangeability rather than from a particular parametric model for representing excess network dependence in the data.

Item Open Access Statistical models for COVID-19 infection fatality rates and diagnostic test data (Colorado State University. Libraries, 2023) Pugh, Sierra, author; Wilson, Ander, advisor; Fosdick, Bailey K., advisor; Keller, Kayleigh, committee member; Meyer, Mary, committee member; Gutilla, Molly, committee member

The COVID-19 pandemic has had devastating impacts worldwide. Early in the pandemic, little was known about the emerging disease. It was essential to develop data science tools to inform public health policy and interventions. We developed methods to fill three gaps in the literature. A first key task for scientists at the start of the pandemic was to develop diagnostic tests to classify an individual's disease status as positive or negative and to estimate community prevalence. Researchers rapidly developed diagnostic tests, yet there was a lack of guidance on how to select a cutoff to classify positive and negative test results for COVID-19 antibody tests developed with limited numbers of controls with known disease status. We propose selecting a cutoff using extreme value theory and compare this method to existing methods through a data analysis and simulation study. Second, there was no cohesive method for estimating the infection fatality rate (IFR) of COVID-19 that fully accounted for uncertainty in the fatality data, seroprevalence study data, and antibody test characteristics. We developed a Bayesian model to jointly model these data to fully account for the many sources of uncertainty. A third challenge is providing information that can be used to compare seroprevalence and IFR across locations to best allocate resources and target public health interventions.
It is particularly important to account for differences in age distributions when comparing across locations, as age is a well-established risk factor for COVID-19 mortality. There is a lack of methods for estimating the seroprevalence and IFR as continuous functions of age while adequately accounting for uncertainty. We present a Bayesian hierarchical model that jointly estimates seroprevalence and IFR as continuous functions of age, sharing information across locations to improve identifiability. We use this model to estimate seroprevalence and IFR in 26 developing country locations.
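
Illustrative note (not part of the dissertation record): the abstract above ties together fatality data, seroprevalence data, and antibody test characteristics. As a rough sketch of how those three quantities relate, the Python below computes a simple point estimate of IFR after correcting the apparent seroprevalence for test sensitivity and specificity with the standard Rogan-Gladen adjustment. This is a swapped-in, non-Bayesian simplification that ignores uncertainty and age structure; it is not the author's model, and the function names and all input values are hypothetical.

```python
def rogan_gladen(apparent_prev, sensitivity, specificity):
    """Adjust an apparent (test-positive) prevalence for imperfect test accuracy.

    Hypothetical helper for illustration; uses the standard Rogan-Gladen formula.
    """
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("sensitivity + specificity must exceed 1")
    adjusted = (apparent_prev + specificity - 1.0) / denom
    # Clamp to [0, 1] because the raw adjustment can fall outside that range.
    return min(max(adjusted, 0.0), 1.0)


def ifr_point_estimate(deaths, population, apparent_prev, sensitivity, specificity):
    """Point estimate of IFR = deaths / estimated number of infections.

    Assumes (for illustration only) that deaths and seroprevalence refer to
    the same population and time window.
    """
    prevalence = rogan_gladen(apparent_prev, sensitivity, specificity)
    if prevalence == 0.0:
        raise ValueError("adjusted prevalence is zero; IFR is undefined")
    infections = prevalence * population
    return deaths / infections


if __name__ == "__main__":
    # Made-up inputs: 10% test-positive, 90% sensitivity, 98% specificity.
    ifr = ifr_point_estimate(
        deaths=50, population=100_000,
        apparent_prev=0.10, sensitivity=0.90, specificity=0.98,
    )
    print(f"IFR point estimate: {ifr:.4%}")
```

A point estimate like this gives no measure of uncertainty, which is the gap the abstract says a joint Bayesian model is meant to address.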