
Theses and Dissertations


Recent Submissions

  • ItemOpen Access
    Multi-channel factor analysis: properties, extensions, and applications
    (Colorado State University. Libraries, 2024) Stanton, Gray, author; Wang, Haonan, advisor; Scharf, Louis, advisor; Kokoszka, Piotr, committee member; Wang, Tianying, committee member; Luo, Jie, committee member
    Multi-channel Factor Analysis (MFA) extends factor analysis to the multi-channel or multi-view setting, where latent common factors influence all channels while distinct factors are specific to individual channels. The within- and across-channel covariance is determined by a low-rank matrix, a block-diagonal matrix with low-rank blocks, and a diagonal matrix, which provides a parsimonious model for both covariances. MFA and related multi-channel methods for data fusion are discussed in Chapter 1. Under conditions on the channel sizes and factor numbers, the results of Chapter 2 show that the generic global identifiability of the aforementioned covariance matrices can be guaranteed a priori, and the estimators obtained by maximizing a Gaussian likelihood are shown to be consistent and asymptotically normal even under misspecification. To handle temporal correlation in the latent factors, Chapter 3 introduces Multi-channel Factor Spectral Analysis (MFSA). Results for the identifiability and parameterization properties of the MFSA spectral density model are derived, and a Majorization-Minimization procedure to optimize the Whittle pseudo-likelihood is designed to estimate the MFSA parameters. A simulation study is conducted to explore how temporal correlations in the latent factors affect estimation, and it is demonstrated that MFSA significantly outperforms MFA when the factor series are highly autocorrelated. In Chapter 4, a locally stationary joint multivariate Gaussian process with MFA-type cross-sectional covariance is developed to model multi-vehicle trajectories in a highway environment. A dynamic model-based clustering procedure is designed to partition cohorts of nearby vehicles into pods based on the stability of the intra-pod relative vehicle configuration. The performance of this procedure is illustrated by its application to the Next GENeration SIMulation dataset of vehicle trajectories on U.S. Highway 101.
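    The covariance structure described in this abstract can be written out as a short sketch. The notation is an assumption made here for illustration and is not taken from the thesis: there are $C$ channels, channel $c$ has $n_c$ observed variables, $A$ stacks the loadings of the $r$ common factors, $B_c$ holds the loadings of the $r_c$ factors specific to channel $c$, and $\Phi$ is a diagonal matrix of idiosyncratic variances.

```latex
\Sigma \;=\; A A^{\top}
\;+\; \operatorname{blkdiag}\!\left(B_1 B_1^{\top}, \dots, B_C B_C^{\top}\right)
\;+\; \Phi,
\qquad
A \in \mathbb{R}^{n \times r},\quad
B_c \in \mathbb{R}^{n_c \times r_c},\quad
n = \sum_{c=1}^{C} n_c .
```

    Under this notation, the covariance between two different channels $c \neq d$ comes only from the common factors, $\Sigma_{cd} = A_c A_d^{\top}$, where $A_c$ denotes the rows of $A$ belonging to channel $c$, while the within-channel blocks also carry $B_c B_c^{\top}$ and the matching diagonal block of $\Phi$.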
  • ItemOpen Access
    A novel approach to statistical problems without identifiability
    (Colorado State University. Libraries, 2024) Adams, Addison D., author; Wang, Haonan, advisor; Zhou, Tianjian, advisor; Kokoszka, Piotr, committee member; Shaby, Ben, committee member; Ray, Indrakshi, committee member
    In this dissertation, we propose novel approaches to random coefficient regression (RCR) and the recovery of mixing distributions under nonidentifiable scenarios. The RCR model is an extension of the classical linear regression model that accounts for individual variation by treating the regression coefficients as random variables. A major interest lies in the estimation of the joint probability distribution of these random coefficients based on the observable samples of the outcome variable evaluated for different values of the explanatory variables. In Chapter 2, we consider fixed-design RCR models, under which the coefficient distribution is not identifiable. To tackle the challenges of nonidentifiability, we consider an equivalence class, in which each element is a plausible coefficient distribution that, for each value of the explanatory variables, yields the same distribution for the outcome variable. In particular, we formulate the approximations of the coefficient distributions as a collection of stochastic inverse problems, allowing for a more flexible nonparametric approach with minimal assumptions. An iterative approach is proposed to approximate the elements by incorporating an initial guess of a solution called the global ansatz. We further study its convergence and demonstrate its performance through simulation studies. The proposed approach is applied to a real data set from an acupuncture clinical trial. In Chapter 3, we consider the problem of recovering a mixing distribution, given a component distribution family and observations from a compound distribution. Most existing methods are restricted in scope in that they are developed for certain component distribution families or continuity structures of mixing distributions. We propose a new, flexible nonparametric approach with minimal assumptions. 
Our proposed method iteratively steps closer to the desired mixing distribution, starting from a user-specified distribution, and we further establish its convergence properties. Simulation studies are conducted to examine the performance of our proposed method. In addition, we demonstrate the utility of our proposed method through its application to two sets of real-world data, including prostate cancer data and Shakespeare's canon word count.
  • ItemEmbargo
    Bayesian tree based methods for longitudinally assessed environmental mixtures
    (Colorado State University. Libraries, 2024) Im, Seongwon, author; Wilson, Ander, advisor; Keller, Kayleigh, committee member; Koslovsky, Matt, committee member; Neophytou, Andreas, committee member
    In various fields, there is interest in estimating the lagged association between an exposure and an outcome. This is particularly common in environmental health studies, where exposure to an environmental chemical is measured repeatedly during gestation for the assessment of its lagged effects on a birth outcome. The relationship between longitudinally assessed environmental mixtures and a health outcome is also of great interest. For a single exposure, a distributed lag model (DLM) is a widely used method that provides an appropriate temporal structure for estimating the time-varying effects. For mixture exposures, a distributed lag mixture model is used to address the main effect of each exposure and lagged interactions among exposures. The main inferential goals include estimating the lag-specific effects and identifying a window of susceptibility, during which a fetus is particularly vulnerable. In this dissertation, we propose novel statistical methods for estimating exposure effects of longitudinally assessed environmental mixtures in various scenarios. First, we propose a method that can estimate a linear exposure-time-response function between mixture exposures and a count outcome that may be zero-inflated and overdispersed. To achieve this, we employ a Bayesian Pólya-Gamma data augmentation with a treed distributed lag mixture model framework. We apply the method to estimate the relationship of weekly average fine particulate matter (PM2.5) and temperature with pregnancy loss, using a live-birth-identified conception time series design with administrative data from Colorado. Second, we propose a tree triplet structure to allow for heterogeneity in exposure effects in an environmental mixture exposure setting. Our method accommodates modifier and exposure selection, which allows for personalized and subgroup-specific effect estimation and identification of windows of susceptibility. 
We apply the method to Colorado administrative birth data to examine the heterogeneous relationship between PM2.5 and temperature and birth weight. Finally, we introduce an R package dlmtree that integrates tree structured DLM methods into convenient software. We provide an overview of the embedded tree structured DLMs and use simulated data to demonstrate a model fitting process, statistical inference, and visualization.
  • ItemOpen Access
    Advances in Bayesian spatial statistics for ecology and environmental science
    (Colorado State University. Libraries, 2024) Wright, Wilson J., author; Hooten, Mevin B., advisor; Cooley, Daniel S., advisor; Keller, Kayleigh P., committee member; Kaplan, Andee, committee member; Ross, Matthew R. V., committee member
    In this dissertation, I develop new Bayesian methods for analyzing spatial data from applications in ecology and environmental science. In particular, I focus on methods for mechanistic spatial models and binary spatial processes. I first consider the distribution of heavy metal pollution from a mining road in Cape Krusenstern, Alaska, USA. I develop a mechanistic spatial model that uses the physical process of atmospheric dispersion to characterize the spatial structure in these data. This approach directly incorporates scientific knowledge about how pollutants spread and provides inferences about this process. To assess how the heavy metal pollution impacts the vegetation community in Cape Krusenstern, I also develop a new model that represents plant cover for multiple species using clipped Gaussian processes. This approach is applicable to multiscale and multivariate binary processes that are observed at point locations — including multispecies plant cover data collected using the point intercept method. By directly analyzing the point-level data, instead of aggregating observations to the plot-level, this model allows for inferences about both large-scale and small-scale spatial dependence in plant cover. Additionally, it also incorporates dependence among different species at the small spatial scale. The third model I develop is motivated by ecological studies of wildlife occupancy. Similar to plant cover, species occurrence can be modeled as a binary spatial process. However, occupancy data are inherently measured at areal survey units. I develop a continuous-space occupancy model that accounts for the change of spatial support between the occurrence process and the observed data. All of these models are implemented using Bayesian methods and I present computationally efficient methods for fitting them. This includes a new surrogate data slice sampler for implementing models with latent nearest neighbor Gaussian processes.
  • ItemOpen Access
    Applications of least squares penalized spline density estimator
    (Colorado State University. Libraries, 2024) Jing, Hanxiao, author; Meyer, Mary, advisor; Cooley, Daniel, committee member; Kokoszka, Piotr, committee member; Berger, Joshua, committee member
    The spline-based method stands as one of the most common nonparametric approaches. The work in this dissertation explores three applications of the least squares penalized spline density estimator. First, we present a novel hypothesis test against the unimodality of density functions, based on unimodal and bimodal estimates of the density function, using penalized splines. The test statistic is the difference in the least-squares criterion between these fits. The distribution of the test statistic under the null hypothesis is estimated via simulated data sets from the unimodal fit. Large sample theory is derived and simulation studies are conducted to compare its performance with other common methods across various scenarios, alongside a real-world application involving neuro-transmission data from guinea pig brains. Second, we tackle the deconvolution density estimation problem, introducing the penalized splines deconvolution estimator. Building upon the results gained from piecewise constant splines, we achieve a cube-root convergence rate for piecewise quadratic splines and uniform errors. Moreover, we derive large sample theories for the penalized spline estimator and the constrained spline estimator. Simulation studies illustrate the competitive performance of our estimators compared to the kernel estimators across diverse scenarios. Lastly, drawing inspiration from the preceding applications, we develop a hypothesis test to discern whether the underlying density is unimodal or multimodal, given data with measurement error. Under the assumption of uniform errors, we introduce the test and derive the test statistic. Simulations are conducted to show the performance of the proposed test under different conditions.
  • ItemOpen Access
    Population size estimation using the modified Horvitz-Thompson estimator with estimated sighting probability
    (Colorado State University. Libraries, 1996) Wong, Char-Ngan, author; Bowden, David C., advisor
    Wildlife aerial population surveys usually use a two-stage sampling technique. The first stage involves dividing the whole survey area into smaller land units, which we call the primary units, and then taking a sample from those. In the second stage, an aerial survey of the selected units is made in an attempt to observe (count) every animal. Some animals, usually occurring in groups, are not observed for a variety of reasons. Estimates from these surveys are plagued with two major sources of error, namely, errors due to sampling variation in both stages. The first error may be controlled by choosing a suitable sampling plan for the first stage. The second error is also termed "visibility bias", which acknowledges that only a portion of the groups in a sampled land unit will be enumerated. The objective of our study is to provide improved variance estimators over those provided by Steinhorst and Samuel (1989) and to evaluate the performance of various corresponding interval procedures for estimating population size. For this purpose, we have found an asymptotically unbiased estimator for the approximate variance of the population size estimator when sighting probabilities of groups are unknown and fitted with a logistic model. We have broken down the approximate variance term into three components, namely, error due to sampling of primary units, error due to sighting of groups in second-stage sampling, and error due to estimation of the sighting probabilities, and we estimate all three components separately in order to get better insight into error control. Simplified versions of the variance estimators are provided when all primary units are surveyed and for stratified random sampling of primary units. The third central moment of the population size estimator was also obtained. 
Simulation studies were conducted to evaluate the performance of our asymptotically unbiased variance estimators and of confidence interval procedures such as the large sample procedure, with and without transformation, for constructing 90% and 95% confidence intervals for the population size. Confidence intervals for the population size were also constructed by assuming that $\log(\hat{\tau} - T)$ is normally distributed, where $\hat{\tau}$ is the population size estimate and $T$ is the number of animals seen in a sample obtained from a population survey. From our simulation results, we observed that the population size is estimated with negligible bias (according to Cochran's (1977) working rule) with a sample of at least 100 groups of elk obtained from a population survey when sighting probabilities are known. When sighting probabilities are unknown, one needs to conduct a sightability survey to obtain a sample, independent of the sample obtained from a population survey, for fitting a logistic model to estimate sighting probabilities of sighted groups in the sample obtained from the population survey. In this case, the population size is also estimated with negligible bias when the sample size of both samples is at least 100 groups of elk. We also observed that when sighting probabilities are known, we needed a sample of at least 348 groups of elk from a population survey to obtain reasonable coverage rates of the true population size. When sighting probabilities are unknown and estimated via logistic regression, both samples need to contain at least 428 groups of elk to obtain reasonable coverage rates of the true population size. Among all these confidence intervals, we found that the approximate confidence intervals constructed using the delta method, under the assumption that $\log(\hat{\tau} - T)$ is normally distributed, have better coverage rates and shorter estimated expected interval widths. 
Confidence intervals for the population size using bootstrapping were also evaluated. We were unable to find an existing bootstrapping procedure which could be directly applied to our problem. We have, therefore, proposed a couple of bootstrapping procedures for obtaining a sample to fit a logistic model and a couple of bootstrapping procedures for obtaining a sample to construct a population size estimate. With 1000 pairs of independent samples from a sightability survey and a population survey, each sample of size 107 groups of elk and using 500 bootstrap iterations, we obtained reasonable coverage rates of the true population size. Our other problem is model selection of a logistic model for the unknown sighting probabilities. We evaluated the performance of the population size estimator and our variance estimator when we fit a simpler model. For this purpose, we have derived theoretical expressions for the bias of the population size estimator and the mean-squared-error. We found, from our simulation results of fitting a couple of models simpler than the full model, that the population size was still well estimated for the fitted model based only on group size but was severely overestimated for the fitted model based only on percent of vegetation cover. For both fitted models, our variance estimator overestimated the observed variance of 1000 simulated population size estimates. We also found that the approximate expression of the expected value of the population size estimator we derived for a fitted model simpler than the full model has negligible bias (by Cochran's (1977) working rule) relative to the average of those 1000 simulated population size estimates. The approximate expression of the variance of the population size estimator we derived for this case somewhat underestimated the observed variance of those 1000 simulated population size estimates. 
Both approximate expressions apparently give us an idea of the expected size of the population size estimate and its variance when the fitted model is not the full model.
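    The estimator discussed in this abstract can be illustrated with a minimal sketch. It assumes the usual sightability-corrected Horvitz-Thompson form, in which each sighted group's size is divided by the inclusion probability of its primary unit and by its estimated sighting probability from a logistic model. The function names, the single covariate, the coefficient values, and the data are all hypothetical; the thesis's own estimator and variance formulas are not reproduced here.

```python
import math


def sighting_probability(intercept, slope, covariate):
    """Logistic sighting model: probability that a group is seen given one covariate."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * covariate)))


def modified_ht_estimate(groups, intercept, slope):
    """Sum group_size / (unit inclusion probability * sighting probability) over sighted groups."""
    total = 0.0
    for size, inclusion_prob, covariate in groups:
        p_seen = sighting_probability(intercept, slope, covariate)
        total += size / (inclusion_prob * p_seen)
    return total


# Hypothetical sighted groups: (group size, inclusion probability of its unit, covariate value)
sighted_groups = [(4, 0.5, 0.0), (10, 0.5, 2.0), (1, 0.25, -1.0)]

# Hypothetical logistic coefficients, as if fitted on a separate sightability survey
estimate = modified_ht_estimate(sighted_groups, intercept=0.0, slope=1.0)
print(round(estimate, 2))
```

    Because every divisor is a probability no larger than one, the estimate is never smaller than the raw count of animals seen, which is the sense in which the correction addresses visibility bias.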
  • ItemEmbargo
    Functional methods in outlier detection and concurrent regression
    (Colorado State University. Libraries, 2024) Creutzinger, Michael L., author; Cooley, Daniel, advisor; Sharp, Julia L., advisor; Koslovsky, Matt, committee member; Liebl, Dominik, committee member; Ortega, Francisco, committee member
    Functional data are data collected on a curve, or surface, over a continuum. The growing presence of high-resolution data has greatly increased the popularity of using and developing methods in functional data analysis (FDA). Functional data may be defined differently from other data structures, but similar ideas apply for these types of data, including data exploration, modeling and inference, and post-hoc analyses. The methods presented in this dissertation provide a statistical framework that allows a researcher to carry out an analysis of functional data from "start to finish". Even with functional data, there is a need to identify outliers prior to conducting statistical analysis procedures. Existing functional data outlier detection methodology requires the use of a functional data depth measure, functional principal components, and/or an outlyingness measure like Stahel-Donoho. Although effective, these functional outlier detection methods may not be easily interpreted. In this dissertation, we propose two new functional outlier detection methods. The first method, Practical Outlier Detection (POD), makes use of ordinary summary statistics (e.g., minimum, maximum, mean, and variance). The second method, Prediction Band Outlier Detection (PBOD), makes use of parametric, simultaneous prediction bands that meet nominal coverage levels. The two new outlier detection methods were compared to three existing outlier detection methods: MS-Plot, Massive Unsupervised Outlier Detection, and Total Variation Depth. In the simulation results, POD performs as well as, or better than, its counterparts in terms of specificity, sensitivity, accuracy, and precision. Similar results were found for PBOD, except for noticeably smaller values of specificity and accuracy than all other methods. Following data exploration and outlier detection, researchers often model their data. 
In FDA, functional linear regression uses a functional response Yi(t) and scalar and/or functional predictors, Xi(t). A functional concurrent regression model is estimated by regressing Yi on Xi pointwise at each sampling point, t. After estimating a regression model (functional or non-functional), it is common to estimate confidence and prediction intervals for parameter(s), including the conditional mean. A common way to obtain confidence/prediction intervals for simultaneous inference across the sampling domain is to use resampling methods (e.g., bootstrapping or permutation). We propose a new method for estimating parametric, simultaneous confidence and prediction bands for a functional concurrent regression model, without the use of resampling. The method uses Kac-Rice formulas for estimation of a critical value function, which is used with a functional pivot to acquire the simultaneous band. In the results, the proposed method meets nominal coverage levels for both confidence and prediction bands. The method we propose is also substantially faster to compute than methods that require resampling techniques. In linear regression, researchers may also assess if there are influential observations that may impact the estimates and results from the fitted model. Studentized difference in fits (DFFITS), studentized difference in regression coefficient estimates (DFBETAS), and/or Cook's Distance (D) can all be used to identify influential observations. For functional concurrent regression, these measures can be easily computed pointwise for each observation. However, the only current development is to use resampling techniques for estimating a null distribution of the average of each measure. Rather than using the average values and bootstrapping, we propose working with functional DFFITS (DFFITS(t)) directly. We show that if the functional errors are assumed to follow a Gaussian process, DFFITS(t) is distributed uniformly as a scaled Student's t process. 
Then, we propose using a multivariate Student's t distributional quantile for identifying influential functional observations with DFFITS(t). Our methodology ("Theoretical") is compared against a competing method that uses a parametric bootstrapping technique ("Bootstrapped") for estimating the null distribution of the mean absolute value of DFFITS(t). In the simulation and case study results, we find that the Theoretical method greatly reduces the computation time relative to the Bootstrapped method, with little loss in performance as measured by accuracy (ACC), precision (PPV), and the Matthews correlation coefficient (MCC). Furthermore, the average sensitivity of the Theoretical method is higher in all scenarios than that of the Bootstrapped method.
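    The pointwise estimation step of a functional concurrent regression mentioned in this abstract can be sketched as follows. This is a minimal illustration with a single functional predictor and ordinary least squares at each sampling point; the function name and the toy curves are hypothetical, and none of the band or influence methodology from the dissertation is implemented.

```python
def pointwise_ols(x_curves, y_curves):
    """Fit y_i(t) = b0(t) + b1(t) * x_i(t) separately at each sampling point t.

    x_curves and y_curves are lists of curves, each curve a list of values
    observed on a common grid. Returns a list of (intercept, slope) pairs.
    """
    n_points = len(x_curves[0])
    coefs = []
    for t in range(n_points):
        xs = [curve[t] for curve in x_curves]
        ys = [curve[t] for curve in y_curves]
        x_bar = sum(xs) / len(xs)
        y_bar = sum(ys) / len(ys)
        sxx = sum((x - x_bar) ** 2 for x in xs)
        sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        slope = sxy / sxx
        coefs.append((y_bar - slope * x_bar, slope))
    return coefs


# Toy data: three curves on a two-point grid, generated without noise from
# y(t0) = 1 + 2 * x(t0) and y(t1) = -1 + 0.5 * x(t1)
x_curves = [[0.0, 1.0], [1.0, 2.0], [2.0, 4.0]]
y_curves = [[1.0, -0.5], [3.0, 0.0], [5.0, 1.0]]
coefs = pointwise_ols(x_curves, y_curves)
print(coefs)
```

    Fitting each grid point independently keeps the sketch short; in practice the coefficient functions are often smoothed across t, which is outside the scope of this example.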
  • ItemUnknown
    Statistical modeling and inferences on directed networks
    (Colorado State University. Libraries, 2024) Du, Wenqin, author; Zhou, Wen, advisor; Breidt, F. Jay, committee member; Meyer, Mary, committee member; Pezeshki, Ali, committee member
    Network data have received great attention for the insights they offer into node interactions and underlying network dynamics. This dissertation contributes new modeling tools and inference procedures to the field of network analysis, incorporating the dependence structure inherently introduced by the network data. Our first direction centers on modeling directed edges with count measurements, an area that has received limited attention in the literature. Most existing methods either assume the count edges are derived from continuous random variables or model the edge dependence by parametric distributions. In this dissertation, we develop a latent multiplicative Poisson model for directed networks with count edges. Our approach directly models the edge dependence of count data by the pairwise dependence of latent errors, which are assumed to be weakly exchangeable. This assumption not only covers a variety of common network effects, but also leads to a concise representation of the error covariance. In addition, identification and inference of the mean structure, as well as the regression coefficients, depend on the errors only through their covariance, which provides substantial flexibility for our model. We propose a pseudo-likelihood based estimator for the regression coefficients that enjoys consistency and asymptotic normality. We evaluate our method by extensive numerical studies that corroborate the theory and apply our model to food-sharing network data to reveal interesting network effects that are further verified in the literature. In the second project, we study the inference procedure of network dependence structures. While much research has targeted network-covariate associations and community detection, the inference of important network effects such as the reciprocity and sender-receiver effects has been largely overlooked. 
Testing network effects for network data or weighted directed networks is challenging due to the intricate potential edge dependence. Most existing methods are model-based, carrying strong assumptions with restricted applicability. In contrast, we present a novel, fully nonparametric framework that requires only minimal regularity assumptions. While inspired by recent developments in the U-statistic literature, our work significantly broadens their scope. Specifically, we identify and carefully address the indeterminate degeneracy inherent in network effect estimators - a challenge that the aforementioned tools do not handle. We establish a Berry-Esseen type bound for the accuracy of type-I error rate control, and a novel analysis shows the minimax optimality of our test's power. Simulations highlight the superiority of our method in computation speed, accuracy, and numerical robustness relative to benchmarks. To showcase the practicality of our methods, we apply them to two real-world relationship networks, one in faculty hiring networks and the other in international trade networks. Finally, this dissertation introduces modeling strategies and corresponding methods for discerning the core-periphery (CP) structure in weighted directed networks. We adopt the signal-plus-noise model, categorizing uniform relational patterns as non-informative, by which we define the sender and receiver peripheries. Furthermore, instead of confining the core component to a specific structure, we consider it complementary to either the sender or receiver peripheries. Based on our definitions of the sender and receiver peripheries, we propose spectral algorithms to identify the CP structure in weighted directed networks. Our algorithm stands out with statistical guarantees, ensuring the identification of sender and receiver peripheries with overwhelming probability. Additionally, our methods scale effectively for expansive directed networks. 
We evaluate the proposed methods in extensive simulation studies and apply them to faculty hiring network data, revealing captivating insights into the informative and non-informative sender/receiver behaviors.
  • ItemUnknown
    Test of change point versus long-range dependence in functional time series
    (Colorado State University. Libraries, 2024) Meng, Xiangdong, author; Kokoszka, Piotr S., advisor; Cooley, Dan, committee member; Wang, Haonan, committee member; Miao, Hong, committee member
    In scalar time series analysis, a long-range dependent (LRD) series cannot be easily distinguished from certain non-stationary models, such as the change in mean model with short-range dependent (SRD) errors. To be specific, realizations of LRD series usually exhibit a changing local mean if the time span taken into account is long enough, which resembles the behavior of change in mean models. Test procedures for distinguishing between these two types of models have been studied extensively in the scalar case; see, e.g., Berkes et al. (2006), Baek and Pipiras (2012), and references therein. However, no analogous test for functional observations has been developed yet, partly because of the lack of methods and theory for analyzing functional time series with long-range dependence. My dissertation establishes a procedure for testing change in mean models with SRD errors against LRD processes in the functional case, which is an extension of the method of Baek and Pipiras (2012). The test builds on the local Whittle (LW) (or Gaussian semiparametric) estimation of the self-similarity parameter, which is based on the estimated level 1 scores of a suitable functional residual process. Remarkably, unlike parametric methods such as Whittle estimation, whose asymptotic properties heavily depend on the validity of the underlying spectral density on the full frequency range (−π, π], LW estimation imposes mild restrictions on the spectral density only near the origin and is thus more robust to model misspecification. We prove that the test statistic based on LW estimation is asymptotically normally distributed under the null hypothesis and that it diverges to infinity under the LRD alternative.
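    The local Whittle estimation mentioned in this abstract can be sketched for a plain scalar series. The sketch below assumes the standard form of the objective, written in terms of the memory parameter d (commonly related to the self-similarity parameter by H = d + 1/2), and minimizes it by a simple grid search over the first m Fourier frequencies. The function names, the bandwidth m, the grid, and the simulated input are hypothetical; the dissertation applies the estimator to estimated scores of a functional residual process, which is not reproduced here.

```python
import cmath
import math
import random


def periodogram(x, m):
    """Periodogram of x at the first m Fourier frequencies 2*pi*j/n, j = 1..m."""
    n = len(x)
    mean = sum(x) / n
    freqs, values = [], []
    for j in range(1, m + 1):
        lam = 2.0 * math.pi * j / n
        dft = sum((x[t] - mean) * cmath.exp(-1j * lam * t) for t in range(n))
        freqs.append(lam)
        values.append(abs(dft) ** 2 / (2.0 * math.pi * n))
    return freqs, values


def local_whittle_objective(d, freqs, values):
    """R(d) = log(mean(lam^(2d) * I(lam))) - 2d * mean(log lam)."""
    m = len(freqs)
    scale = sum(lam ** (2.0 * d) * i_val for lam, i_val in zip(freqs, values)) / m
    return math.log(scale) - 2.0 * d * sum(math.log(lam) for lam in freqs) / m


def local_whittle_estimate(x, m, grid_step=0.01):
    """Grid-search minimizer of the local Whittle objective over d in about [-0.49, 0.49]."""
    freqs, values = periodogram(x, m)
    grid = [-0.49 + grid_step * k for k in range(int(0.98 / grid_step) + 1)]
    return min(grid, key=lambda d: local_whittle_objective(d, freqs, values))


# Simulated short-memory input (Gaussian white noise), for which d = 0
random.seed(0)
white_noise = [random.gauss(0.0, 1.0) for _ in range(512)]
d_hat = local_whittle_estimate(white_noise, m=40)
print(d_hat)
```

    Only the m lowest frequencies enter the objective, which reflects the point made in the abstract that the method constrains the spectral density near the origin only.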
  • ItemUnknown
    Estimation for Lévy-driven CARMA processes
    (Colorado State University. Libraries, 2008) Yang, Yu, author; Brockwell, Peter J., advisor; Davis, Richard A., advisor
    This thesis explores parameter estimation for Lévy-driven continuous-time autoregressive moving average (CARMA) processes, using uniformly and closely spaced discrete-time observations. Specifically, we focus on developing estimation techniques and asymptotic properties of the estimators for three particular families of Lévy-driven CARMA processes. Estimation for the first family, Gaussian autoregressive processes, was developed by deriving exact conditional maximum likelihood estimators of the parameters under the assumption that the process is observed continuously. The resulting estimates are expressed in terms of stochastic integrals which are then approximated using the available closely-spaced discrete-time observations. We apply the results to both linear and non-linear autoregressive processes. For the second family, non-negative Lévy-driven Ornstein-Uhlenbeck processes, we take advantage of the non-negativity of the increments of the driving Lévy process to derive a highly efficient estimation procedure for the autoregressive coefficient when observations are available at uniformly spaced times. Asymptotic properties of the estimator are also studied and a procedure for obtaining estimates of the increments of the driving Lévy process is developed. These estimated increments are important for identifying the nature of the driving Lévy process and for estimating its parameters. For the third family, non-negative Lévy-driven CARMA processes, we estimate the coefficients by maximizing the Gaussian likelihood of the observations and discuss the asymptotic properties of the estimators. We again show how to estimate the increments of the background driving Lévy process and hence to estimate the parameters of the Lévy process itself. We assess the performance of our estimation procedures by simulations and use them to fit models to real data sets in order to determine how the theory applies in practice.
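    The way non-negativity of the driving increments can be exploited for the second family is illustrated below. The abstract does not spell out the estimator, so this sketch assumes a minimum-ratio estimator: a uniformly sampled non-negative Ornstein-Uhlenbeck process satisfies Y_n = phi * Y_(n-1) + Z_n with Z_n >= 0, so every ratio Y_n / Y_(n-1) is an upper bound for phi and the smallest ratio is a natural estimate. The simulation with exponential innovations and all names are hypothetical.

```python
import random


def simulate_nonnegative_ar1(phi, n, seed=0):
    """Simulate Y_n = phi * Y_(n-1) + Z_n with non-negative (exponential) innovations Z_n."""
    rng = random.Random(seed)
    y = [rng.expovariate(1.0) / (1.0 - phi)]  # positive starting value
    for _ in range(n - 1):
        y.append(phi * y[-1] + rng.expovariate(1.0))
    return y


def min_ratio_estimate(y):
    """Smallest ratio of consecutive observations; never below phi when Z_n >= 0."""
    return min(curr / prev for prev, curr in zip(y, y[1:]))


true_phi = 0.6
series = simulate_nonnegative_ar1(true_phi, n=2000)
phi_hat = min_ratio_estimate(series)
print(phi_hat)
```

    For observations spaced h apart, phi corresponds to exp(-a * h) for an Ornstein-Uhlenbeck rate a, so an estimate of a would follow as -log(phi_hat) / h. The estimated innovations Y_n - phi_hat * Y_(n-1) then play the role of the estimated increments of the driving process mentioned in the abstract.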
  • ItemUnknown
    Spatial models with applications in computer experiments
    (Colorado State University. Libraries, 2008) Wang, Ke, author; Davis, Richard A., advisor; Breidt, F. Jay, advisor
    Often, a deterministic computer response is modeled as a realization from a stochastic process such as a Gaussian random field. Because a stationary Gaussian process (GP) is limited in its ability to capture inhomogeneous smoothness, we consider modeling a deterministic computer response as a realization from a stochastic heteroskedastic process (SHP), a stationary non-Gaussian process. Conditional on a latent process, the SHP has a non-stationary covariance function and is a non-stationary GP. As such, the sample paths of this process exhibit greater variability and hence offer more modeling flexibility than those produced by a traditional GP model. We use maximum likelihood for inference in the SHP model, which is complicated by the high dimensionality of the latent process. Accordingly, we develop an importance sampling method for likelihood computation and use a low-rank kriging approximation to reconstruct the latent process. Responses at unobserved locations can be predicted using empirical best predictors or by empirical best linear unbiased predictors. In addition, prediction error variances are obtained. The SHP model can be used in an active learning context, adaptively selecting new locations that provide improved estimates of the response surface. Estimation, prediction, and adaptive sampling with the SHP model are illustrated with several examples. Our spatial model can be adapted to model the first partial derivative process. The derivative process provides additional information about the shape and smoothness of the underlying deterministic function and can assist in the prediction of responses at unobserved sites. The unconditional correlation function for the derivative process presents some interesting properties, and can be used as a new class of spatial correlation functions. For parameter estimation, we propose to use a similar strategy to develop an importance sampling technique to compute the joint likelihood of responses and derivatives. 
The major difficulties of bringing in derivative information are the increase in the dimensionality of the latent process and the numerical problems of inverting the enlarged covariance matrix. Some possible ways to utilize this information more efficiently are proposed.
  • ItemUnknown
    Data mining techniques for temporal point processes applied to insurance claims data
    (Colorado State University. Libraries, 2008) Iverson, Todd Ashley, author; Ben-Hur, Asa, advisor; Iyer, Hariharan K., advisor
    We explore data mining on databases consisting of insurance claims information. This dissertation focuses on two major topics we considered by way of data mining procedures. One is the development of a classification rule using kernels and support vector machines. The other is the discovery of association rules using the Apriori algorithm, its extensions, as well as a new association rules technique. With regard to the first topic, we address the question: can kernel methods using an SVM classifier be used to predict patients at risk of type 2 diabetes using three years of insurance claims data? We report the results of a study in which we tested the performance of new methods for data extracted from the MarketScan® database. We summarize the results of applying popular kernels, as well as new kernels constructed specifically for this task, for support vector machines on data derived from this database. We were able to predict patients at risk of type 2 diabetes with nearly 80% success when combining a number of specialized kernels. The specific form of the data, that of a timed sequence, led us to develop two new kernels inspired by dynamic time warping. The Global Time Warping (GTW) and Local Time Warping (LTW) kernels build on an existing time warping kernel by including the timing coefficients present in classical time warping, while providing a solution for the diagonal dominance present in most alignment methods. We show that the LTW kernel performs significantly better than the existing time warping kernel when the times contained relevant information. With regard to the second topic, we provide a new theorem on closed rules that could help substantially improve the time to find a specific type of rule. An insurance claims database contains codes indicating associated diagnoses and the resulting procedures for each claim. The rules that we consider are of the form diagnoses imply procedures.
In addition, we introduce a new class of interesting association rules in the context of medical claims databases and illustrate their potential uses by extracting example rules from the MarketScan® database.
  • ItemUnknown
    Spatial processes with stochastic heteroscedasticity
    (Colorado State University. Libraries, 2008) Huang, Wenying, author; Breidt, F. Jay, advisor; Davis, Richard A., advisor
    Stationary Gaussian processes are widely used in spatial data modeling and analysis. Stationarity is a relatively restrictive assumption regarding spatial association. By introducing stochastic volatility into a Gaussian process, we propose a stochastic heteroscedastic process (SHP) with conditional nonstationarity. That is, conditional on a latent Gaussian process, the SHP is a Gaussian process with non-stationary covariance structure. Unconditionally, the SHP is a stationary non-Gaussian process. The realizations from SHP are versatile and can represent spatial inhomogeneities. The unconditional correlation of SHP offers a rich class of correlation functions which can also allow for a smoothed nugget effect. For maximum likelihood estimation, we propose to apply importance sampling in the likelihood calculation and latent process estimation. The importance density we constructed is of the same dimensionality as the observations. When the sample size is large, the importance sampling scheme becomes infeasible and/or inaccurate. A low-dimensional approximation model is developed to solve the numerical difficulties. We develop two spatial prediction methods: PBP (plug-in best predictor) and PBLUP (plug-in best linear unbiased predictor). Empirical results with simulated and real data show improved out-of-sample prediction performance of SHP modeling over stationary Gaussian process modeling. We extend the single-realization model to an SHP model with replicates. The spatial replications are modeled as independent realizations from an SHP model conditional on a common latent process. A simulation study shows substantial improvements in parameter estimation and process prediction when replicates are available. In an example with real atmospheric deposition data, the SHP model with replicates outperforms the Gaussian process model in prediction by capturing the spatial volatilities.
  • ItemUnknown
    Estimation of structural breaks in nonstationary time series
    (Colorado State University. Libraries, 2008) Hancock, Stacey, author; Davis, Richard A., advisor; Iyer, Hari K., advisor
    Many time series exhibit structural breaks in a variety of ways, the most obvious being a mean level shift. In this case, the mean level of the process is constant over periods of time, jumping to different levels at times called change-points. These jumps may be due to outside influences such as changes in government policy or manufacturing regulations. Structural breaks may also be a result of changes in variability or changes in the spectrum of the process. The goal of this research is to estimate where these structural breaks occur and to provide a model for the data within each stationary segment. The program Auto-PARM (Automatic Piecewise AutoRegressive Modeling procedure), developed by Davis, Lee, and Rodriguez-Yam (2006), uses the minimum description length principle to estimate the number and locations of change-points in a time series by fitting autoregressive models to each segment. The research in this dissertation shows that when the true underlying model is segmented autoregressive, the estimates obtained by Auto-PARM are consistent. Under a more general time series model exhibiting structural breaks, Auto-PARM's estimates of the number and locations of change-points are again consistent, and the segmented autoregressive model provides a useful approximation to the true process. Weak consistency proofs are given, as well as simulation results when the true process is not autoregressive. An example of the application of Auto-PARM as well as a source of inspiration for this research is the analysis of National Park Service sound data. This data was collected by the National Park Service over four years in around twenty of the National Parks by setting recording devices in several sites throughout the parks. The goal of the project is to estimate the amount of manmade sound in the National Parks. 
Though the project is in its initial stages, Auto-PARM provides a promising method for analyzing sound data by breaking the sound waves into pseudo-stationary pieces. Once the sound data have been broken into pieces, a classification technique can be applied to determine the type of sound in each segment.
  • ItemUnknown
    Confidence regions for level curves and a limit theorem for the maxima of Gaussian random fields
    (Colorado State University. Libraries, 2009) French, Joshua, author; Davis, Richard A., advisor
    One of the most common display tools used to represent spatial data is the contour plot. Informally, a contour plot is created by taking a "slice" of a three-dimensional surface at a certain level of the response variable and projecting the slice onto the two-dimensional coordinate-plane. The "slice" at each level is known as a level curve.
  • ItemUnknown
    Applications of generalized fiducial inference
    (Colorado State University. Libraries, 2009) E, Lidong, author; Iyer, Hariharan K., advisor
    Hannig (2008) generalized Fisher's fiducial argument and obtained a fiducial recipe for interval estimation that is applicable in virtually any situation. In this dissertation research, we apply this fiducial recipe and fiducial generalized pivotal quantity to make inference in four practical problems. The list of problems we consider is (a) confidence intervals for variance components in an unbalanced two-component normal mixed linear model; (b) confidence intervals for median lethal dose (LD50) in bioassay experiments; (c) confidence intervals for the concordance correlation coefficient (CCC) in method comparison; (d) simultaneous confidence intervals for ratios of means of Lognormal distributions. For all the fiducial generalized confidence intervals (a)-(d), we conducted a simulation study to evaluate their performance and compare them with other competing confidence interval procedures from the literature. We also proved that the intervals (a) and (d) have asymptotically exact frequentist coverage.
  • ItemOpen Access
    Penalized estimation for sample surveys in the presence of auxiliary variables
    (Colorado State University. Libraries, 2008) Delorey, Mark J., author; Breidt, F. Jay, advisor
    In conducting sample surveys, time and financial resources can be limited but research questions are wide and varied. Thus, methods for analysis must make the best use of whatever data are available and produce results that address a variety of needs. Motivation for this research comes from surveys of aquatic resources, in which sample sizes are small to moderate, but auxiliary information is available to supplement measured survey responses. The problems of survey estimation are considered, tied together in their use of constrained/penalized estimation techniques for combining information from the auxiliary information and the responses of interest. We study a small area problem with the goal of obtaining a good ensemble estimate, that is, a collection of estimates for individual small areas that collectively give a good estimate of the overall distribution function across small areas. Often, estimators that are good for one purpose may not be good for others. For example, estimation of the distribution function itself (as in Cordy and Thomas, 1997) can address questions of variability and extremes but does not provide individual estimators of the small areas, nor is it appropriate when auxiliary information can be put to use. Bayes estimators are good individual estimators in terms of mean squared error but are not variable enough to represent ensemble traits (Ghosh, 1992). An algorithm that extends the constrained Bayes (CB) methods of Louis (1984) and Ghosh (1992) for use in a model with a general covariance matrix is presented. This algorithm produces estimators with properties similar to those of CB, and we refer to this method as general constrained Bayes (GCB). The ensemble GCB estimator is asymptotically unbiased for the posterior mean of the empirical distribution function (edf).
The ensemble properties of transformed GCB estimates are investigated to determine if the desirable ensemble characteristics displayed by the GCB estimator are preserved under such transformations. The GCB algorithm is then applied to complex models such as conditional autoregressive spatial models and to penalized spline models. Illustrative examples include the estimation of lip cancer risk, mean water acidity, and rates of change in water acidity. We also study a moderate area problem in which the goal is to derive a set of survey weights that can be applied to each study variable with reasonable predictive results. Zheng and Little (2003) use penalized spline regression in a model-based approach for finite population estimation in a two-stage sample when predictor variables are available. Breidt et al. (2005) propose a class of model-assisted estimators based on penalized spline regression in single stage sampling. Because unbiasedness of the model-based estimator requires that the model be correctly specified, we look at extending model-assisted estimation to the two-stage case. By calibrating the degrees of freedom of the smooth to the most important study variables, a set of weights can be obtained that produce design consistent estimators for all study variables. The model-assisted estimator is compared to other estimators in a simulation study. Results from the simulation study show that the model-assisted estimator is comparable to other estimators when the model is correctly specified and generally superior when the model is incorrectly specified.
  • ItemOpen Access
    State-space models for stream networks
    (Colorado State University. Libraries, 2007) Coar, William J., author; Breidt, F. Jay, advisor
    The natural branching that occurs in a stream network, in which two upstream reaches merge to create a new downstream reach, generates a tree structure. Furthermore, because of the natural flow of water in a stream network, characteristics of a downstream reach may depend on characteristics of upstream reaches. Since the flow of water from reach to reach provides a natural time-like ordering throughout the stream network, we propose a state-space model to describe the spatial dependence in this tree-like structure with ordering based on flow. Developing a state-space formulation permits the use of the well-known Kalman recursions. Variations of the Kalman Filter and Smoother are derived for the tree-structured state-space model, which allows recursive estimation of unobserved states and prediction of missing observations on the network, as well as computation of the Gaussian likelihood, even when the data are incomplete. To reduce the computational burden that may be associated with optimization of this exact likelihood, a version of the expectation-maximization (EM) algorithm is presented that uses the Kalman Smoother to fill in missing values in the E-step, and maximizes the Gaussian likelihood for the completed dataset in the M-step. Several forms of dependence for discrete processes on a stream network are considered, such as network analogues of the autoregressive-moving average model and stochastic trend models. Network parallels for first and second differences in time-series are defined, which allow for definition of a spline smoother on a stream network through a special case of a local linear trend model. We have taken the approach of modeling a discrete process, which we see as a building block to more appropriate yet more complicated models. Adaptation of this state-space model and Kalman prediction equations to allow for more complicated forms of spatial and perhaps temporal dependence is a potential area of future research.
Other possible directions for future research are non-Gaussian and nonlinear error structures, model selection, and properties of estimators.
  • ItemOpen Access
    Statistical modeling with COGARCH(p,q) processes
    (Colorado State University. Libraries, 2009) Chadraa, Erdenebaatar, author; Brockwell, Peter J., advisor
    In this paper, a family of continuous-time GARCH processes, generalizing the COGARCH(1, 1) process of Klüppelberg et al. (2004), is introduced and studied. The resulting COGARCH(p,q) processes, q ≥ p ≥ 1, exhibit many of the characteristic features of observed financial time series, while their corresponding volatility and squared increment processes display a broader range of autocorrelation structures than those of the COGARCH(1, 1) process. We establish sufficient conditions for the existence of a strictly stationary non-negative solution of the equations for the volatility process and, under conditions which ensure the finiteness of the required moments, determine the autocorrelation functions of both the volatility and squared increment processes. The volatility process is found to have the autocorrelation function of a continuous-time ARMA process while the squared increment process has the autocorrelation function of an ARMA process.
  • ItemOpen Access
    Model selection based on expected squared Hellinger distance
    (Colorado State University. Libraries, 2007) Cao, Xiaofan, author; Iyer, Hariharan K., advisor; Wang, Haonan, advisor
    This dissertation is motivated by a general model selection problem in which the true model is unknown and one or more approximating parametric families of models are given along with strategies for estimating the parameters using data. We develop model selection methods based on Hellinger distance that can be applied to a wide range of modeling problems without imposing the typical assumption that the true model lies within the approximating families or comes from a particular parametric family. We propose two estimators for the expected squared Hellinger distance as the model selection criteria.