Geostatistical models: model selection and parameter estimation under infill and expanding domain asymptotics
Abstract
The research presented in this dissertation was originally motivated by the application of spatial models to the field of ecology. Often ecologists are interested in either (1) identifying significant relationships between the response of interest and candidate explanatory variables or (2) generating maps of the mean response and making predictions at unobserved locations. In the former case the scientist is trying to identify significant explanatory variables and understand the underlying relationship(s) with the response. For the latter case the scientist desires to make inference about the model parameters and/or produce predictions at unobserved locations and quantify the variability of these predictors. General linear models have proven to be very effective at addressing these two problems. However, over the past 10 to 20 years with the advent of global positioning systems (GPS), satellite imagery, etc., the ease of obtaining geo-referenced data has increased several fold. As a consequence scientists are now eager to incorporate spatial dependency into the models. A statistical model for a continuous process with geo-referenced data, henceforth referred to as a geostatistical model, provides a powerful means to investigate both motivating questions by accounting for potential spatial relationships. Coupled with any modeling exercise is the ability to compare competing models. This is especially important when trying to identify significant explanatory variables. A common anecdotal observation in the ecological sciences is that "neighbors at close proximity tend to be more similar than neighbors separated by large distances." A strength of the geostatistical model is its ability to prescribe just such a relationship. Although the importance of accounting for spatial correlation has been discussed in other contexts (Cressie, 1993), the effect of spatial correlation on model selection has not been fully explored. 
We begin in Chapter 2 by developing the AIC statistic for geostatistical models. Roughly speaking, AIC is a measure of the loss of information incurred by fitting an incorrect model to the data. The AIC statistic can be broken down into two components: the first component is a measure of the quality-of-fit and is a function of the likelihood function and the second component is a penalty factor that increases with increased model complexity. Evaluating the likelihood function for geostatistical models is computationally expensive because, in general, there does not exist a closed form for the parameter estimates. Thus optimization requires systematic searching of the parameter space to identify the maximum likelihood estimates. This becomes more and more taxing with increased sample size and/or model complexity. Traditionally in geostatistical modeling, the AIC statistic is used to identify the best subset of explanatory variables assuming independent residuals. Having selected a subset of the explanatory variables, one proceeds to investigate the nature of the correlation structure of the model residuals. If the independence assumption appears to be met the researcher is done. If there appears to be correlation among the residuals, a suitable family is chosen to model the covariance function, the parameters of the trend surface are updated, followed by an updating of the covariance parameters. This process proceeds iteratively until some convergence criterion is met. We refer to this procedure as independent AIC. A deficiency associated with independent AIC is that the importance of one or more explanatory variables may be masked by the covariance structure. Indeed, the presence of one or more additional explanatory variables may reduce or eliminate the presence of correlation in the residuals. Thus we suggest that (possible) correlation in the error process must be incorporated into the model selection process. 
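The contrast between independent AIC and an AIC that carries the covariance structure through model selection can be sketched in a few lines. This is a minimal illustration, not the dissertation's derivation: it uses the textbook form AIC = -2 log L + 2k, assumes an exponential correlation exp(-d/phi) with no nugget, and replaces a proper optimizer with a grid search over the range parameter. The names `pairwise_dist`, `gaussian_aic`, and `range_grid` are hypothetical, and the data are simulated.

```python
import numpy as np

def pairwise_dist(coords):
    """Euclidean distance matrix between all pairs of sampling locations."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def gaussian_aic(y, X, coords, range_grid=None):
    """AIC for y ~ N(X beta, sigma2 * R).

    range_grid=None  -> independent errors (R = I), i.e. "independent AIC".
    range_grid=array -> R = exp(-d / phi), with phi profiled over the grid
                        (an assumed stand-in for a full likelihood search).
    """
    n, p = X.shape
    d = pairwise_dist(coords)
    candidates = [None] if range_grid is None else list(range_grid)
    best_loglik = -np.inf
    for phi in candidates:
        R = np.eye(n) if phi is None else np.exp(-d / phi)
        L = np.linalg.cholesky(R + 1e-10 * np.eye(n))  # small jitter for stability
        # Whiten with the Cholesky factor, then solve ordinary least squares (GLS).
        Xw, yw = np.linalg.solve(L, X), np.linalg.solve(L, y)
        beta, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
        resid = yw - Xw @ beta
        sigma2 = resid @ resid / n                      # profiled variance MLE
        logdet = 2.0 * np.sum(np.log(np.diag(L)))       # log|R|
        loglik = -0.5 * (n * np.log(2 * np.pi * sigma2) + logdet + n)
        best_loglik = max(best_loglik, loglik)
    # Parameters counted: p trend coefficients, sigma2, and phi when spatial.
    k = p + 1 + (0 if range_grid is None else 1)
    return -2.0 * best_loglik + 2.0 * k

# Simulated example: one real covariate plus spatially correlated errors.
rng = np.random.default_rng(0)
n = 60
coords = rng.uniform(size=(n, 2))
x1 = rng.standard_normal(n)
X = np.column_stack([np.ones(n), x1])
R_true = np.exp(-pairwise_dist(coords) / 0.3)
errors = np.linalg.cholesky(R_true) @ rng.standard_normal(n)
y = 1.0 + 2.0 * x1 + errors

grid = np.linspace(0.05, 1.0, 20)
aic_indep = gaussian_aic(y, X, coords)                      # ignores correlation
aic_spatial = gaussian_aic(y, X, coords, range_grid=grid)   # models correlation
```

Calling `gaussian_aic` on each candidate subset of columns of `X`, with and without `range_grid`, gives the two rankings whose disagreement is the subject of the simulations.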
Through a series of simulations we demonstrate that inclusion of spatial dependence during model selection can greatly improve the probability of identifying the correct model. We also demonstrate that sampling pattern and signal-to-noise ratio impact model selection, where we define the signal-to-noise ratio as the ratio of the variability of the mean structure (large scale variability) to the variability of the noise process (small scale variability). Performance comparisons are made between independent and spatial AIC as well as a non-information-based procedure, minimum description length (MDL). MDL has the distinct advantage that the researcher need not assume that there exists a true model. Instead, MDL attempts to minimize the amount of "storage space" required to adequately describe the data set. Like information-based criteria, it has two components: a quality-of-fit term and a penalty term that increases with model complexity. We follow with two examples that implement spatial AIC and illustrate the flexibility of the method. The first example attempts to identify the best subset of explanatory variables for species abundance data while assuming the noise process is Matérn. We then construct a simulation study using the selected model to compare the performance of spatial AIC to independent AIC. A second example examines water chemistry response variables collected along a stream network in Maryland. An important complication is that the distance between observation locations can now be defined in more than one way: as Euclidean distance or as hydrological distance (restricting movement to "within" the network). Hydrological distance can be further categorized as either symmetric or asymmetric depending on whether or not one accounts for the direction of (water) flow. Ver Hoef et al. (2006) demonstrate that the exponential function, among others, can be used for each of these distance measures, although a carefully constructed weight matrix is required for asymmetric hydrological distance. Model selection proceeds using spatial AIC, and the selected models (one for each distance measure) are then compared by evaluating the mean square prediction error (MSPE) for a randomly selected subset of the data that were withheld from model selection/fitting.
The derivation of the spatial AIC statistic requires standard asymptotic assumptions. For example, we assume that the parameter estimates for the large scale variation parameters and the correlation parameters are asymptotically unbiased and normally distributed with asymptotic covariance equal to the inverse of the Fisher Information. This motivates Chapters 3, 4, and 5, which set out to show, among other things, that the maximum likelihood estimators (MLEs) of the spatial parameters are asymptotically normally distributed. Since the underlying process is assumed to have a continuous domain, collection of additional observations of the response can proceed in one of two ways: collect additional observations within the current domain (infill) or collect new observations outside the current domain (expansion of the domain). Conceivably the former method of increasing the sample is always available to the scientist, although it may not be practical. However, the latter method is often restricted in the sense that for real applications the domain of interest is finite and hence one cannot expand the domain indefinitely. In either case one can approximate the distribution of the finite-sample MLEs with random variables generated over infinite domains with the sampling distributions per unit length. Thus we show that the standard asymptotic assumptions required for the derivation of spatial AIC hold with respect to expansion of the domain, with and without infill, for the exponential correlation function in one dimension. Simulation results indicate that the same can be said for both the exponential and the Matérn class of correlation functions in two dimensions.
We begin in Chapter 3 by considering the Ornstein-Uhlenbeck process, the continuous analogue of the discrete first-order autoregressive Gaussian process, AR(1), in one dimension. This process is characterized by the range parameter θ, which describes the strength of the correlation between two locations as a function of the distance between them. We develop the asymptotic distribution for (1) the MLE of θ when the observations are equally spaced, (2) a weighted least squares estimate for θ when observations are randomly spaced, and (3) the MLE of θ for randomly spaced observations. For each we provide simulation results that corroborate the theoretical results. Chapter 4 proceeds to the two-dimensional case, where we explore the asymptotic properties of the MLE of θ for the exponential correlation function through simulation. Chapter 5 investigates the asymptotic behavior of the MLE θ = (θ1, θ2)' for the Matérn class of correlation functions in both one and two dimensions. Coupled with each of these analyses is the concept of sampling design, by which we mean the prescribed manner in which sampling locations are selected. We demonstrate that infill and expansion of the domain both impact the distribution of the MLE and that, for the exponential case, there appears to exist an optimal sampling design for a fixed domain and sampling effort (where sampling effort refers to the total number of observations).
The development of spatial AIC, along with the (partial) verification of the underlying asymptotic assumptions, provides researchers with a powerful tool for conducting model selection and model fitting in the geostatistical framework. Applications are numerous and span diverse fields of study, including the ecological, geological, atmospheric, and oceanographic sciences.
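To make the Chapter 3 setting concrete, the following is a minimal sketch of maximum likelihood estimation of θ for an Ornstein-Uhlenbeck process observed at equally spaced locations. Several choices here are assumptions made for illustration and are not taken from the dissertation: the correlation is parameterized as exp(-d/θ), the mean is zero and the variance is fixed at one, and the likelihood is maximized by a simple grid search. The function names `simulate_ou`, `ou_loglik`, and `ou_mle` are hypothetical.

```python
import numpy as np

def simulate_ou(n, delta, theta, rng):
    """Simulate n equally spaced observations (spacing delta) of a zero-mean,
    unit-variance OU process with correlation exp(-d / theta)."""
    rho = np.exp(-delta / theta)          # lag-one correlation, as in AR(1)
    z = np.empty(n)
    z[0] = rng.standard_normal()
    for i in range(1, n):
        z[i] = rho * z[i - 1] + np.sqrt(1.0 - rho**2) * rng.standard_normal()
    return z

def ou_loglik(theta, z, delta):
    """Exact Gaussian log-likelihood, using the Markov (AR(1)) factorization."""
    n = len(z)
    rho = np.exp(-delta / theta)
    innov = z[1:] - rho * z[:-1]          # one-step prediction errors
    quad = z[0] ** 2 + innov @ innov / (1.0 - rho**2)
    return -0.5 * (n * np.log(2 * np.pi) + (n - 1) * np.log(1.0 - rho**2) + quad)

def ou_mle(z, delta, theta_grid):
    """Grid-search MLE of theta (a stand-in for a numerical optimizer)."""
    ll = np.array([ou_loglik(t, z, delta) for t in theta_grid])
    return theta_grid[np.argmax(ll)]

ou_rng = np.random.default_rng(1)
theta_true = 0.5
theta_grid = np.linspace(0.02, 5.0, 2000)

# Expanding domain: spacing fixed at 0.1, so 400 points cover a transect of length 40.
z_expand = simulate_ou(400, 0.1, theta_true, ou_rng)
theta_hat_expand = ou_mle(z_expand, 0.1, theta_grid)

# Infill: the same 400 points packed into a transect of length 4.
z_infill = simulate_ou(400, 0.01, theta_true, ou_rng)
theta_hat_infill = ou_mle(z_infill, 0.01, theta_grid)
```

Repeating both designs over many simulated data sets and comparing the spread of the two estimates is one way to see how infill and expansion of the domain affect the sampling distribution of the MLE for a fixed sampling effort.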
Subject
geology
statistics
