Browsing by Author "Breidt, F. Jay, committee member"
Now showing 1 - 20 of 30
Item Open Access: A penalized estimation procedure for varying coefficient models (Colorado State University. Libraries, 2015)
Tu, Yan, author; Wang, Haonan, advisor; Breidt, F. Jay, committee member; Chapman, Phillip, committee member; Luo, J. Rockey, committee member
Varying coefficient models are widely used for analyzing longitudinal data. Various methods for estimating coefficient functions have been developed over the years. We revisit the problem under the theme of functional sparsity. The problem of sparsity, including global sparsity and local sparsity, is a recurrent topic in nonparametric function estimation. A function has global sparsity if it is zero over the entire domain, indicating that the corresponding covariate is irrelevant to the response variable. A function has local sparsity if it is nonzero overall but remains zero over a set of intervals, identifying inactive periods of the corresponding covariate. Each type of sparsity has been addressed in the literature using regularization to improve estimation as well as interpretability. In this dissertation, a penalized estimation procedure is developed to achieve functional sparsity, that is, to address both types of sparsity simultaneously in a unified framework. We exploit the properties of B-spline approximation and group bridge penalization. Our method is illustrated in simulation studies and real data analysis, and it outperforms existing methods in identifying both local and global sparsity. Asymptotic properties of estimation consistency and sparsistency of the proposed method are established; the term sparsistency refers to the property that the functional sparsity can be consistently detected.

Item Open Access: A posteriori error estimates for the Poisson problem on closed, two-dimensional surfaces (Colorado State University. Libraries, 2011)
Newton, William F., author; Estep, Donald J., 1959-, advisor; Holst, Michael J., committee member; Tavener, Simon, committee member; Zhou, Yongcheng, committee member; Breidt, F. Jay, committee member
The solution of partial differential equations on non-Euclidean domains has been an active research area in recent years. The Poisson problem is a partial differential equation that is useful on curved surfaces. On a curved surface, the Poisson problem features the Laplace-Beltrami operator, a generalization of the Laplacian specific to the surface on which the problem is posed. A finite element method for solving the Poisson problem on a closed surface has been described and shown to converge with order h². Here, we review this finite element method and the background material necessary for defining it. We then construct an adjoint-based a posteriori error estimate for the problem, discuss some computational issues that arise in solving the problem, and show some numerical examples. The major sources of numerical error when solving the Poisson problem are geometric error, discretization error, quadrature error, and measurement error. Geometric error occurs when distances, areas, and angles are distorted by using a flat domain to parametrize a curved one. Discretization error results from using a finite-dimensional space of functions to approximate an infinite-dimensional space. Quadrature error arises when numerical quadrature is used to evaluate integrals required by the finite element method. Measurement error arises from error and uncertainty in our knowledge of the surface itself. We are able to estimate the amount of each of these types of error and show when each will be significant.

Item Open Access: Absolute and relative chronology of a complex alpine game drive site (5BL148), Rollins Pass, Colorado (Colorado State University. Libraries, 2019)
Meyer, Kelton A., author; LaBelle, Jason M., advisor; Glantz, Michelle M., committee member; Breidt, F. Jay, committee member
Native American alpine game drive sites are recognized along major mountain travel corridors in Colorado's Southern Rockies. The Rollins Pass project area, located east of Winter Park, represents the densest concentration of alpine game drive sites in North America. Game drives at Rollins Pass vary in the size, frequency, and diversity of features and artifacts, but also in landform context. Past game drive research at Rollins Pass and elsewhere in the Colorado Front Range demonstrates that hunter-gatherer groups reoccupied some sites for centuries and even millennia, creating an amalgamation of material culture over time. However, chronological reconstructions in alpine environments are limited by poor preservation, a lack of stratigraphy, and the ephemeral nature of hunter-gatherer occupations at high altitudes. This thesis presents an investigation of the largest game drive at Rollins Pass, 5BL148, with a focus on chronology reconstruction. A relative occupation span is established through an analysis of chipped stone tools and jewelry. Lichenometry is used to date lichen colonization events on stone walls, and radiocarbon dates on faunal remains and charcoal serve as absolute chronological measures. A spatial analysis of the artifact and feature assemblage is further used to identify evidence for distinct or temporally overlapping occupation episodes. The results indicate that 5BL148 represents a palimpsest of hunter-gatherer occupations, beginning in the Early Archaic era and ending in the Protohistoric era.

Item Open Access: Analysis of structured data and big data with application to neuroscience (Colorado State University. Libraries, 2015)
Sienkiewicz, Ela, author; Wang, Haonan, advisor; Meyer, Mary, committee member; Breidt, F. Jay, committee member; Hayne, Stephen, committee member
Neuroscience research leads to a remarkable set of statistical challenges, many of them due to the complexity of the brain, its intricate structure, and its dynamical, non-linear, often non-stationary behavior. The challenge of modeling brain functions is magnified by the quantity and inhomogeneity of data produced by scientific studies. Here we show how to take advantage of advances in distributed and parallel computing to mitigate memory and processor constraints and develop models of neural components and neural dynamics. First we consider the problem of function estimation and selection in time-series functional dynamical models. Our motivating application is point-process spiking activity recorded from the brain, which poses major computational challenges for modeling even moderately complex brain functionality. We present a big data approach to the identification of sparse nonlinear dynamical systems using generalized Volterra kernels and their approximation by B-spline basis functions. The performance of the proposed method is demonstrated in experimental studies. We also consider a set of unlabeled tree objects with topological and geometric properties. For each data object, two curve representations are developed to characterize its topological and geometric aspects, and we define the notions of topological and geometric medians as well as quantiles based on both representations. In addition, we take a novel approach to defining Pareto medians and quantiles through a multi-objective optimization problem; in particular, we study two objective functions that measure topological variation and geometric variation, respectively. Analytical solutions are provided for topological and geometric medians and quantiles; for Pareto medians and quantiles in general, a genetic algorithm is implemented.
The proposed methods are applied to analyze a data set of pyramidal neurons.

Item Open Access: Automatic parallelization of "inherently" sequential nested loop programs (Colorado State University. Libraries, 2011)
Zou, Yun, author; Rajopadhye, Sanjay, advisor; Strout, Michelle, committee member; Bohm, A. P. Willem, committee member; Breidt, F. Jay, committee member
Most automatic parallelizers are based on the detection of independent operations, and most of them can do nothing when there is a true dependence between operations. However, there exists a class of programs for which this can be surmounted based on the nature of the operations. The standard and obvious cases are reductions and scans, which normally occur within loops. Existing work that deals with complicated reductions and scans normally focuses on the formalism, not the implementation. To help close the gap between formalism and implementation, we present a method for automatically parallelizing such "inherently" sequential programs. Our method is based on exact dependence analysis in the polyhedral model, and we formulate the problem as detecting that the loop body performs a computation equivalent to a matrix multiplication over a semiring. The method handles a single loop as well as arbitrarily nested loops, and it deals with mutually dependent variables in the loop. Our scan detection is implemented in a polyhedral program transformation and code generation system (AlphaZ) and used to generate OpenMP code. We also present optimization strategies that improve the performance of the generated code. Experiments demonstrate the scalability of programs parallelized by our implementation.

Item Open Access: Change-Point estimation using shape-restricted regression splines (Colorado State University. Libraries, 2016)
Liao, Xiyue, author; Meyer, Mary C., advisor; Breidt, F. Jay, committee member; Homrighausen, Darren, committee member; Belfiori, Elisa, committee member
Change-point estimation is needed in fields such as climate change, signal processing, economics, and dose-response analysis, but it has not yet been fully explored. We consider estimating a regression function f and a change-point m, where m is a mode, an inflection point, or a jump point. Linear inequality constraints are used with spline regression functions to estimate m and f simultaneously using profile methods. For a given m, the maximum-likelihood estimate of f is found using constrained regression methods; the set of possible change-points is then searched to find the m̂ that maximizes the likelihood. Convergence rates are obtained for each type of change-point estimator, and we show an oracle property: the convergence rate of the regression function estimator is as if m were known. Parametrically modeled covariates are easily incorporated in the model. Simulations show that for small and moderate sample sizes, these methods compare well to existing methods. The scenario in which the random error comes from a stationary autoregressive process is also presented. Under this scenario, the change-point and the parameters of the autoregressive process, such as the autoregressive coefficients and the model variance, are estimated together via Cochrane-Orcutt-type iterations. Simulations show that the change-point estimator performs well in terms of choosing the right order of the autoregressive process. Penalized spline-based regression is also discussed as an extension: given a large number of knots and a penalty parameter that controls the effective degrees of freedom of a shape-restricted model, penalized methods give smoother fits while balancing under- and over-fitting. A bootstrap confidence interval for a change-point is established.
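The flavor of such a bootstrap interval can be sketched in the simplest possible setting: a single jump in an otherwise constant mean, fit by least squares and resampled with a residual bootstrap. The spline machinery and shape constraints of the dissertation are omitted here; all function names, sizes, and the jump model below are illustrative.

```python
import random

def fit_jump(y):
    """Least-squares fit of a one-jump (piecewise-constant) mean:
    return the change-point index and the two segment means."""
    best = None
    for m in range(1, len(y)):          # jump between y[m-1] and y[m]
        left, right = y[:m], y[m:]
        mu1 = sum(left) / len(left)
        mu2 = sum(right) / len(right)
        rss = sum((v - mu1) ** 2 for v in left) + sum((v - mu2) ** 2 for v in right)
        if best is None or rss < best[0]:
            best = (rss, m, mu1, mu2)
    return best[1], best[2], best[3]

def bootstrap_ci(y, level=0.95, B=200, seed=1):
    """Residual-bootstrap percentile interval for the change-point index."""
    rng = random.Random(seed)
    m, mu1, mu2 = fit_jump(y)
    fitted = [mu1] * m + [mu2] * (len(y) - m)
    resid = [v - f for v, f in zip(y, fitted)]
    draws = sorted(fit_jump([f + rng.choice(resid) for f in fitted])[0]
                   for _ in range(B))
    lo = draws[int((1 - level) / 2 * B)]
    hi = draws[min(B - 1, int((1 + level) / 2 * B))]
    return m, (lo, hi)

# Simulated series with a clear jump at index 30.
rng = random.Random(7)
y = [rng.gauss(0, 0.5) for _ in range(30)] + [rng.gauss(3, 0.5) for _ in range(30)]
m_hat, (lo, hi) = bootstrap_ci(y)
print(m_hat, (lo, hi))
```

With a jump this pronounced, the estimated index and the percentile interval concentrate near the true change-point.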
By generating random change-points from a curve on the unit interval, we compute the coverage rate of the bootstrap confidence interval using penalized estimators; the interval shows advantages over competitors, such as robustness. The methods are available in the R package ShapeChange on the Comprehensive R Archive Network (CRAN). Moreover, we discuss the shape-selection problem that arises when more than one shape is plausible for a given data set. A project with Forest Inventory & Analysis (FIA) scientists is included as an example. In this project, we apply shape-restricted spline-based estimators, among which the one-jump and double-jump estimators are emphasized, to time-series Landsat imagery for the purpose of modeling, mapping, and monitoring annual forest disturbance dynamics. For each pixel and spectral band or index of choice in temporal Landsat data, our method delivers a smoothed rendition of the trajectory constrained to behave in an ecologically sensible manner, reflecting one of seven possible "shapes". Routines realizing the methodology are built into the R package ShapeSelectForest on CRAN, and techniques in this package are being applied for forest disturbance and attribute mapping across the conterminous U.S. The Landsat community will implement techniques from this package on the Google Earth Engine in 2016. Finally, we consider change-point estimation with generalized linear models. Such work can be applied to dose-response analysis, where the effect of a drug increases as the dose increases to a saturation point, after which the effect starts decreasing.

Item Open Access: Characterization of multiple time-varying transient sources from multivariate data sequences (Colorado State University. Libraries, 2014)
Wachowski, Neil, author; Azimi-Sadjadi, Mahmood R., advisor; Breidt, F. Jay, committee member; Fristrup, Kurt, committee member; Pezeshki, Ali, committee member
Characterization of multiple time-varying transient sources using sequential multivariate data is a broad and complex signal processing problem. In general, this process involves analyzing new observation vectors in a data stream of unknown length to determine whether they contain the signatures of a source of interest (i.e., a signal), in which case the source's type and interference-free signatures may be estimated. This process may continue indefinitely to detect and classify several events of interest, thereby yielding an aggregate description of the data's contents. Such capabilities are useful in numerous applications that involve continuously observing an environment containing complicated and erratic signals, e.g., habitat monitoring using acoustical data, medical diagnosis via magnetic resonance imaging, and underwater mine hunting using sonar imagery. The challenges associated with successful transient source characterization are as numerous as the application areas, and include 1) significant variations among signatures emitted by a given source type, 2) the presence of multiple types of random yet structured interference sources whose signatures are superimposed on those of signals, 3) a data representation that is not necessarily optimized for the task at hand, and 4) variable environmental and operating conditions, among many others. These challenges are compounded by the inherent difficulties of processing sequential multivariate data, namely the inability to exploit the statistics or structure of the entire data stream. On the other hand, the complications that must be addressed often vary significantly across data types, leading to an abundance of existing solutions that are each specialized for a particular application.
In other words, most existing work simultaneously considers only a subset of these complications, making it difficult to generalize. The work in this thesis was motivated by an application involving characterization of national park soundscapes in terms of commonly occurring man-made and natural acoustical sources, using streams of "1/3 octave vector" sequences. Naturally, this application requires solutions that consider all of the challenges mentioned above, among others. Two comprehensive solutions to this problem were developed, each with unique strengths and weaknesses relative to the other. A sequential random coefficient tracking (SRCT) method was developed first; it hierarchically applies a set of likelihood ratio tests to each incoming vector observation to detect and classify up to one signal and one interference source that may be simultaneously present. Since the signatures of each acoustical event typically span several adjacent observations, a Kalman filter is used to generate the parameters necessary for computing the likelihood values. The SRCT method is also capable of using the coefficient estimates produced by the Kalman filter to generate estimates of both the signal and interference components of an observation, thus performing separation in a dual-source scenario. The main benefits of this method are its computational efficiency and its ability to characterize both components of an observation (signal and interference). To address some of the main deficiencies of the SRCT method, a sparse coefficient state tracking (SCST) approach was also developed. This method was designed to detect and classify signals when multiple types of interference are simultaneously present, while avoiding restrictive assumptions concerning the distribution of observation components.
This SCST method uses generalized likelihood ratio tests to perform signal detection and classification during quiescent periods, and quiescence detection whenever a signal is present. To form these tests, the likelihood of each signal model is found given a sparse approximation of an incoming observation, which makes the temporal evolution of source signatures more tractable. Robustness to structured interference is incorporated by virtue of the inherent separation capabilities of sparse coding. Each signal model is characterized by a Bayesian network, which captures the dependencies between different coefficients in the sparse approximation under the associated hypothesis. In addition to developing two complete transient source characterization systems, this thesis also introduces several concepts and tools that may be used to aid in the development of new systems designed for similar tasks, or to supplement existing ones. Of particular note are a comprehensive overview of existing general approaches for detecting changes in the parameters of sequential data streams, a new method for fusing sequential classification decisions based on a hidden Markov model framework, and a detailed analysis of the 1/3 octave data format mentioned above. The latter is especially helpful since this data format is commonly used in audio analysis applications. A comprehensive study is carried out to evaluate the performance of the developed methods for detecting, classifying, and estimating the signatures of signals using 1/3 octave soundscape data corrupted with multiple types of structured interference. The systems are benchmarked against a Gaussian mixture model approach that was adapted to handle the complexities of the soundscape data, as such approaches are frequently used in acoustical source recognition applications.
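The sparse-approximation step that SCST builds on can be illustrated in miniature with plain matching pursuit against a tiny hand-built dictionary. The dictionary, dimensions, and stopping rule below are illustrative stand-ins, not the dissertation's actual signal models.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matching_pursuit(x, atoms, n_iter=2):
    """Greedy sparse approximation: repeatedly pick the (unit-norm) atom
    most correlated with the residual and subtract its projection."""
    resid = list(x)
    coefs = {}
    for _ in range(n_iter):
        scores = [dot(resid, a) for a in atoms]
        k = max(range(len(atoms)), key=lambda i: abs(scores[i]))
        coefs[k] = coefs.get(k, 0.0) + scores[k]
        resid = [r - scores[k] * ak for r, ak in zip(resid, atoms[k])]
    return coefs, resid

# Unit-norm dictionary: two "signal" shapes and one "interference" shape.
s = 1 / math.sqrt(2)
atoms = [
    [s, s, 0.0, 0.0],   # atom 0
    [0.0, 0.0, s, s],   # atom 1
    [s, -s, 0.0, 0.0],  # atom 2
]
x = [3.0, 3.0, 1.0, 1.0]   # observation = strong atom 0 plus weaker atom 1
coefs, resid = matching_pursuit(x, atoms)
print(coefs)
```

The observation decomposes onto atoms 0 and 1 with essentially zero residual, which is the separation property the abstract attributes to sparse coding.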
Performance is mainly measured in terms of the receiver operating characteristic (ROC) curves of the test statistics implemented by each method, the improvement in signal-to-noise ratio they offer when estimating signatures, and their overall ability to accurately detect and classify signals of interest. It was observed that both the SRCT and SCST methods perform exceptionally well on the national park soundscape data, though the latter performs best in the presence of heavy interference and is more flexible under new environmental and operating conditions.

Item Open Access: Constrained spline regression and hypothesis tests in the presence of correlation (Colorado State University. Libraries, 2013)
Wang, Huan, author; Meyer, Mary C., advisor; Opsomer, Jean D., advisor; Breidt, F. Jay, committee member; Reich, Robin M., committee member
Extracting the trend from a pattern of observations is always difficult, especially when the trend is obscured by correlated errors. Often, prior knowledge of the trend does not include a parametric family; instead, the valid assumptions are vague, such as "smooth" or "monotone increasing." Incorrectly specifying the trend as some simple parametric form can lead to overestimation of the correlation; conversely, misspecifying or ignoring the correlation leads to erroneous inference for the trend. In this dissertation, we explore spline regression with shape constraints, such as monotonicity or convexity, for estimation and inference in the presence of stationary AR(p) errors. Standard criteria for selecting the penalty parameter, such as the Akaike information criterion (AIC), cross-validation, and generalized cross-validation, have been shown to behave badly when the errors are correlated, even in the absence of shape constraints. In this dissertation, the correlation structure and penalty parameter are selected simultaneously using a correlation-adjusted AIC.
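The joint-selection idea can be sketched in a stripped-down form in which the trend model is just a sample mean and only the AR(1) coefficient is chosen by an AIC computed from whitened (quasi-differenced) residuals. The spline basis and penalty search of the dissertation are omitted, and the grid, parameter count, and simulation settings are illustrative.

```python
import math
import random

def ar1_aic(y, phi):
    """Gaussian AIC for a constant-mean model with AR(1) errors, using
    quasi-differenced (whitened) residuals e_t - phi * e_{t-1}."""
    mu = sum(y) / len(y)
    e = [v - mu for v in y]
    w = [e[t] - phi * e[t - 1] for t in range(1, len(e))]
    n = len(w)
    sigma2 = sum(v * v for v in w) / n
    return n * math.log(sigma2) + 2 * 2   # two fitted quantities: mean and phi

def select_phi(y, grid):
    """Pick the AR(1) coefficient minimizing the whitened-residual AIC."""
    return min(grid, key=lambda phi: ar1_aic(y, phi))

# Simulate AR(1) noise with phi = 0.6 around a constant mean of 5.
rng = random.Random(3)
e, y = 0.0, []
for _ in range(400):
    e = 0.6 * e + rng.gauss(0, 1)
    y.append(5.0 + e)

grid = [i / 10 for i in range(-9, 10)]
phi_hat = select_phi(y, grid)
print(phi_hat)   # close to the true 0.6
```

In the full method the AIC penalty also varies with the effective degrees of freedom of the shape-restricted spline fit, so the trend smoothness and the correlation are traded off against each other rather than selected separately.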
The asymptotic properties of unpenalized spline regression in the presence of correlation are investigated. It is proved that even if the estimation of the correlation is inconsistent, the corresponding projection estimate of the regression function can still be consistent and attain the optimal asymptotic rate, under appropriate conditions. The constrained spline fit attains the convergence rate of the unconstrained fit in the presence of AR(p) errors. Simulation results show that the constrained estimator typically behaves better than the unconstrained version when the true trend satisfies the constraints. Traditional statistical tests for the significance of a trend rely on restrictive assumptions about the functional form of the relationship, e.g., linearity. In this dissertation, we develop testing procedures that incorporate shape restrictions on the trend and can account for correlated errors. These tests can be used to check whether the trend is constant versus monotone, linear versus convex/concave, or any combination, such as constant versus increasing-and-convex. The proposed likelihood ratio test statistics have an exact null distribution if the covariance matrix of the errors is known. Theorems are developed for the asymptotic distributions of the test statistics when the covariance matrix is unknown but the test statistics incorporate a consistent estimator of the correlation. The proposed test is compared, through intensive simulations, with the F-test against the unconstrained alternative fit and with the one-sided t-test against the simple regression alternative fit. Both the size and power of the proposed test are favorable: smaller test size and greater power, in general, compared to the F-test and t-test.

Item Open Access: Evidence for a rotation in asthenospheric flow in northwest Canada: insights from shear wave splitting (Colorado State University. Libraries, 2021)
Bolton, Andrew R., author; Schutt, Derek L., advisor; Aster, Richard C., committee member; Breidt, F. Jay, committee member
The Mackenzie Mountains (MM) of northwest Canada are an actively uplifting, seismogenic salient of the northern Canadian Cordillera that lies 750 km NE of the nearest plate boundary. We present new shear wave splitting measurements from a linear array that transects the region to characterize upper mantle anisotropy. A gradual rotation in anisotropy occurs across the Canadian Cordillera, with stations nearest the craton yielding fast-axis orientations that are subparallel to North America absolute plate motion (~230°). Moving SW from the craton, across the MM and toward the plate boundary, fast-axis orientations rotate to align with major lithospheric fabrics (NW-SE). Previous work has shown that the Cordilleran lithosphere is thin (~50 km) in this region; we therefore interpret these results as primarily reflecting sublithospheric flow. Three subduction- and transpression-related hypotheses for the flow are presented, and our preferred hypothesis invokes depth-dependent, subduction-induced flow.

Item Open Access: Identification and characterization of super-spreaders from voluminous epidemiology data (Colorado State University. Libraries, 2016)
Shah, Harshil, author; Pallickara, Shrideep, advisor; Pallickara, Sangmi, advisor; Breidt, F. Jay, committee member
Planning for large-scale epidemiological outbreaks often involves executing compute-intensive disease spread simulations. To capture the probabilities of various outcomes, these simulations are executed several times over a collection of representative input scenarios, producing voluminous data. The resulting datasets contain valuable insights, including sequences of events, such as super-spreading events, that lead to extreme outbreaks. However, discovering and leveraging such information is also computationally expensive.
In this thesis, we propose a distributed approach for analyzing voluminous epidemiology data to locate and classify the super-spreaders in a disease network. Our methodology constructs analytical models using features extracted from the epidemiology data. The analytical models are amenable to interpretation, and disease planners can use them to inform the identification of super-spreaders that have a disproportionate effect on epidemiological outcomes, enabling effective allocation of limited resources such as vaccinations and field personnel.

Item Open Access: Inference for functional time series with applications to yield curves and intraday cumulative returns (Colorado State University. Libraries, 2016)
Young, Gabriel J., author; Kokoszka, Piotr S., advisor; Miao, Hong, committee member; Breidt, F. Jay, committee member; Zhou, Wen, committee member
Econometric and financial data often take the form of a functional time series. Examples include yield curves, intraday price curves, and term structure curves. Before an attempt is made to statistically model or predict such a series, we must address whether it can be assumed stationary or trend stationary. We develop extensions of the KPSS stationarity test to functional time series. Motivated by the problem of a change in the mean structure of yield curves, we also introduce several change-point methods applied to dynamic factor models. For all testing procedures, we include a complete asymptotic theory, a simulation study, illustrative data examples, and details of the numerical implementation. The impact of scheduled macroeconomic announcements has been shown to account for sizable fractions of total annual realized stock returns. To assess this impact, we develop methods of derivative estimation that utilize a functional analogue of local-polynomial smoothing.
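The scalar building block of such a derivative estimator, a local-linear weighted least-squares slope, can be sketched as follows. The functional (curve-valued) machinery is omitted, and the Gaussian kernel and bandwidth are illustrative choices.

```python
import math

def local_slope(xs, ys, x0, h):
    """Local-linear estimate of f'(x0): the slope of a Gaussian-weighted
    least-squares line fitted to the data around x0 with bandwidth h."""
    w = [math.exp(-((x - x0) / h) ** 2 / 2) for x in xs]
    sw = sum(w)
    xbar = sum(wi * x for wi, x in zip(w, xs)) / sw
    ybar = sum(wi * y for wi, y in zip(w, ys)) / sw
    num = sum(wi * (x - xbar) * (y - ybar) for wi, x, y in zip(w, xs, ys))
    den = sum(wi * (x - xbar) ** 2 for wi, x in zip(w, xs))
    return num / den

# Sanity check on noiseless data: for f(x) = x^2, f'(0.5) = 1.
xs = [i / 100 for i in range(101)]
ys = [x * x for x in xs]
slope = local_slope(xs, ys, 0.5, h=0.1)
print(slope)
```

Applied to cumulative intraday return curves, a positive estimated slope over an interval is the kind of evidence the confidence bands below formalize.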
The resulting confidence bands are then used to find time intervals of statistically increasing cumulative returns.

Item Open Access: Infinite dimensional stochastic inverse problems (Colorado State University. Libraries, 2018)
Yang, Lei, author; Estep, Donald, advisor; Breidt, F. Jay, committee member; Tavener, Simon, committee member; Zhou, Wen, committee member
In many disciplines, mathematical models such as differential equations are used to characterize physical systems. The model induces a complex nonlinear measurable map from the domain of physical parameters to the range of observable Quantities of Interest (QoI), computed by applying a set of functionals to the solution of the model. Often the parameters cannot be directly measured, and one is confronted with the task of inferring information about the values of the parameters given measured or imposed information about the values of the QoI. In such applications, there is generally significant uncertainty in the measured values of the QoI. Uncertainty is often modeled using probability distributions; for example, a probability structure imposed on the domain of the parameters induces a corresponding probability structure on the range of the QoI. This is the well-known Stochastic Forward Problem, typically solved using a variation of the Monte Carlo method. This dissertation is concerned with Stochastic Inverse Problems (SIP), where probability distributions are imposed on the range of the QoI and the problem is to compute the induced distributions on the domain of the parameters. In our formulation of the SIP and its generalization to the case where the physical parameters are functions, the main topics investigated include the existence, continuity, and numerical approximation of solutions. Chapter 1 introduces the background and previous research on the SIP, along with useful theorems, results, and notation used later.
Chapter 2 begins by establishing a relationship between Lebesgue measures on the domain and the range, and then studies the form of the solution of the SIP and its continuity properties. Chapter 3 proposes an algorithm for computing the solution of the SIP and discusses the convergence of the algorithm to the true solution. Chapter 4 exploits the fact that a function can be represented by its coefficients with respect to a basis, and extends the SIP framework to cases where the domain representing the basis coefficients is a countable cube with decaying edges, referred to as the infinite dimensional SIP. We then discuss how its solution can be approximated by the SIP whose domain is the finite dimensional cube obtained by taking a finite dimensional projection of the countable cube. Chapter 5 gives an algorithm for approximating the solution of the infinite dimensional SIP and proves that the algorithm converges to the true solution. Chapter 6 gives a numerical example showing the effects of different decay rates and the relation to truncation to finite dimensions. Chapter 7 reviews popular probabilistic inverse problem methods and proposes a combination of the SIP and statistical models to address problems encountered in practice.

Item Open Access: Joint tail modeling via regular variation with applications in climate and environmental studies (Colorado State University. Libraries, 2013)
Weller, Grant B., author; Cooley, Dan, advisor; Breidt, F. Jay, committee member; Estep, Donald, committee member; Schumacher, Russ, committee member
This dissertation presents applied, theoretical, and methodological advances in the statistical analysis of multivariate extreme values, employing the underlying mathematical framework of multivariate regular variation. Existing theory is applied in two climatology studies; these investigations represent novel applications of the regular variation framework in this field.
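A textbook diagnostic from this literature, the empirical tail-dependence coefficient chi(u), illustrates the kind of joint-tail quantity such analyses estimate: the chance that one variable is extreme given that the other is. This is a generic sketch, not the dissertation's estimator, and the simulated pairs are illustrative.

```python
import random

def chi_hat(x, y, u):
    """Empirical tail-dependence diagnostic: the fraction of observations
    exceeding the u-quantile in x whose paired y also exceeds its u-quantile."""
    qx = sorted(x)[int(u * len(x))]
    qy = sorted(y)[int(u * len(y))]
    exceed_x = [i for i in range(len(x)) if x[i] > qx]
    joint = sum(1 for i in exceed_x if y[i] > qy)
    return joint / len(exceed_x)

rng = random.Random(11)
n = 20000
x = [rng.expovariate(1) for _ in range(n)]
y_dep = x[:]                                    # perfectly dependent pair
y_ind = [rng.expovariate(1) for _ in range(n)]  # independent pair
chi_dep = chi_hat(x, y_dep, 0.95)
chi_ind = chi_hat(x, y_ind, 0.95)
print(chi_dep, chi_ind)
```

Asymptotic dependence pushes chi(u) toward 1 as u grows, while for independent pairs it decays toward 0; the hidden-regular-variation work described below concerns exactly the cases this crude diagnostic cannot distinguish, such as asymptotic independence versus true independence.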
Motivated by applications in environmental studies, a theoretical development in the analysis of extremes is introduced, along with novel statistical methodology. This work first details a novel study that employs the regular variation modeling framework to study uncertainties in a regional climate model's simulation of extreme precipitation events along the west coast of the United States, with a particular focus on the Pineapple Express (PE), a special type of winter storm. We model the tail dependence in past daily precipitation amounts seen in observational data and in output of the regional climate model, and we link atmospheric pressure fields to PE events. The fitted dependence model is utilized as a stochastic simulator of future extreme precipitation events, given output from a future-scenario run of the climate model. The simulator and the link to pressure fields are used to quantify the uncertainty in a future simulation of extreme precipitation events from the regional climate model, given boundary conditions from a general circulation model. A related study investigates two case studies of extreme precipitation from six regional climate models in the North American Regional Climate Change Assessment Program (NARCCAP). We find that simulated winter-season daily precipitation along the Pacific coast exhibits tail dependence with extreme events in the observational record. When considering summer-season daily precipitation over a central region of the United States, however, we find almost no correspondence between extremes simulated by NARCCAP and those seen in observations. Furthermore, we discover less consistency among the NARCCAP models in the tail behavior of summer precipitation over this region than in winter precipitation over the west coast region.
The analyses in this work indicate that the NARCCAP models are effective at downscaling winter precipitation extremes in the west coast region, but questions remain about their ability to simulate summer-season precipitation extremes in the central region. A deficiency of existing modeling techniques based on the multivariate regular variation framework is the inability to account for hidden regular variation, a feature of many theoretical examples and real data sets. One particular example of this deficiency is the inability to distinguish asymptotic independence from independence in the usual sense. This work develops a novel probabilistic characterization of random vectors possessing hidden regular variation as the sum of independent components. The characterization is shown to be asymptotically valid via a multivariate tail equivalence result, and an example is demonstrated via simulation. The sum characterization is employed to perform inference for the joint tail of random vectors possessing hidden regular variation. This dissertation develops a likelihood-based estimation procedure, employing a novel version of the Monte Carlo expectation-maximization algorithm which has been modified for tail estimation. The methodology is demonstrated on simulated data and applied to a bivariate series of air pollution data from Leeds, UK. We demonstrate the improvement in tail risk estimates offered by the sum representation over approaches which ignore hidden regular variation in the data.

Item Open Access Leveraging ensembles: balancing timeliness and accuracy for model training over voluminous datasets(Colorado State University. Libraries, 2020) Budgaga, Walid, author; Pallickara, Shrideep, advisor; Pallickara, Sangmi Lee, advisor; Ben-Hur, Asa, committee member; Breidt, F. Jay, committee member

As data volumes increase, there is a pressing need to make sense of the data in a timely fashion.
Voluminous datasets are often high-dimensional, with individual data points representing a vector of features. Data scientists fit models to the data—using all features or a subset thereof—and then use these models to inform their understanding of phenomena or make predictions. The performance of these analytical models is assessed based on their accuracy and ability to generalize to unseen data. Several existing frameworks can be used for drawing insights from voluminous datasets. However, there are some inefficiencies associated with these frameworks, including scalability, limited applicability beyond a target domain, prolonged training times, poor resource utilization, and insufficient support for combining diverse model fitting algorithms. In this dissertation, we describe our methodology for scalable supervised learning over voluminous datasets. The methodology explores partitioning the feature space, building models over these partitioned subsets of the data, and the impact of these choices on training times and accuracy. Using our methodology, a practitioner can harness a mix of learning methods to build diverse models over the partitioned data. Rather than build a single, all-encompassing model, we construct an ensemble of models trained independently over different portions of the dataset. In particular, we rely on concurrent and independent learning from different portions of the data space to overcome the issues relating to resource utilization and completion times associated with distributed training of a single model over the entire dataset. Our empirical benchmarks are performed using datasets from diverse domains, including epidemiology, music, and weather. These benchmarks demonstrate the suitability of our methodology for reducing training times while preserving accuracy, relative to a complex model trained on the entire dataset.
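The core idea of training independent models over partitions and combining them can be sketched in a few lines. This toy version splits rows at random and uses least squares everywhere, whereas the dissertation partitions the feature space and mixes learning methods; all names here are illustrative.

```python
import numpy as np

def fit_partitioned_ensemble(X, y, n_parts=4, seed=0):
    """Fit one linear model per disjoint data partition, independently.

    Each partition can be trained concurrently on a separate node;
    here we simply loop. Returns the list of per-partition models.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    models = []
    for part in np.array_split(idx, n_parts):
        Xp = np.column_stack([np.ones(len(part)), X[part]])
        beta, *_ = np.linalg.lstsq(Xp, y[part], rcond=None)
        models.append(beta)
    return models

def predict_ensemble(models, X):
    """Ensemble prediction: average the per-partition model outputs."""
    Xd = np.column_stack([np.ones(len(X)), X])
    return np.mean([Xd @ b for b in models], axis=0)

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=2000)
models = fit_partitioned_ensemble(X, y)
err = np.mean((predict_ensemble(models, X) - y) ** 2)
print(round(err, 3))
```

Because each partition is fit independently, wall-clock training time is governed by the largest partition rather than the full dataset.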
In particular, our methodology utilizes resources effectively, amortizing I/O and CPU costs across a distributed environment while ensuring a significant reduction of network traffic during training.

Item Open Access Linear system design for compression and fusion(Colorado State University. Libraries, 2013) Wang, Yuan, author; Wang, Haonan, advisor; Scharf, Louis L., advisor; Breidt, F. Jay, committee member; Luo, Rockey J., committee member

This is a study of measurement compression and fusion design. The idea common to both problems is that measurements can often be linearly compressed into lower-dimensional spaces without introducing too much excess mean-squared error or excess volume in a concentration ellipse. The question is how to design the compression to minimize the excesses at any given dimension. The first part of this work is motivated by sensing and wireless communication, where data compression or dimension reduction may be used to reduce the required communication bandwidth. The high-dimensional measurements are converted into low-dimensional representations through linear compression. Our aim is to compress a noisy measurement, allowing for the fact that the compressed measurement will be transmitted over a noisy channel. We review optimal compression with no transmission noise and show its connection with canonical coordinates. When the compressed measurement is transmitted with noise, we give the closed-form expression for the optimal compression matrix with respect to the trace and determinant of the error covariance matrix. We show that the solutions are canonical coordinate solutions, scaled by coefficients which account for canonical correlations and transmission noise variance, followed by a coordinate transformation into the sub-dominant invariant subspace of the channel noise. The second part of this work addresses the problem of integrating multiple sources of measurements.
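The canonical coordinates central to the first part's compression results can be computed from second-order statistics: whiten each block, then take the SVD of the resulting coherence matrix, whose singular values are the canonical correlations. The sketch below covers only the noise-free case the abstract reviews; matrix names and the simulated model are illustrative, not the dissertation's notation.

```python
import numpy as np

def canonical_compression(Sxx, Syy, Sxy, r):
    """Rank-r canonical coordinate compression of a measurement x.

    Whitens x and y with Cholesky factors, SVDs the coherence matrix,
    and returns (T, s): T maps x to its r dominant canonical
    coordinates, s holds the top r canonical correlations.
    """
    Wx = np.linalg.inv(np.linalg.cholesky(Sxx))
    Wy = np.linalg.inv(np.linalg.cholesky(Syy))
    C = Wx @ Sxy @ Wy.T                 # coherence between x and y
    U, s, Vt = np.linalg.svd(C)
    return U[:, :r].T @ Wx, s[:r]

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 200000))
y = 0.9 * x + 0.5 * rng.normal(size=(5, 200000))   # correlated signal
S = np.cov(np.vstack([x, y]))
Sxx, Syy, Sxy = S[:5, :5], S[5:, 5:], S[:5, 5:]
T, ccor = canonical_compression(Sxx, Syy, Sxy, r=2)
# Theoretical canonical correlations here: 0.9 / sqrt(1.06) ≈ 0.87
print(np.round(ccor, 2))
```

Keeping only the coordinates with the largest canonical correlations is what bounds the excess mean-squared error at a given compressed dimension.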
We consider two multiple-input-multiple-output channels, a primary channel and a secondary channel, with dependent input signals. The primary channel carries the signal of interest, and the secondary channel carries a signal that shares a joint distribution with the primary signal. The problem of particular interest is designing the secondary channel, with a fixed primary channel. We formulate the problem as an optimization problem, in which the optimal secondary channel maximizes an information-based criterion. An analytic solution is provided in a special case. Two fast-to-compute algorithms, one extrinsic and the other intrinsic, are proposed to approximate the optimal solutions in general cases. In particular, the intrinsic algorithm exploits the geometry of the unit sphere, a manifold embedded in Euclidean space. The performance of the proposed algorithms is examined through a simulation study. A discussion of the choice of dimension for the secondary channel is given, leading to rules for dimension reduction.

Item Open Access Low-latency, query-driven analytics over voluminous multidimensional, spatiotemporal datasets(Colorado State University. Libraries, 2017) Malensek, Matthew, author; Pallickara, Shrideep, advisor; Pallickara, Sangmi Lee, advisor; Bohm, A. P. Willem, committee member; Draper, Bruce, committee member; Breidt, F. Jay, committee member

Ubiquitous data collection from sources such as remote sensing equipment, networked observational devices, location-based services, and sales tracking has led to the accumulation of voluminous datasets; IDC projects that by 2020 we will generate 40 zettabytes of data per year, while Gartner and ABI estimate 20-35 billion new devices will be connected to the Internet in the same time frame.
The storage and processing requirements of these datasets far exceed the capabilities of modern computing hardware, which has led to the development of distributed storage frameworks that can scale out by assimilating more computing resources as necessary. While challenging in its own right, storing and managing voluminous datasets is only the precursor to a broader field of study: extracting knowledge, insights, and relationships from the underlying datasets. The basic building blocks of this knowledge discovery process are analytic queries, encompassing both query instrumentation and evaluation. This dissertation is centered around query-driven exploratory and predictive analytics over voluminous, multidimensional datasets. Both of these types of analysis represent a higher-level abstraction over classical query models; rather than indexing every discrete value for subsequent retrieval, our framework autonomously learns the relationships and interactions between dimensions in the dataset (including time series and geospatial aspects), and makes the information readily available to users. This functionality includes statistical synopses, correlation analysis, hypothesis testing, probabilistic structures, and predictive models that not only enable the discovery of nuanced relationships between dimensions, but also allow future events and trends to be predicted. This requires specialized data structures and partitioning algorithms, along with adaptive reductions in the search space and management of the inherent trade-off between timeliness and accuracy. The algorithms presented in this dissertation were evaluated empirically on real-world geospatial time-series datasets in a production environment, and are broadly applicable across other storage frameworks.

Item Open Access Model selection and nonparametric estimation for regression models(Colorado State University. Libraries, 2014) He, Zonglin, author; Opsomer, Jean, advisor; Breidt, F.
Jay, committee member; Meyer, Mary, committee member; Elder, John, committee member

In this dissertation, we deal with two different topics in statistics. The first topic, in survey sampling, deals with variable selection for a linear regression model from which we sample with a possibly informative design. Under the assumption that the finite population is generated by a multivariate linear regression model from which we sample with a possibly informative design, we study, from a theoretical standpoint, the variable selection criterion known as the predicted residual sum of squares in the sampling context. We examine the asymptotic properties of weighted and unweighted predicted residual sums of squares under weighted least squares regression estimation and ordinary least squares regression estimation. A simulation study of the variable selection criteria is provided, with the purpose of showing their ability to select the correct model in practical situations. For the second topic, we are interested in fitting a nonparametric regression model to data for the situation in which some of the covariates are categorical. In the univariate case where the covariate is an ordinal variable, we extend the local polynomial estimator, which normally requires continuous covariates, to a local polynomial estimator that allows for ordered categorical covariates. We derive the asymptotic conditional bias and variance for the local polynomial estimator with an ordinal covariate, under the assumption that the categories correspond to quantiles of an unobserved continuous latent variable. We conduct a simulation study with two patterns of ordinal data to evaluate our estimator. In the multivariate case where the covariates contain a mixture of continuous, ordinal, and nominal variables, we use a Nadaraya-Watson estimator with a generalized product kernel.
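A generalized product kernel multiplies a different kernel for each covariate type: a smooth kernel for continuous covariates, a geometrically decaying kernel for ordinal ones, and a discrete kernel for nominal ones. The sketch below is a minimal Li–Racine-style illustration, not the dissertation's estimator; the bandwidths and the simulated regression function are my own choices.

```python
import numpy as np

def nw_mixed(x0, Xc, Xo, Xn, y, h=0.2, lam_o=0.1, lam_n=0.1, n_cat=3):
    """Nadaraya-Watson estimate at x0 = (continuous, ordinal, nominal)
    using a generalized product kernel:
      - Gaussian kernel for the continuous covariate,
      - lam_o ** |x - z| for the ordinal covariate,
      - Aitchison-Aitken-style kernel for the nominal covariate.
    """
    c0, o0, n0 = x0
    kc = np.exp(-0.5 * ((Xc - c0) / h) ** 2)                  # continuous
    ko = lam_o ** np.abs(Xo - o0)                             # ordinal
    kn = np.where(Xn == n0, 1 - lam_n, lam_n / (n_cat - 1))   # nominal
    w = kc * ko * kn                                          # product kernel
    return np.sum(w * y) / np.sum(w)

rng = np.random.default_rng(0)
n = 5000
Xc = rng.uniform(-1, 1, n)
Xo = rng.integers(0, 4, n)    # ordered levels 0..3
Xn = rng.integers(0, 3, n)    # unordered categories 0..2
y = np.sin(np.pi * Xc) + 0.5 * Xo + (Xn == 1) + 0.1 * rng.normal(size=n)
# True regression value at (0.5, 2, 1): sin(pi/2) + 1 + 1 = 3.0
est = nw_mixed((0.5, 2, 1), Xc, Xo, Xn, y)
print(round(est, 2))
```

As the bandwidths lam_o and lam_n shrink to zero the estimator reduces to a cell-by-cell fit; larger values borrow strength across neighboring ordinal levels and across nominal categories.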
We derive the asymptotic conditional bias and variance for the Nadaraya-Watson estimator with continuous, ordinal, and nominal covariates, under the assumption that the categories of the ordinal covariate correspond to quantiles of an unobserved continuous latent variable. We conduct a multivariate simulation study to evaluate our Nadaraya-Watson estimator with a generalized product kernel.

Item Open Access Penalized isotonic regression and an application in survey sampling(Colorado State University. Libraries, 2016) Wu, Jiwen, author; Opsomer, Jean D., advisor; Meyer, Mary C., advisor; Breidt, F. Jay, committee member; Doherty, Paul, committee member

In isotonic regression, the mean function is assumed to be monotone increasing (or decreasing) but otherwise unspecified. The classical isotonic least-squares estimator is known to be inconsistent at boundaries; this is called the spiking problem. A penalty on the range of the regression function is proposed to correct the spiking problem for univariate and multivariate isotonic models. The penalized estimator is shown to be consistent everywhere for a wide range of sizes of the penalty parameter. For the univariate case, the optimal penalty is shown to depend on the derivatives of the true regression function at the boundaries. Pointwise confidence intervals are constructed using the penalized estimator and bootstrapping ideas; these are shown through simulations to behave well in moderate-sized samples. Simulation studies also show that the power of the hypothesis test of a constant versus an increasing regression function improves substantially compared to the power of the test with an unpenalized alternative, and also compares favorably to tests using parametric alternatives. The application of isotonic regression is also considered in the survey context, where many variables contain natural orderings that should be respected in the estimates.
For instance, the National Compensation Survey estimates mean wages for many job categories, and these mean wages are expected to be non-decreasing according to job level. In this type of situation, isotonic regression can be applied to give constrained estimators satisfying monotonicity. We combine domain estimation and the pooled adjacent violators algorithm to construct new design-weighted constrained estimators. The resulting estimator is the classical design-based domain estimator, but after adaptive pooling of neighboring domains, so that it is both readily implemented in large-scale surveys and easy to explain to data users. Under mild conditions on the sampling design and the population, the estimators are shown to be design consistent and asymptotically normal. Confidence intervals for domain means using linearization-based and replication-based variance estimation show marked improvements compared to survey estimators that do not incorporate the constraints. Furthermore, a cone projection algorithm is implemented in the domain mean estimate to accommodate qualitative constraints in the case of two covariates. Theoretical properties of the constrained estimators have been investigated, and a simulation study is used to demonstrate the improvement of confidence intervals when using the constrained estimate. We also provide a relaxed monotone constraint to loosen the qualitative assumptions, where the extent of departure from monotonicity can be controlled by a weight function and a chosen bandwidth. We compare the unconstrained estimate, the constrained estimate without penalty, the constrained estimate with penalty, and the relaxed constrained estimate.
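The pooled adjacent violators algorithm (PAVA) at the core of these constrained estimators repeatedly merges adjacent blocks whose (weighted) means violate monotonicity, which is what "adaptive pooling of neighboring domains" amounts to when the inputs are domain estimates with survey weights. A minimal weighted sketch of the classical algorithm, not the dissertation's full design-weighted estimator:

```python
def pava(y, w=None):
    """Weighted pooled adjacent violators: non-decreasing least-squares
    fit to y. Each block stores [mean, total weight, count]; adjacent
    blocks are pooled while their means are out of order.
    """
    n = len(y)
    w = [1.0] * n if w is None else list(w)
    blocks = [[y[0], w[0], 1]]
    for i in range(1, n):
        blocks.append([y[i], w[i], 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(w1 * m1 + w2 * m2) / wt, wt, c1 + c2])
    fit = []
    for m, _, c in blocks:
        fit.extend([m] * c)
    return fit

print(pava([1, 3, 2, 4, 3, 5]))
# → [1, 2.5, 2.5, 3.5, 3.5, 5]
```

With survey data, y would be the per-domain estimated means and w the corresponding sums of design weights, so pooled domains receive their design-weighted average.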
Improvements are found in the confidence intervals, with higher coverage rates and shorter interval lengths when incorporating the constraints, and the penalized version fixes the spiking problem at the boundary.

Item Unknown Performance and reliability evaluation of Sacramento demonstration novel ICPC solar collectors(Colorado State University. Libraries, 2012) Daosukho, Jirachote "Pong", author; Duff, William S., advisor; Troxell, Wade O., advisor; Burns, Patrick J., committee member; Breidt, F. Jay, committee member

This dissertation focuses on the reliability and degradation of the novel integral compound parabolic concentrator (ICPC) evacuated solar collector over a 13-year period. The study investigates failure modes of the collectors and analyzes the effects of those failures on performance. An instantaneous efficiency model was used to calculate performance and efficiencies from the measurements. An animated graphical ray-tracing simulation tool was developed to investigate the optical performance of the ICPC for the vertical and horizontal absorber fin orientations. The animated graphical ray tracing allows the user to visualize the propagation of rays through the ICPC optics. The ray-tracing analysis also showed that the horizontal-fin ICPC's performance was more robust to degradation of the reflective surface. Thermal losses were also a part of the performance calculations. The two main degradation mechanisms are reflectivity degradation due to air and fluid leakage into the vacuum enclosure, and loss of vacuum due to leaks through cracks. Reflectivity degradation causes a reduction in optical performance, and the loss of vacuum causes a reduction in thermal performance.

Item Open Access Regression of network data: dealing with dependence(Colorado State University. Libraries, 2019) Marrs, Frank W., author; Fosdick, Bailey K., advisor; Breidt, F.
Jay, committee member; Zhou, Wen, committee member; Wilson, James B., committee member

Network data, which consist of measured relations between pairs of actors, characterize some of the most pressing problems of our time, from environmental treaty legislation to human migration flows. A canonical problem in analyzing network data is to estimate the effects of exogenous covariates on a response that forms a network. Unlike typical regression scenarios, network data often naturally engender excess statistical dependence -- beyond that represented by covariates -- due to relations that share an actor. For analyzing bipartite network data observed over time, we propose a new model that accounts for excess network dependence directly, as this dependence is of scientific interest. In an example of international state interactions, we are able to infer the networks of influence among the states, such as which states' military actions are likely to incite other states' military actions. In the remainder of the dissertation, we focus on situations where inference on effects of exogenous covariates on the network is the primary goal of the analysis, and thus, the excess network dependence is a nuisance effect. In this setting, we leverage an exchangeability assumption to propose novel parsimonious estimators of regression coefficients for both binary and continuous network data, and new estimators for coefficient standard errors for continuous network data. The exchangeability assumption we rely upon is pervasive in network and array models in the statistics literature, but not previously considered when adjusting for dependence in a regression of network data. Although the estimators we propose are aligned with many network models in the literature, our estimators are derived from the assumption of exchangeability rather than proposing a particular parametric model for representing excess network dependence in the data.
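The excess dependence from relations that share an actor is easy to exhibit in simulation: generate a dyadic response with sender and receiver effects, fit an ordinary regression that ignores them, and observe that residuals in the same row remain correlated. This toy demonstration illustrates the nuisance dependence only; it does not reproduce the dissertation's exchangeability-based estimators.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60                                # actors
beta = 2.0
x = rng.normal(size=(n, n))           # dyadic covariate x_ij
a = rng.normal(size=n)                # sender effects
b = rng.normal(size=n)                # receiver effects
# Relational response with actor effects inducing excess dependence
y = beta * x + a[:, None] + b[None, :] + rng.normal(size=(n, n))
np.fill_diagonal(y, np.nan)           # no self-relations

mask = ~np.isnan(y)
xv, yv = x[mask], y[mask]
bhat = np.sum(xv * yv) / np.sum(xv * xv)   # naive OLS slope (no intercept)
resid = y - bhat * x

# Residual pairs sharing a sender (same row, distinct receivers):
# they share a_i, so their correlation is var(a)/total var = 1/3 here.
row_pairs = [(resid[i, j], resid[i, k]) for i in range(n)
             for j in range(n) for k in range(j + 1, n)
             if j != i and k != i]
r = np.corrcoef(np.array(row_pairs).T)[0, 1]
print(round(r, 2))
```

The slope estimate remains close to the truth (point estimation is fine), but any standard error that treats the n(n-1) relations as independent understates the uncertainty, which is exactly the problem the exchangeability-based standard errors address.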