Browsing by Author "Wang, Haonan, committee member"
Now showing 1 - 20 of 34
Item Open Access A fiducial approach to extremes and multiple comparisons (Colorado State University. Libraries, 2010) Wandler, Damian V., author; Hannig, Jan, advisor; Iyer, Hariharan K., advisor; Chong, Edwin Kah Pin, committee member; Wang, Haonan, committee member

Generalized fiducial inference is a powerful tool for many difficult problems. Based on an extension of R. A. Fisher's work, we use generalized fiducial inference for two extreme value problems and a multiple comparison procedure. The first extreme value problem deals with the generalized Pareto distribution, which is relevant to many situations when modeling extremes of random variables. We use a fiducial framework to perform inference on the parameters and the extreme quantiles of the generalized Pareto distribution. This inference technique is demonstrated both when the threshold is a known parameter and when it is unknown. Simulation results suggest good empirical properties, and the method compares favorably to similar Bayesian and frequentist methods. The second extreme value problem pertains to the largest mean of a multivariate normal distribution. Difficulties arise when two or more of the means are simultaneously the largest. Our solution uses a generalized fiducial distribution and allows for equal largest means to alleviate the overestimation that commonly occurs. Theoretical calculations, simulation results, and an application suggest that our solution possesses promising asymptotic and empirical properties. Our solution to the largest mean problem arose from our ability to identify the correct largest mean(s); this is essentially a model selection problem. As a result, we applied a similar model selection approach to the multiple comparison problem, allowing for all possible groupings (of equality) of the means of k independent normal distributions.
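As a small aside, the generalized Pareto quantiles discussed in this abstract have a closed form once parameters are fixed; the sketch below is a plain quantile calculation with hypothetical parameter values, not the dissertation's fiducial procedure:

```python
import math

def gpd_quantile(p, sigma, xi, threshold=0.0):
    """p-th quantile of a generalized Pareto distribution with
    scale sigma, shape xi, and a known threshold (location)."""
    if xi == 0.0:
        # The shape-zero limit of the GPD is the exponential distribution.
        return threshold + sigma * -math.log(1.0 - p)
    return threshold + (sigma / xi) * ((1.0 - p) ** (-xi) - 1.0)

# Hypothetical exceedance model: scale 2.0, shape 0.1, threshold 10.0.
q99 = gpd_quantile(0.99, sigma=2.0, xi=0.1, threshold=10.0)  # ~21.7
```

A fiducial or Bayesian analysis would place a distribution over sigma and xi and propagate it through this quantile map rather than plugging in point values.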
Our resulting fiducial probability for the groupings of the means demonstrates the effectiveness of our method by selecting the correct grouping at a high rate.

Item Open Access Accurate prediction of protein function using GOstruct (Colorado State University. Libraries, 2011) Sokolov, Artem, author; Ben-Hur, Asa, advisor; Anderson, Chuck, committee member; McConnell, Ross M., committee member; Wang, Haonan, committee member

With the growing number of sequenced genomes, automatic prediction of protein function is one of the central problems in computational biology. Traditional methods employ transfer of functional annotation on the basis of sequence or structural similarity and are unable to deal effectively with today's noisy high-throughput biological data. Most of the approaches based on machine learning, on the other hand, break the problem up into a collection of binary classification problems, effectively asking the question "does this protein perform this particular function?"; such methods often produce sets of predictions that are inconsistent with each other. In this work, we present GOstruct, a structured-output framework that answers the question "what function does this protein perform?" in the context of hierarchical multilabel classification. We show that GOstruct is able to deal effectively with a large number of disparate data sources from multiple species. Our empirical results demonstrate that the framework achieves state-of-the-art accuracy in two recent challenges in automatic function prediction: Mousefunc and CAFA.

Item Open Access Advanced Bayesian framework for uncertainty estimation of sediment transport models (Colorado State University. Libraries, 2018) Jung, Jeffrey Youngjai, author; Niemann, Jeffrey D., advisor; Greimann, Blair P., committee member; Julien, Pierre Y., committee member; Wang, Haonan, committee member

Numerical sediment transport models are widely used to forecast the potential changes in rivers that might result from natural and/or human influences. Unfortunately, predictions from those models always possess uncertainty, so engineers interpret the model results very conservatively, which can lead to expensive over-design of projects. The Bayesian inference paradigm provides a formal way to evaluate the uncertainty in model forecasts originating from uncertain model elements. However, existing Bayesian methods have rarely been used for sediment transport models because they often have large computational times. In addition, past research has not sufficiently addressed ways to treat the uncertainty associated with diverse sediment transport variables. To resolve those limitations, this study establishes a formal and efficient Bayesian framework to assess uncertainty in the predictions of sediment transport models. Throughout this dissertation, new methodologies are developed to represent each of three main uncertainty sources: poorly specified model parameter values, measurement errors contained in the model input data, and imperfect sediment transport equations used in the model structure. The new methods characterize how those uncertain elements affect the model predictions. First, a new algorithm is developed to estimate the parameter uncertainty and its contribution to prediction uncertainty using fewer model simulations. Second, the uncertainties of various input data are described using simple error equations and evaluated within the parameter estimation framework. Lastly, an existing method that can assess the uncertainty related to the selection and application of a transport equation is modified to enable consideration of multiple model output variables.
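Bayesian parameter uncertainty of the kind described in this abstract is commonly explored with Markov chain Monte Carlo sampling. The sketch below runs a random-walk Metropolis sampler on a toy one-parameter linear model; the model, data, and prior are hypothetical stand-ins for illustration, not the dissertation's sediment transport formulation:

```python
import math
import random

def log_posterior(theta, data, prior_sd=10.0):
    """Log-posterior for a toy model y ~ Normal(theta * x, 1)
    with a Normal(0, prior_sd) prior on theta."""
    lp = -0.5 * (theta / prior_sd) ** 2
    for x, y in data:
        lp += -0.5 * (y - theta * x) ** 2
    return lp

def metropolis(data, n_iter=5000, step=0.2, seed=1):
    """Random-walk Metropolis: propose a Gaussian step and accept it
    with probability min(1, posterior ratio)."""
    random.seed(seed)
    theta = 0.0
    chain = []
    for _ in range(n_iter):
        proposal = theta + random.gauss(0.0, step)
        if math.log(random.random()) < log_posterior(proposal, data) - log_posterior(theta, data):
            theta = proposal
        chain.append(theta)
    return chain

# Synthetic observations generated around a true slope of 2.0.
data = [(x, 2.0 * x + 0.1 * ((x % 3) - 1)) for x in range(1, 8)]
chain = metropolis(data)
posterior_mean = sum(chain[1000:]) / len(chain[1000:])
```

The spread of the retained chain (after burn-in) is the parameter-uncertainty estimate; the dissertation's contribution is obtaining such estimates with far fewer model evaluations than a loop like this requires.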
The new methodologies are tested with a one-dimensional sediment transport model that simulates flume experiments and a natural river. Overall, the results show that the new approaches can reduce the computational time by about 16% to 55% and produce more accurate estimates (e.g., prediction ranges can cover about 6% to 46% more of the available observations) compared to existing Bayesian methods. Thus, this research enhances the applicability of Bayesian inference for sediment transport modeling. In addition, this study provides several avenues to improve the reliability of the uncertainty estimates, which can help guide interpretation of model results and strategies to reduce prediction uncertainty.

Item Open Access Automating investigative pattern detection using machine learning & graph pattern matching techniques (Colorado State University. Libraries, 2022) Muramudalige, Shashika R., author; Jayasumana, Anura P., advisor; Ray, Indrakshi, committee member; Kim, Ryan G., committee member; Wang, Haonan, committee member

Identification and analysis of latent and emergent behavioral patterns are core tasks in investigative domains such as homeland security, counterterrorism, and crime prevention. Development of behavioral trajectory models associated with radicalization, and tracking of individuals and groups based on such trajectories, are critical for law enforcement investigations, but they are hampered by the sheer volume and nature of the data that need to be mined and processed. Dynamic and complex behaviors of extremists and extremist groups, missing or incomplete information, and a lack of intelligent tools further obstruct counterterrorism efforts. Our research is aimed at developing state-of-the-art computational tools while building on recent advances in machine learning, natural language processing (NLP), and graph databases.

In this work, we address the challenges of investigative pattern detection by developing algorithms, tools, and techniques primarily aimed at behavioral pattern tracking and identification for domestic radicalization. The methods developed are integrated into a framework, the Investigative Pattern Detection Framework for Counterterrorism (INSPECT). INSPECT includes components for extracting information using NLP techniques, information networks stored in appropriate databases that enable investigative graph searches, and data synthesis via generative adversarial techniques to overcome limitations due to incomplete and sparse data. These components streamline investigative pattern detection while accommodating various use cases and datasets. While our outcomes benefit law enforcement and counterterrorism applications countering the threat of violent extremism, the results presented demonstrate that the proposed framework is adaptable to diverse behavioral pattern analysis domains such as consumer analytics, cybersecurity, and behavioral health. Information on radicalization activity and participant profiles of interest to investigative tasks is mostly found in disparate text sources. We integrate NLP approaches such as named entity recognition (NER), coreference resolution, and multi-label text classification to extract structured information regarding behavioral indicators, temporal details, and other metadata. We further use multiple text pre-processing approaches to improve the accuracy of data extraction. Our training text datasets are intrinsically small and label-wise imbalanced, which hinders direct application of NLP techniques. By fine-tuning a pre-trained NLP model on our specific datasets via transfer learning, we achieve noteworthy improvement in information extraction.

The extracted information from text sources represents a rich knowledge network of populations with various types of connections, which needs to be stored, updated, and repeatedly inspected for the emergence of patterns over the long term. Therefore, we utilize graph databases as the primary storage option while maintaining the reliability and scalability of behavioral data processing. To query suspicious and vulnerable individuals or groups, we implement investigative graph search algorithms as custom stored procedures on top of graph databases while verifying their ability to operate at scale. We use datasets in different contexts to demonstrate the wide applicability and enhanced effectiveness of observing suspicious or latent trends using our investigative graph searches. Investigative data is by nature incomplete and sparse, and the number of cases that may be used for training investigators or machine learning algorithms is small. This is an inherent concern in investigative and many other contexts where data collection is tedious and the available data is limited and may also be subject to privacy concerns. Large datasets help social scientists and investigative authorities enhance their skills and achieve greater accuracy and reliability, and a reasonably large training data volume is also essential for applying the latest machine learning techniques for improved classification and detection. In this work, we propose a generative adversarial network (GAN) based approach with novel feature mapping techniques to synthesize additional data from a small and sparse dataset while preserving its statistical characteristics. We also compare our proposed method with two likelihood-based approaches, i.e., multivariate Gaussian and regular-vine copulas. We verify the robustness of the proposed technique via simulation and real-world datasets representing diverse domains.
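For contrast with the GAN approach, the multivariate Gaussian baseline mentioned in this abstract can be sketched in a few lines: fit a mean and covariance to 2-D samples, then draw synthetic points through a Cholesky factor. This is a toy illustration of the baseline idea with made-up data, not the feature-mapped GAN itself:

```python
import math
import random

def fit_gaussian_2d(points):
    """Estimate the mean vector and 2x2 sample covariance of (x, y) pairs."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))

def sample_gaussian_2d(mean, cov, n, seed=0):
    """Draw n synthetic samples via the Cholesky factor of the covariance."""
    random.seed(seed)
    (mx, my), ((sxx, sxy), (_, syy)) = mean, cov
    l11 = math.sqrt(sxx)
    l21 = sxy / l11
    l22 = math.sqrt(syy - l21 ** 2)
    out = []
    for _ in range(n):
        z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
        out.append((mx + l11 * z1, my + l21 * z1 + l22 * z2))
    return out

# Made-up correlated samples standing in for sparse investigative features.
points = [(float(i), 2.0 * i + (i % 2)) for i in range(12)]
mean, cov = fit_gaussian_2d(points)
synthetic = sample_gaussian_2d(mean, cov, 500, seed=3)
```

A Gaussian baseline preserves only first- and second-order statistics, which is why copulas and GANs are brought in when the data are discrete, sparse, or non-Gaussian.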
The proposed GAN-based data generation approach is applicable to other domains, as demonstrated with two applications. First, we extend our data generation approach to a computer security application, improving phishing website detection with synthesized datasets. We merge measured datasets with synthesized samples and re-train models to improve the performance of classification models and mitigate vulnerability to adversarial samples. The second application involves video traffic classification, in which the datasets are enhanced while preserving statistical similarity between the actual and synthesized datasets. For the video traffic data generation, we modify our data generation technique to capture the temporal patterns in time series data. In this application, we integrate a Wasserstein GAN (WGAN) by using different snapshots of the same video signal with feature-mapping techniques. A trace splitting algorithm is presented for training on video traces that exhibit higher data throughput, with high bursts at the beginning of the video session compared to the rest of the session. With synthesized data, we obtain a 5-15% accuracy improvement in classification compared to using only actual traces. The INSPECT framework is validated primarily by mining detailed forensic biographies of known jihadists, which are extensively used by social/political scientists. Additionally, each component in the framework is extensively validated with a Human-In-The-Loop (HITL) process, which improves the reliability and accuracy of machine learning models, investigative graph algorithms, and other computing tools based on feedback from social scientists. The entire framework is embedded in a modular architecture in which the analytical components are implemented independently and are adjustable for different requirements and datasets.

We verified the proposed framework's reliability, scalability, and generalizability with datasets in different domains. This research also makes a significant contribution to discrete and sparse data generation in diverse application domains with novel generative adversarial data synthesizing techniques.

Item Open Access Cooperative sensing for target estimation and target localization (Colorado State University. Libraries, 2011) Zhang, Wenshu, author; Yang, Liuqing, advisor; Pezeshki, Ali, committee member; Luo, J. Rockey, committee member; Wang, Haonan, committee member

As a novel sensing scheme, cooperative sensing has drawn great interest in recent years. By utilizing the concept of "cooperation", which incorporates communications and information exchanges among multiple sensing devices, e.g., radar transceivers in radar systems, sensor nodes in wireless sensor networks, or mobile handsets in cellular systems, the sensing capability can be improved significantly over the conventional noncooperative mode in many respects. For example, cooperative target estimation is inspired by the benefits of MIMO in communications, where multiple transmit and/or receive antennas can increase the diversity to combat channel fading for enhanced transmission reliability and increase the degrees of freedom for improved data rate. Cooperative target localization, on the other hand, can dramatically improve localization performance in terms of both accuracy and coverage. From the perspective of cooperative target estimation, in this dissertation we optimize waveforms from multiple cooperative transmitters to facilitate better target estimation in the presence of colored noise. We introduce the normalized MSE (NMSE) minimizing criterion for radar waveform design. Not only is it more meaningful for parameter estimation problems, but it also behaves more similarly to the MI criterion than does its MMSE counterpart.
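For readers unfamiliar with the figure of merit, a normalized MSE measures estimation error relative to signal energy; the one-liner below is a simplified batch definition for illustration only, not the waveform design objective itself:

```python
def nmse(estimates, truths):
    """Normalized mean squared error: total squared estimation error
    divided by the total energy of the true values."""
    err = sum((e - t) ** 2 for e, t in zip(estimates, truths))
    energy = sum(t ** 2 for t in truths)
    return err / energy

# A perfect estimate gives 0; errors are measured relative to signal energy.
score = nmse([1.1, 1.9, 3.2], [1.0, 2.0, 3.0])  # 0.06 / 14
```

Because the error is scaled by signal energy, the criterion weights parameters comparably regardless of their magnitudes, which is part of why it is more natural for parameter estimation than raw MMSE.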
We also study robust designs for both the probing waveforms at the transmitter and the estimator at the receiver to address one type of a priori information uncertainty, i.e., in-band target and noise PSD uncertainties. The relationship between MI and MSEs is further investigated through analysis of the sensitivity of the optimum design to out-band PSD uncertainties, also known as the overestimation error. From the perspective of cooperative target localization, in this dissertation we study the two phases that comprise a localization process, i.e., the distance measurement phase and the location update phase. In the first, the distance measurement phase, thanks to UWB signals' many desirable features, including high delay resolution and obstacle penetration capabilities, we adopt UWB technology for TOA estimation and then translate the TOA estimate into distance given the speed of light. We develop a practical data-aided ML timing algorithm and obtain its optimum training sequence. Based on this optimum sequence, the original ML algorithm can be simplified without affecting its optimality. In the second, the location update phase, we investigate secure cooperative target localization in the presence of malicious attacks, a fundamental issue in localization problems. We explicitly incorporate anchors' misplacements into the distance measurement model and exploit the pairwise sparse nature of the misplacements. We formulate the secure localization problem as an ℓ1-regularized least squares (LS) problem and establish the pairwise sparsity upper bound, which defines the largest possible number of identifiable malicious anchors. In particular, it is demonstrated that, with target cooperation, the capability of secure localization is improved in terms of misplacement estimation and target location estimation accuracy compared to the single-target case.

Item Open Access Data mining and spatiotemporal analysis of modern mobile data (Colorado State University. Libraries, 2019) Fang, Luoyang, author; Yang, Liuqing, advisor; Jayasumana, Anura P., committee member; Luo, Jie, committee member; Wang, Haonan, committee member

Modern mobile network technologies and smartphones have penetrated nearly every aspect of human life due to the increasing number of mobile applications and services. Massive mobile data generated by mobile networks, carrying timestamp and location information, are frequently collected. Mobile data analytics has gained remarkable attention from various research communities and industries, since it can broadly reveal human spatiotemporal mobility patterns from the individual level to an aggregated one. In this dissertation, two types of spatiotemporal modeling of human mobility behaviors are considered, namely individual modeling and aggregated modeling. For individual spatiotemporal modeling, location privacy is studied in terms of user identifiability between two mobile datasets, based merely on their spatiotemporal traces, from the perspective of a privacy adversary. The success of user identification then hinges upon effective distance measures via user spatiotemporal behavior profiling. However, user identification methods depending on a single semantic distance measure almost always lead to a large portion of false matches. To improve user identification performance, we propose a scalable multi-feature ensemble matching framework that integrates multiple explored spatiotemporal models. On the other hand, aggregated spatiotemporal modeling is investigated for network and traffic management in cellular networks. The traffic demand forecasting problem across the entire mobile network, where demand is the aggregated behavior of network users, is studied first. The success of demand forecasting relies on effective modeling of both the spatial and temporal dependencies of the per-cell demand time series.
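Spatial dependence among neighboring cells of the kind just described is what a graph convolution captures: each cell's features are mixed with those of its graph neighbors. Below is a minimal single-layer sketch over a hypothetical three-cell dependency graph, for illustration only, not the dissertation's network:

```python
def graph_conv(adj, feats, weight):
    """One graph convolution step: add self-loops, row-normalize the
    adjacency, mix each node's features with its neighbors', then apply
    a scalar layer weight and a ReLU nonlinearity."""
    n = len(adj)
    # Self-loops let each cell keep its own signal.
    a = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    out = []
    for i in range(n):
        deg = sum(a[i])
        mixed = [sum(a[i][j] * feats[j][k] for j in range(n)) / deg
                 for k in range(len(feats[0]))]
        out.append([max(0.0, m * weight) for m in mixed])
    return out

# Hypothetical 3-cell dependency graph: cells 0-1 and 1-2 are neighbors.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
feats = [[1.0], [2.0], [3.0]]  # one demand feature per cell
hidden = graph_conv(adj, feats, weight=1.0)  # [[1.5], [2.0], [2.5]]
```

Stacking such layers, and feeding their outputs to a recurrent network over time, gives the general shape of a graph-convolutional/recurrent forecaster.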
However, the main challenge of spatial relevancy modeling in per-cell demand forecasting is the uneven spatial distribution of cells in a network. In this work, a dependency graph is proposed to model the spatial relevancy without compromising the spatial granularity. Accordingly, spatial and temporal models, namely graph convolutional and recurrent neural networks, are adopted to forecast the per-cell traffic demands. In addition to demand forecasting, a per-cell idle time window (ITW) prediction application is further studied for predictive network management based on subscribers' aggregated spatiotemporal behaviors. First, ITW prediction is formulated as a regression problem with an ITW presence confidence index that facilitates direct ITW detection and estimation. To predict the ITW, a deep-learning-based ITW prediction model is proposed, consisting of a representation learning network and an output network. The representation learning network is aimed at learning patterns from the recent history of demand and mobility, while the output network is designed to generate the ITW predictions with the learned representation and exogenous periodic features as inputs. Upon this paradigm, a temporal graph convolutional network (TGCN) implementing the representation learning network is also proposed to capture the graph-based spatiotemporal input features effectively.

Item Open Access Decision and learning in large networks (Colorado State University. Libraries, 2013) Zhang, Zhenliang, author; Pezeshki, Ali, advisor; Chong, Edwin K. P., advisor; Wang, Haonan, committee member; Luo, Rockey J., committee member

To view the abstract, please see the full text of the document.

Item Open Access Deep learning for radar beam blockage correction (Colorado State University. Libraries, 2023) Tan, Songjian, author; Chen, Haonan, advisor; Chandrasekaran, V., committee member; Wang, Haonan, committee member

This thesis proposes a deep learning framework based on generative adversarial networks (GANs) for correcting partial beam blockage regions in polarimetric radar observations. Correcting such data is an essential step in radar data quality control and subsequent quantitative applications, especially in complex terrain environments. The proposed methodology is demonstrated using two S-band operational Weather Surveillance Radar - 1988 Doppler (WSR-88D) systems located in different regions of the western United States, characterized by different precipitation types. To train the GAN model, observation sectors of both radars are manually cropped to simulate partial beam blockage situations. The effectiveness of the trained models is demonstrated using independent precipitation events in Texas and California, and their generalization capacity is examined by cross-testing the data with different precipitation features. The beam blockage correction performance is compared with a traditional linear interpolation approach, and the results show that the proposed approach significantly improves the continuity of precipitation observations in both domains. While visible discrepancies exist between the models trained on convective and stratiform precipitation events in Texas and California, respectively, both models outperform the traditional interpolation method. The repaired observations demonstrate great potential for improved quantitative applications, despite the unavailability of ground truth for real blocked radar data.

Item Open Access Distributed medium access control for an enhanced physical-link layer interface (Colorado State University. Libraries, 2020) Heydaryanfroshani, Faeze, author; Luo, Rockey, advisor; Yang, Liuqing, committee member; Pezeshki, Ali, committee member; Wang, Haonan, committee member

The current wireless network architecture equips the data link layer with binary transmission/idling options and gives control of the other communication parameters to the physical layer. Such an architecture is inefficient in distributed wireless networks, where user coordination can be infeasible or expensive in terms of overhead. To address this issue, an enhancement to the physical-link layer interface is proposed. At the physical layer, the enhanced interface is supported by a distributed channel coding theory, which equips each physical layer user with an ensemble of channel codes. The coding theory allows each transmitter to choose an arbitrary code to encode its message without sharing that decision with the receiver. The receiver, on the other hand, should decode the messages of interest or report collision, depending on whether or not a predetermined reliability threshold can be met. Fundamental limits of the system are characterized asymptotically using a "distributed channel capacity" when the codeword length can be taken to infinity, and non-asymptotically using an achievable performance bound when the codeword length is finite. The focus of this dissertation is supporting the enhanced interface at the data link layer. We assume that each link layer user can be equipped with multiple transmission options, each corresponding to a coding option at the physical layer. Each user maintains a transmission probability vector whose entries specify the probability with which the user chooses the corresponding transmission option to transmit its packets.
We propose a distributed medium access control (MAC) algorithm for a time-slotted multiple access system, with or without the enhanced physical-link layer interface, to adapt the transmission probability vector of each user to a desired equilibrium that maximizes a chosen network utility. The MAC algorithm is applicable to a general channel model and to a wide range of utility functions. The algorithm falls into the stochastic approximation framework, with guaranteed convergence under mild conditions. We develop design procedures to satisfy these conditions and to ensure that the system converges to a unique equilibrium. Simulation results demonstrate the fast and adaptive convergence behavior of the MAC algorithm, as well as the near-optimal performance of the designed equilibrium. We then extend the distributed MAC algorithm to support a hierarchical primary-secondary user structure in a random multiple access system. The hierarchical user structure is established in the following senses. First, when the number of primary users is small, channel availability is kept above a pre-determined threshold regardless of the number of secondary users competing for the channel. Second, when the number of primary users is large, the transmission probabilities of the secondary users are automatically driven down to zero. This hierarchical structure is achieved without knowledge of the numbers of primary and secondary users and without direct information exchange among the users. Furthermore, we also investigate distributed MAC for a multiple access system with multiple non-interfering channels. We assume that users are homogeneous but the channels can be heterogeneous. In this case, forcing all users to converge to a homogeneous transmission scheme becomes suboptimal. We extend the distributed MAC algorithm to adaptively assign each user to a single channel and to ensure a balanced load across the different channels.
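The probability-adaptation idea can be sketched with a toy stochastic approximation loop for slotted random access: every user shares an attempt probability p, observes whether a slot was idle, and nudges p so that the idle fraction approaches the throughput-optimal value 1/e. This is a simplified single-option illustration with hypothetical step sizes, not the algorithm proposed in the dissertation:

```python
import math
import random

def adapt_attempt_probability(n_users, n_slots=20000, step=0.001, seed=7):
    """Toy stochastic-approximation loop for slotted random access.
    Each slot, compare the observed idle indicator with the target idle
    fraction exp(-1) and move p in the direction that closes the gap."""
    random.seed(seed)
    target_idle = math.exp(-1.0)  # idle fraction at the optimal attempt rate
    p = 0.5
    history = []
    for _ in range(n_slots):
        transmitters = sum(1 for _ in range(n_users) if random.random() < p)
        idle = 1.0 if transmitters == 0 else 0.0
        # More idle slots than the target -> channel underused -> raise p.
        p = min(1.0, max(0.001, p + step * (idle - target_idle)))
        history.append(p)
    return sum(history[-5000:]) / 5000  # average over the settled tail

p_eq = adapt_attempt_probability(10)  # settles near 1/n = 0.1
```

In expectation the update drifts toward the p solving (1-p)^n = e^-1, i.e. roughly 1/n, without any user knowing n; comparing a measured success statistic against its theoretical value is the same mechanism the dissertation generalizes to utility-maximizing equilibria.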
While theoretical analysis of the extended MAC algorithm is still incomplete, simulation results show that the algorithm can help users to converge to a near optimal channel assignment solution that maximizes a given network utility.Item Open Access Distributed wireless networking with an enhanced physical-link layer interface(Colorado State University. Libraries, 2019) Tang, Yanru, author; Luo, Rockey, advisor; Yang, Liuqing, committee member; Pezeshki, Ali, committee member; Wang, Haonan, committee memberThis thesis focuses on the cross-layer design of physical and data link layers to support efficient distributed wireless networking. At the physical layer, distributed coding theorems are proposed to prepare each transmitter with an ensemble of channel codes. In a time slot, a transmitter chooses a code to encode its messages and such a choice is not shared with other transmitters or with the receiver. The receiver guarantees either reliable message decoding or reliable collision report depending on whether a predetermined reliability threshold can be met. Under the assumption that the codeword length can be taken to infinity, the distributed capacity of a discrete-time memoryless multiple access channel is derived and is shown to coincide with the classical Shannon capacity region of the same channel. An achievable error performance bound is also presented for the case when codeword length is finite. With the new coding theorems, link layer users can be equipped with multiple transmission options corresponding to the physical layer code ensemble. This enables link layer users to exploit advanced wireless capabilities such as rate and power adaptation, which is not supported in the current network architecture. To gain understandings on how link layer users should efficiently exploit these new capabilities, the corresponding link layer problem is investigated from two different perspectives. 
Under the assumption that each user is provided with multiple transmission options, the link layer problem is first formulated using a game theoretic model where each user adapts its transmission scheme to maximize a utility function. The condition under which the medium access control game has a unique Nash equilibrium is obtained. Simulation results show that, when multiple transmission options are provided, users in a distributed network tend to converge to channel sharing schemes that are consistent with the well-known information theoretic understandings. A stochastic approximation framework is adopted to further study the link layer problem for the case when each user has a single transmission option as well as the case when each user has multiple transmission options. Assume that each user is backlogged with a saturated message queue. With a generally-modeled channel, a distributed medium access control framework is proposed to adapt the transmission scheme of each user to maximize an arbitrarily chosen symmetric network utility. The proposed framework suggests that the receiver should measure the success probability of a carefully designed virtual packet or a set of virtual packets, and feed such information back to the transmitters. Given channel feedback from the receiver, each transmitter should obtain a user number estimate by comparing the measured success probability with the corresponding theoretical value, and then adapt its transmission scheme accordingly. Conditions under which the proposed algorithm should converge to a designed unique equilibrium are characterized. Simulation results are provided to demonstrate the optimality and the convergence properties of the proposed algorithm.Item Open Access Embodied multimodal referring expressions generation(Colorado State University. 
Libraries, 2024) Alalyani, Nada H., author; Krishnaswamy, Nikhil, advisor; Ortega, Francisco, committee member; Blanchard, Nathaniel, committee member; Wang, Haonan, committee memberUsing both verbal and non-verbal modalities in generating definite descriptions of objects and locations is a critical human capability in collaborative interactions. Despite advancements in AI, embodied interactive virtual agents (IVAs) are not equipped to intelligently mix modalities to communicate their intents as humans do, which hamstrings naturalistic multimodal IVA. We introduce SCMRE, a situated corpus of multimodal referring expressions (MREs) intended for training generative AI systems in multimodal IVA, focusing on multimodal referring expressions. Our contributions include: 1) Developing an IVA platform that interprets human multimodal instructions and responds with language and gestures; 2) Providing 24 participants with 10 scenes, each involving ten equally-sized blocks randomly placed on a table. These interactions generated a dataset of 10,408 samples; 3) Analyzing SCMRE, revealing that the utilization of pointing significantly reduces the ambiguity of prompts and increases the efficiency of IVA's execution of humans' prompts; 4) Augmenting and synthesizing SCMRE, resulting in 22,159 samples to generate more data for model training; 5) Finetuning LLaMA 2-chat-13B for generating contextually-correct and situationally-fluent multimodal referring expressions; 6) Integrating the fine-tuned model into the IVA to evaluate the success of the generative model-enabled IVA in communication with humans; 7) Establishing the evaluation process which applies to both humans and IVAs and combines quantitative and qualitative metrics.Item Open Access Heterogeneous computing environment characterization and thermal-aware scheduling strategies to optimize data center power consumption(Colorado State University. Libraries, 2012) Al-Qawasmeh, Abdulla, author; Siegel, H. 
J., advisor; Maciejewski, Anthony A., advisor; Pasricha, Sudeep, committee member; Wang, Haonan, committee memberMany computing systems are heterogeneous, both in terms of the performance of their machines and in terms of the characteristics and computational complexity of the tasks that execute on them. Furthermore, different tasks are better suited to execute on specific types of machines. Optimally mapping tasks to machines in a heterogeneous system is, in general, an NP-complete problem. In most cases, heuristics are used to find near-optimal mappings. The performance of allocation heuristics can be affected significantly by factors such as task and machine heterogeneities. In this thesis, different measures are identified for quantifying the heterogeneity of heterogeneous computing (HC) systems, and the correlation between the performance of the heuristics and these measures is shown. The power consumption of data centers has been increasing at a rapid rate over the past few years. Motivated by the need to reduce the power consumption of data centers, many researchers have been investigating methods to increase energy efficiency in computing at different levels: chip, server, rack, and data center. Many of today's data centers experience physical limitations on the power needed to run the data center. The first problem studied in this thesis is maximizing the performance of a data center that is subject to total power consumption and thermal constraints. A power model for a data center that includes power consumed in both Computer Room Air Conditioning (CRAC) units and compute nodes is considered. The approach in this thesis quantifies the performance of the data center as the total reward collected from completing tasks in a workload by their individual deadlines. The second problem studied in this research is how to minimize the power consumption in a data center while guaranteeing that the overall performance does not drop below a specified threshold.
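The task and machine heterogeneity measures discussed above can be illustrated with a toy estimated-time-to-compute (ETC) matrix. The coefficient-of-variation aggregates below are a hypothetical sketch of this style of measure, not the exact definitions used in the thesis:

```python
import numpy as np

def coefficient_of_variation(x):
    """Standard deviation divided by the mean (assumes positive values)."""
    x = np.asarray(x, dtype=float)
    return x.std() / x.mean()

def heterogeneity_measures(etc):
    """Summarize an ETC matrix where etc[i, j] is the estimated time to
    compute task type i on machine type j. Task heterogeneity averages the
    COV down each machine's column (how much execution times vary across
    tasks on one machine); machine heterogeneity averages the COV along
    each task's row (how much one task's time varies across machines).
    Illustrative aggregates only."""
    etc = np.asarray(etc, dtype=float)
    task_het = np.mean([coefficient_of_variation(col) for col in etc.T])
    machine_het = np.mean([coefficient_of_variation(row) for row in etc])
    return task_het, machine_het

# A homogeneous system: every task takes the same time on every machine.
homog = np.full((4, 3), 10.0)
# A heterogeneous system: execution times vary widely.
rng = np.random.default_rng(0)
heter = rng.uniform(1.0, 100.0, size=(4, 3))

t0, m0 = heterogeneity_measures(homog)   # both 0.0
t1, m1 = heterogeneity_measures(heter)   # both strictly positive
print(t0, m0, t1, m1)
```

Mapping heuristics can then be benchmarked across ETC matrices spanning a grid of such heterogeneity values.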
For both problems, novel optimization techniques for assigning the performance states of compute cores at the data center level to optimize the operation of the data center are developed. The assignment techniques are divided into two stages. The first stage assigns the P-states of cores, the desired number of tasks per unit time allocated to a core, and the outlet CRAC temperatures. The second stage assigns individual tasks as they arrive at the data center to cores so that the actual number of tasks per unit time allocated to a core approaches the desired number set by the first stage.Item Open Access Improving radar quantitative precipitation estimation through optimizing radar scan strategy and deep learning(Colorado State University. Libraries, 2024) Wang, Liangwei, author; Chen, Haonan, advisor; Chandrasekaran, Venkatchalam, committee member; Wang, Haonan, committee memberAs radar technology plays a crucial role in various applications, including weather forecasting and military surveillance, understanding the impact of different radar scan elevation angles is paramount to optimize radar performance and enhance its effectiveness. The elevation angle, which refers to the vertical angle at which the radar beam is directed, significantly influences the radar's ability to detect, track, and identify targets. The effect of different elevation angles on radar performance depends on factors such as radar type, operating environment, and target characteristics. To illustrate the impact of lowering the minimum scan elevation angle on surface rainfall mapping, this article focuses on the KMUX WSR-88D radar in Northern California as an example, within the context of the National Weather Service's efforts to upgrade its operational Weather Surveillance Radar. By establishing polarimetric radar rainfall relations using local disdrometer data, the study aims to estimate surface rainfall from radar observations, with a specific emphasis on shallow orographic precipitation. 
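Establishing a radar rainfall relation from disdrometer observations, as described above, can be sketched in its simplest form as fitting a power law R = aZ^b by least squares in log space. The coefficients, noise model, and data below are invented for illustration; the thesis's polarimetric relations involve additional radar variables beyond reflectivity:

```python
import numpy as np

# Synthetic "disdrometer" pairs of reflectivity Z and rain rate R (mm/h),
# generated from a known power law R = a * Z**b with multiplicative noise.
rng = np.random.default_rng(42)
a_true, b_true = 0.017, 0.714          # made-up coefficients for the sketch
Z = rng.uniform(1e2, 1e4, size=200)
R = a_true * Z**b_true * rng.lognormal(0.0, 0.05, size=200)

# Fit log R = log a + b log Z by ordinary least squares.
X = np.column_stack([np.ones_like(Z), np.log(Z)])
coef, *_ = np.linalg.lstsq(X, np.log(R), rcond=None)
a_hat, b_hat = np.exp(coef[0]), coef[1]

print(f"fitted R = {a_hat:.4f} * Z^{b_hat:.3f}")
```

The fitted relation can then be applied bin by bin to radar observations at each scan elevation to map surface rainfall.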
The findings indicate that a lower scan elevation angle yields superior performance, with a significant 16.1% improvement in the normalized standard error and a 19.5% enhancement in the Pearson correlation coefficient, particularly at long distances from the radar. In addition, while conventional approaches to radar rainfall estimation have limitations, recent studies have demonstrated that deep learning techniques can mitigate parameterization errors and enhance precipitation estimation accuracy. However, training a model that can be applied to a broad domain poses a challenge. To address this, the study leverages crowdsourced data from NOAA and SFL, employing a convolutional neural network with a residual block to transfer knowledge learned from one location to other domains characterized by different precipitation properties. The experimental results showcase the efficacy of this approach, highlighting its superiority over conventional fixed-parameter rainfall algorithms. Machine learning methods have shown promising potential in improving the accuracy of quantitative precipitation estimation (QPE), which is critical in hydrology and meteorology. While significant progress has been made in applying machine learning to QPE, there is still ample room for further research and development. Future endeavors in machine-learning-based QPE will primarily focus on enhancing model accuracy, reliability, and interpretability while considering practical operational applications in hydrology and meteorology.Item Open Access Low rank representations of matrices using nuclear norm heuristics(Colorado State University. Libraries, 2014) Osnaga, Silvia Monica, author; Kirby, Michael, advisor; Peterson, Chris, advisor; Bates, Dan, committee member; Wang, Haonan, committee memberThe pursuit of low dimensional structure from high dimensional data leads in many instances to finding the lowest rank matrix among a parameterized family of matrices.
In its most general setting, this problem is NP-hard. Different heuristics have been introduced for approaching the problem. Among them is the nuclear norm heuristic for rank minimization. One aspect of this thesis is the application of the nuclear norm heuristic to the Euclidean distance matrix completion problem. As a special case, the approach is applied to the graph embedding problem. More generally, semi-definite programming, convex optimization, and the nuclear norm heuristic are applied to the graph embedding problem in order to extract invariants such as the chromatic number, R^n embeddability, and Borsuk embeddability. In addition, we apply related techniques to decompose a matrix into components which simultaneously minimize a linear combination of the nuclear norm and the spectral norm. In the case when the Euclidean distance matrix is the distance matrix for a complete k-partite graph, it is shown that the nuclear norm of the associated positive semidefinite matrix can be evaluated in terms of the second elementary symmetric polynomial evaluated at the partition. We prove that for k-partite graphs the maximum value of the nuclear norm of the associated positive semidefinite matrix is attained when there is an equal number of vertices in each set of the partition. We use this result to determine a lower bound on the chromatic number of the graph. Finally, we describe a convex optimization approach to decomposition of a matrix into two components using the nuclear norm and spectral norm.Item Open Access Machine learning-based phishing detection using URL features: a comprehensive review(Colorado State University.
Libraries, 2023) Asif, Asif Uz Zaman, author; Ray, Indrakshi, advisor; Shirazi, Hossein, advisor; Ray, Indrajit, committee member; Wang, Haonan, committee memberIn a social engineering attack known as phishing, a perpetrator sends a false message to a victim while posing as a trusted representative in an effort to collect private information, such as login passwords and financial information, for personal gain. To successfully carry out a phishing attack, counterfeit websites, emails, and messages are used to trick the victim. Machine learning appears to be a promising technique for phishing detection. Typically, website content and Uniform Resource Locator (URL) based features are used. However, gathering website content features requires visiting malicious sites, and preparing the data is labor-intensive. Towards this end, researchers are investigating whether URL-only information can be used for phishing detection. Such approaches are lightweight and can be installed at the client's end, do not require data collection from malicious sites, and can identify zero-day attacks. We conduct a systematic literature review on URL-based phishing detection. We selected papers that appeared in top conferences and journals in cybersecurity and were either recent (2018 onward) or highly cited (50+ citations in Google Scholar). This survey provides researchers and practitioners with information on the current state of research on URL-based website phishing attack detection methodologies. The results of this study show that the lack of a centralized dataset is, in one respect, beneficial: it prevents attackers from seeing the features that classifiers employ. However, it makes data preparation time-consuming for researchers.
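URL-only detection of the kind surveyed above rests on lexical features computed from the URL string alone, with no page visit. The feature set below is an illustrative subset of those commonly reported in this literature, not a definitive list:

```python
import re
from urllib.parse import urlparse

def lexical_url_features(url):
    """Extract simple lexical features from a URL string alone.
    Illustrative subset of features used in the surveyed literature."""
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.netloc
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_digits": sum(c.isdigit() for c in url),
        "num_dots": host.count("."),
        "num_hyphens": host.count("-"),
        "has_at_symbol": int("@" in url),
        # Raw-IP hosts are a classic phishing indicator.
        "has_ip_host": int(bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host))),
        "path_depth": len([p for p in parsed.path.split("/") if p]),
    }

feats = lexical_url_features("http://192.168.0.1/login-secure/update/account")
print(feats)
```

Vectors like this are what a classifier such as a Random Forest would consume; because no third-party service is queried, the scheme can run entirely on the client.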
Furthermore, both machine learning and deep learning algorithms can be utilized, since they achieve very good classification accuracy; in this work, we found that Random Forest and Long Short-Term Memory are good choices. Using task-specific lexical characteristics, rather than concentrating on the sheer number of features, is essential for this task, because feature selection affects how accurately algorithms detect phishing URLs.Item Open Access Methods for extremes of functional data(Colorado State University. Libraries, 2018) Xiong, Qian, author; Kokoszka, Piotr S., advisor; Cooley, Daniel, committee member; Pinaud, Olivier, committee member; Wang, Haonan, committee memberMotivated by the problem of extreme behavior of functional data, we develop statistical theory at the nexus of functional data analysis (FDA) and extreme value theory (EVT). A fundamental technique of functional data analysis is to replace infinite dimensional curves with finite dimensional representations in terms of functional principal components (FPCs). The coefficients of these projections, called the scores, encode the shapes of the curves. Therefore, the study of the extreme behavior of a functional time series can be transformed into a study of its functional principal component scores. We first derive two tests of significance of the slope function using functional principal components and their empirical counterparts (EFPCs). Applied to tropical storm data, these tests show a significant trend in the annual pattern of upper wind speed levels of hurricanes. Then we establish sufficient conditions under which the asymptotic extreme behavior of the multivariate estimated scores is the same as that of the population scores. We clarify these issues, including the rate of convergence, for Gaussian functions and for more general functional time series whose projections are in the Gumbel domain of attraction.
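The reduction of curves to FPC scores described above can be sketched for densely observed curves: the empirical FPCs are eigenvectors of the sample covariance of the centered curves, and the scores are projections onto them. The toy two-component data below are invented for the example:

```python
import numpy as np

# n curves observed on a common grid of 100 points (rows = curves).
rng = np.random.default_rng(1)
grid = np.linspace(0, 1, 100)
n = 50
# Toy functional data: random combinations of two smooth basis shapes.
scores_true = rng.normal(size=(n, 2)) * np.array([3.0, 1.0])
basis = np.vstack([np.sin(2 * np.pi * grid), np.cos(2 * np.pi * grid)])
curves = scores_true @ basis + 0.05 * rng.normal(size=(n, len(grid)))

# Empirical FPCs: eigenvectors of the sample covariance operator of the
# centered curves; empirical scores are projections onto those vectors.
centered = curves - curves.mean(axis=0)
cov = centered.T @ centered / n
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
fpcs = eigvecs[:, order[:2]]        # leading two empirical FPCs
scores = centered @ fpcs            # empirical FPC scores, one row per curve

# Two shapes generated the data, so two FPCs should dominate.
explained = eigvals[order][:2].sum() / eigvals.sum()
print(round(explained, 3))
```

Extreme-value analysis of the functional time series can then proceed on the low-dimensional score vectors rather than on the raw curves.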
Finally, we derive the asymptotic distribution of the sample covariance operator and of the sample functional principal components for functions which are regularly varying and whose fourth moment does not exist. The new theory is applied to establish the consistency of the regression operator in a functional linear model, with such errors.Item Open Access Nonparametric tests for informative selection and small area estimation for reconciling survey estimates(Colorado State University. Libraries, 2020) Liu, Teng, author; Breidt, F. Jay, advisor; Wang, Haonan, committee member; Estep, Donald J., committee member; Doherty, Paul F., Jr., committee memberTwo topics in the analysis of complex survey data are addressed: testing for informative selection and addressing temporal discontinuities due to survey redesign. Informative selection, in which the distribution of response variables given that they are sampled is different from their distribution in the population, is pervasive in modern complex surveys. Failing to take such informativeness into account could produce severe inferential errors, such as biased parameter estimators, wrong coverage rates of confidence intervals, incorrect test statistics, and erroneous conclusions. While several parametric procedures exist to test for informative selection in the survey design, it is often hard to check the parametric assumptions on which those procedures are based. We propose two classes of nonparametric tests for informative selection, each motivated by a nonparametric test for two independent samples. The first nonparametric class generalizes classic two-sample tests that compare empirical cumulative distribution functions, including Kolmogorov–Smirnov and Cramér–von Mises, by comparing weighted and unweighted empirical cumulative distribution functions. 
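The comparison of weighted and unweighted empirical cumulative distribution functions described above can be sketched with a Kolmogorov-Smirnov-type distance. This is an illustration of the idea only; the actual tests involve specific normalizations and null distributions derived in the work:

```python
import numpy as np

def ks_weighted_vs_unweighted(y, w):
    """KS-type sup distance between the weighted (Hajek) ECDF built from
    survey weights w and the unweighted ECDF of the sample y. A large
    distance suggests informative selection."""
    y, w = np.asarray(y, float), np.asarray(w, float)
    order = np.argsort(y)
    y, w = y[order], w[order]
    n = len(y)
    unweighted = np.arange(1, n + 1) / n
    weighted = np.cumsum(w) / w.sum()
    return np.max(np.abs(weighted - unweighted))

rng = np.random.default_rng(7)
y = rng.normal(size=500)
# Noninformative selection: weights unrelated to y, ECDFs nearly coincide.
d_null = ks_weighted_vs_unweighted(y, rng.uniform(1, 2, size=500))
# Informative selection: weights grow with y, so the weighted ECDF shifts.
d_alt = ks_weighted_vs_unweighted(y, np.exp(y))
print(d_null < d_alt)  # True
```

A Cramér-von Mises-type variant would integrate the squared difference between the two ECDFs instead of taking the supremum.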
The second nonparametric class adapts two-sample tests that compare distributions based on the maximum mean discrepancy to the setting of weighted and unweighted distributions. The asymptotic distributions of both test statistics are established under the null hypothesis of noninformative selection. Simulation results demonstrate the usefulness of the asymptotic approximations, and show that our tests have competitive power with parametric tests in a correctly specified parametric setting while achieving greater power in misspecified scenarios. Many surveys face the problem of comparing estimates obtained with different methodology, including differences in frames, measurement instruments, and modes of delivery. Differences may exist within the same survey; for example, multi-mode surveys are increasingly common. Further, it is inevitable that surveys need to be redesigned from time to time. Major redesign of survey processes could affect survey estimates systematically, and it is important to quantify and adjust for such discontinuities between the designs to ensure comparability of estimates over time. We propose a small area estimation approach to reconcile two sets of survey estimates, and apply it to two surveys in the Marine Recreational Information Program (MRIP). We develop a log-normal model for the estimates from the two surveys, accounting for temporal dynamics through regression on population size and state-by-wave seasonal factors, and accounting in part for changing coverage properties through regression on wireless telephone penetration. Using the estimated design variances, we develop a regression model that is analytically consistent with the log-normal mean model. 
We use the modeled design variances in a Fay-Herriot small area estimation procedure to obtain empirical best linear unbiased predictors of the reconciled effort estimates for all states and waves, and provide an asymptotically valid mean square error approximation.Item Open Access Pandemic perceptions: analyzing sentiment in COVID-19 tweets(Colorado State University. Libraries, 2023) Bashir, Shadaab Kawnain, author; Ray, Indrakshi, advisor; Shirazi, Hossein, advisor; Wang, Haonan, committee memberSocial media, particularly Twitter, became the center of public discourse during the COVID-19 global crisis, shaping narratives and perceptions. Recognizing the critical need for a detailed examination of this digital interaction, our research dives into the mechanics of pandemic-related Twitter conversations. This study seeks to understand the many dynamics and effects at work in the dissemination of COVID-19 information by analyzing and comparing the response patterns displayed by tweets from influential individuals and organizational accounts. To meet the research goals, we gathered a large dataset of COVID-19-related tweets during the pandemic, which was then meticulously manually annotated. In this work, task-specific transformers and large language models (LLMs) are used to provide tools for analyzing the digital effects of COVID-19 through sentiment analysis. By leveraging RoBERTa[Twitter], a domain-specific model fine-tuned on social media data, this research improved performance on the critical task of sentiment analysis. Our investigation demonstrates that individuals express subjective feelings more frequently than organizations, whereas organizations disseminate more pandemic-related content in general.Item Open Access Penalized unimodal spline density estimate with application to M-estimation(Colorado State University.
Libraries, 2020) Chen, Xin, author; Meyer, Mary C., advisor; Wang, Haonan, committee member; Kokoszka, Piotr, committee member; Zhou, Wen, committee member; Miao, Hong, committee memberThis dissertation establishes a novel type of robust estimation, Auto-Adaptive M-estimation (AAME), based on a new density estimate. AAME is highly data-driven, requiring no a priori knowledge of the error distribution. It offers improved performance against fat-tailed or highly contaminated errors over existing M-estimators by automatically down-weighting influential outliers. It is shown to be root-n consistent and has an asymptotically normal sampling distribution, which provides asymptotic confidence intervals and the basis of robust prediction intervals. The new density estimate, a penalized unimodal spline density estimate, is established as the basis for AAME. It is constrained to be unimodal and symmetric and to integrate to 1, and it is penalized to stabilize its derivatives and guard against over-fitting, overall satisfying the requirements for its application in AAME. The new density estimate is shown to be consistent, and its optimal asymptotic convergence rate can be obtained when the penalty is asymptotically bounded. We also extend AAME to linear models with heavy-tailed and dependent errors. The dependence of the errors is modeled by an autoregressive process, and parameters are estimated jointly.Item Open Access Phishing detection using machine learning(Colorado State University. Libraries, 2021) Shirazi, Hossein, author; Ray, Indrakshi, advisor; Anderson, Chuck, advisor; Malaiya, Yashwant K., committee member; Wang, Haonan, committee memberOur society, economy, education, critical infrastructure, and other aspects of our life have become largely dependent on cyber technology. Thus, cyber threats now endanger various aspects of our daily life.
Phishing attacks, even with sophisticated detection algorithms, remained the top Internet crime by victim count in 2020. Adversaries learn from their previous attempts to (i) improve attacks and lure more victims and (ii) bypass existing detection algorithms to steal users' identities and sensitive information and increase their financial gain. Machine learning appears to be a promising approach for phishing detection, with classification algorithms distinguishing between legitimate and phishing websites. While machine learning algorithms have shown promising results, we observe multiple limitations in existing approaches. Current algorithms do not preserve the privacy of end-users because they query third-party services. There is a lack of sufficient phishing samples for training machine learning algorithms, and over-represented targets bias existing datasets. Finally, adversarial sampling attacks degrade the performance of detection models. We propose four sets of solutions to address these challenges. We first propose a domain-name-based phishing detection solution that focuses solely on the domain name of websites to distinguish phishing websites from legitimate ones. This approach does not use any third-party services and preserves the privacy of end-users. We then propose a fingerprinting algorithm that consists of finding similarities (using both visual and textual characteristics) between a legitimate targeted website and a given suspicious website. This approach addresses the issue of bias towards over-represented samples in the datasets. Finally, we explore in depth the effect of adversarial sampling attacks on phishing detection algorithms, starting with feature manipulation strategies. The results show that these attacks degrade the performance of classification algorithms significantly.
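The degradation caused by adversarial feature manipulation can be illustrated with a toy detector. The nearest-centroid classifier, the synthetic feature distributions, and the "nudge toward the legitimate centroid" attack below are invented stand-ins for the classifiers and manipulation strategies actually studied:

```python
import numpy as np

rng = np.random.default_rng(5)
# Toy feature vectors: class 0 = legitimate, class 1 = phishing.
legit = rng.normal(0.0, 1.0, size=(200, 4))
phish = rng.normal(3.0, 1.0, size=(200, 4))

# A minimal nearest-centroid "detector" trained on the two classes.
c_legit, c_phish = legit.mean(axis=0), phish.mean(axis=0)

def predict(x):
    d0 = np.linalg.norm(x - c_legit, axis=1)
    d1 = np.linalg.norm(x - c_phish, axis=1)
    return (d1 < d0).astype(int)

test_phish = rng.normal(3.0, 1.0, size=(100, 4))
acc_clean = predict(test_phish).mean()   # fraction of phish caught

# Adversarial feature manipulation: nudge each phishing sample's
# features toward the legitimate centroid.
alpha = 0.6
adv = test_phish + alpha * (c_legit - test_phish)
acc_adv = predict(adv).mean()            # detection rate after the attack

print(acc_clean, acc_adv)
```

Even this crude manipulation collapses the detection rate, which motivates both larger training sets and adversarially robust training.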
In the next step, we focus on two goals: improving the performance of classification algorithms by increasing the size of the datasets used, and making the detection algorithm robust against adversarial sampling attacks using an adversarial autoencoder.