Browsing by Author "Pallickara, Shrideep, advisor"
Now showing items 1-20 of 23
Item Open Access
A framework for real-time, autonomous anomaly detection over voluminous time-series geospatial data streams (Colorado State University. Libraries, 2014)
Budgaga, Walid, author; Pallickara, Shrideep, advisor; Pallickara, Sangmi Lee, advisor; Ben-Hur, Asa, committee member; Schumacher, Russ, committee member
In this research work we present an approach encompassing both algorithm and system design to detect anomalies in data streams. Individual observations within these streams are multidimensional, with each dimension corresponding to a feature of interest. We consider time-series geospatial datasets generated by remote and in situ observational devices. Three aspects make this problem particularly challenging: (1) the cumulative volume and arrival rates of the data, (2) the evolution of anomalies over time, and (3) the spatio-temporal correlations associated with the data. Therefore, anomaly detections must be accurate and performed in real time. Given the data volumes involved, solutions must minimize user intervention and be amenable to distributed processing to ensure scalability. Our approach achieves accurate, high-throughput classifications in real time. We rely on Expectation Maximization (EM) to build Gaussian Mixture Models (GMMs) that model the densities of the training data. Rather than one all-encompassing model, our approach involves multiple model instances, each of which is responsible for a particular geographical extent and can also adapt as data evolve. We have incorporated these algorithms into our distributed storage platform, Galileo, and profiled their suitability through empirical analysis, which demonstrates high throughput (10,000 observations per second, per node) and low latency on real-world datasets.
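The thesis itself does not publish code; the sketch below is a minimal illustration of the density-based scheme the abstract describes, fitting a Gaussian mixture with EM via scikit-learn's GaussianMixture and flagging low-likelihood observations as anomalies. The features, component count, and percentile cutoff are illustrative assumptions, not values taken from the thesis.

```python
# Minimal sketch of density-based anomaly detection with a Gaussian Mixture
# Model fit via Expectation Maximization. The thesis pairs one such model
# with each geographical extent and adapts it as data evolve; this sketch
# shows a single static model.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
# Hypothetical 2-feature observations (e.g., temperature, humidity).
train = rng.normal(loc=[20.0, 65.0], scale=[3.0, 8.0], size=(5000, 2))

gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0)
gmm.fit(train)

# Flag observations whose log-likelihood falls below the 1st percentile of
# the training data; the percentile cutoff is an assumption for illustration.
threshold = np.percentile(gmm.score_samples(train), 1.0)

def is_anomalous(observation):
    """Return True if the observation lies in a low-density region."""
    return gmm.score_samples(observation.reshape(1, -1))[0] < threshold

print(is_anomalous(np.array([21.0, 60.0])))  # typical -> likely False
print(is_anomalous(np.array([55.0, 5.0])))   # far from training density -> likely True
```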
Item Open Access
A framework for resource efficient profiling of spatial model performance (Colorado State University. Libraries, 2022)
Carlson, Caleb, author; Pallickara, Shrideep, advisor; Pallickara, Sangmi Lee, advisor; Adams, Henry, committee member
We design models to understand phenomena, make predictions, and/or inform decision-making. This study targets models that encapsulate spatially evolving phenomena. Given a model M, our objective is to identify how well the model predicts across all geospatial extents. A modeler may expect these validations to occur at varying spatial resolutions (e.g., states, counties, towns, census tracts). Assessing a model with all available ground-truth data is infeasible due to the data volumes involved. We propose a framework to assess the performance of models at scale over diverse spatial data collections. Our methodology ensures orchestration of validation workloads while reducing memory strain, alleviating contention, enabling concurrency, and ensuring high throughput. We introduce the notion of a validation budget that represents an upper bound on the total number of observations used to assess the performance of models across spatial extents. The validation budget attempts to capture the distribution characteristics of observations and is informed by multiple sampling strategies. Our design decouples validation from the underlying model-fitting libraries so that it can interoperate with models designed using different libraries and analytical engines; our advanced research prototype currently supports Scikit-learn, PyTorch, and TensorFlow. We have conducted extensive benchmarks that demonstrate the suitability of our methodology.

Item Open Access
Achieving high-throughput distributed, graph-based multi-stage stream processing (Colorado State University. Libraries, 2015)
Suriarachchi, Amila, author; Pallickara, Shrideep, advisor; Pallickara, Sangmi Lee, committee member; Venkatachalam, Chandrasekaran, committee member
Processing complex computations on high-volume streaming data in real time is a challenge for many organizational data processing systems. Such systems should produce results with low latency while processing billions of messages daily. In order to address these requirements, distributed stream processing systems have been developed. Although high performance is one of the main goals of these systems, less attention has been paid to inter-node communication performance, which is a key aspect of overall system performance. In this thesis we describe a framework for enhancing inter-node communication efficiency. We compare the performance of our system with Twitter Storm and Yahoo S4 using an implementation of the Pan-Tompkins algorithm, which detects the QRS complexes of an ECG signal, on a two-node graph. Our results show that our solution performs 4 times better than the other systems. We also use a four-level node graph that processes smart plug data to test the performance of our system on a more complex graph. Finally, we demonstrate that our system is scalable and resilient to faults.
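For context on the benchmark workload named above, here is a condensed, single-process sketch of the classic Pan-Tompkins stages (bandpass filtering, differentiation, squaring, moving-window integration, thresholding). The filter parameters, sampling rate, and the simplified threshold rule are textbook-style assumptions; the thesis distributes these stages across a two-node graph rather than running them in-process.

```python
# Condensed sketch of the Pan-Tompkins QRS-detection stages, run in-process
# purely to illustrate the workload the thesis distributes across nodes.
import numpy as np
from scipy.signal import butter, filtfilt

def detect_qrs(ecg, fs=360):  # fs=360 Hz mirrors common ECG datasets (assumption)
    # 1. Bandpass filter (roughly 5-15 Hz) to emphasize the QRS complex.
    b, a = butter(2, [5 / (fs / 2), 15 / (fs / 2)], btype="band")
    filtered = filtfilt(b, a, ecg)
    # 2. Differentiate to highlight steep QRS slopes, then square.
    squared = np.diff(filtered) ** 2
    # 3. Moving-window integration over ~150 ms.
    window = int(0.15 * fs)
    integrated = np.convolve(squared, np.ones(window) / window, mode="same")
    # 4. Simple fixed threshold; the full algorithm uses dual adaptive
    #    thresholds with search-back, omitted here for brevity.
    peaks = np.where(integrated > 0.5 * integrated.max())[0]
    # Collapse runs of above-threshold samples into single beat locations.
    runs = np.split(peaks, np.where(np.diff(peaks) > window)[0] + 1)
    return [int(run[0] + np.argmax(integrated[run])) for run in runs if len(run)]
```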
Item Open Access
Aperture: a system for interactive visualization of voluminous geospatial data (Colorado State University. Libraries, 2020)
Bruhwiler, Kevin, author; Pallickara, Shrideep, advisor; Pallickara, Sangmi Lee, advisor; Ghosh, Sudipto, committee member; Chandrasekaran, Venkatachalam, committee member
The growth in observational data volumes over the past decade has occurred alongside a need to make sense of the phenomena that underpin them. Visualization is a key component of the data wrangling process that precedes the analyses informing these insights. The crux of this study is interactive visualization of spatiotemporal phenomena from voluminous datasets. Spatiotemporal visualizations of voluminous datasets introduce challenges relating to interactivity, overlaying of multiple datasets and dynamic feature selection, resource capacity constraints, and scaling. Our methodology to address these challenges relies on a novel mix of algorithms and systems innovations working in concert to ensure effective apportioning and amortization of workloads, enabling interactivity during visualizations. In particular, our research prototype, Aperture, leverages sketching algorithms, effective query predicate generation and evaluation, coprocessors for hardware acceleration, and convolutional neural network based encoders to render visualizations while avoiding performance hotspots and preserving responsiveness and interactivity. We also explore issues in effective containerization to support visualization workloads, and report on several empirical benchmarks that profile and demonstrate the suitability of our methodology for preserving interactivity while utilizing resources effectively at scale.

Item Open Access
Autonomous management of cost, performance, and resource uncertainty for migration of applications to infrastructure-as-a-service (IaaS) clouds (Colorado State University. Libraries, 2014)
Lloyd, Wes J., author; Pallickara, Shrideep, advisor; Arabi, Mazdak, committee member; Bieman, James, committee member; David, Olaf, committee member; Massey, Daniel, committee member
Infrastructure-as-a-Service (IaaS) clouds abstract physical hardware to provide computing resources on demand as a software service. This abstraction leads to the simplistic view that computing resources are homogeneous and that infinite scaling potential exists to easily resolve all performance challenges. In practice, however, adoption of cloud computing presents many resource management challenges, forcing practitioners to balance cost and performance tradeoffs to successfully migrate applications. These challenges can be broken down into three primary concerns that involve determining what, where, and when infrastructure should be provisioned. In this dissertation we address these challenges, including: (1) performance variance from resource heterogeneity, virtualization overhead, and the plethora of vaguely defined resource types; (2) virtual machine (VM) placement, component composition, service isolation, provisioning variation, and resource contention for multitenancy; and (3) dynamic scaling and resource elasticity to alleviate performance bottlenecks. These resource management challenges are addressed through the development and evaluation of autonomous algorithms and methodologies that result in demonstrably better performance and lower monetary costs for application deployments to both public and private IaaS clouds. This dissertation makes three primary contributions to advance cloud infrastructure management for application hosting. First, it includes the design of resource utilization models based on step-wise multiple linear regression and artificial neural networks that support prediction of better-performing component compositions. The total number of possible compositions is governed by Bell's number, which results in a combinatorially explosive search space. Second, it includes algorithms to improve VM placements that mitigate resource heterogeneity and contention using a load-aware VM placement scheduler, along with autonomous detection of under-performing VMs to spur replacement. Third, it describes a workload cost prediction methodology that harnesses regression models and heuristics to support determination of infrastructure alternatives that reduce hosting costs. Our methodology achieves infrastructure predictions with an average mean absolute error of only 0.3125 VMs for multiple workloads.
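To make the combinatorially explosive search space concrete, the short sketch below computes Bell numbers with the Bell-triangle recurrence; this is illustrative arithmetic, not code from the dissertation.

```python
# Bell numbers count the ways n application components can be partitioned
# across VMs; the Bell-triangle recurrence makes their growth easy to see.
def bell_numbers(n):
    """Return [B(0), ..., B(n)] via the Bell triangle."""
    bells, row = [1], [1]
    for _ in range(n):
        new_row = [row[-1]]                 # each row starts with the
        for value in row:                   # last entry of the previous row
            new_row.append(new_row[-1] + value)
        row = new_row
        bells.append(row[0])
    return bells

# Even modest component counts yield enormous composition search spaces:
# B(4) = 15, B(8) = 4140, B(12) = 4213597.
print(bell_numbers(12))
```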
Item Open Access
Containerization of model fitting workloads over spatial datasets (Colorado State University. Libraries, 2021)
Warushavithana, Menuka, author; Pallickara, Shrideep, advisor; Pallickara, Sangmi, advisor; Breidt, Jay, committee member
Spatial data volumes have grown exponentially over the past several years. Domains in which spatial data are extensively leveraged include the atmospheric sciences, environmental monitoring, ecological modeling, epidemiology, sociology, commerce, and social media, among others. These data are often used to understand phenomena and inform decision-making by fitting models to them. In this study, we present our methodology to fit models at scale over spatial data. Our methodology encompasses segmentation, spatial similarity based on the dataset(s) under consideration, and transfer learning schemes informed by that spatial similarity to train models faster while utilizing fewer resources. We consider several model fitting algorithms and execution within containerized environments as we profile the suitability of our methodology. Our benchmarks validate the suitability of our methodology for facilitating faster, resource-efficient training of models over spatial data.

Item Open Access
Distributed algorithms for the orchestration of stochastic discrete event simulations (Colorado State University. Libraries, 2014)
Sui, Zhiquan, author; Pallickara, Shrideep, advisor; Anderson, Charles, committee member; Böhm, Wim, committee member; Hayne, Stephen, committee member
Discrete event simulations are widely used in modeling real-world phenomena such as epidemiology, congestion analysis, weather forecasting, economic activity, and chemical reactions. The expressiveness of such simulations depends on the number and types of entities that are modeled and also on the interactions that entities have with each other. In the case of stochastic simulations, these interactions are based on the concomitant probability density functions. The more exhaustively a phenomenon is modeled, the greater its computational complexity and, correspondingly, its execution time. Distributed orchestration can speed up such complex simulations. This dissertation considers the problem of distributed orchestration of stochastic discrete event simulations where the computations are irregular and the processing loads stochastic. We have designed a suite of algorithms that target alleviating imbalances between processing elements across synchronization time steps. The algorithms explore different aspects of the orchestration spectrum: static vs. dynamic, reactive vs. proactive, and deterministic vs. learning-based. The feature vector that guides our algorithms includes externally observable features of the simulation, such as computational footprints and hardware profiles, and features internal to the simulation, such as entity states. The learning structures include a basic artificial neural network (ANN) and an improved version of the ANN. The algorithms are self-tuning and account for the state of the simulation and processing elements while coping with prediction errors. Finally, these algorithms address resource uncertainty as well. Resource uncertainty in such settings occurs due to resource failures, slowdowns, and heterogeneity. Task apportioning, speculative tasks to cope with stragglers, and checkpointing account for the quality and state of both the resources and the simulation. The algorithms achieve demonstrably good performance: despite the irregular nature of these computations, stochasticity in the processing loads, and resource uncertainty, execution times are reduced by a factor of 1.8 when the number of resources is doubled.
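As a deliberately simple illustration of rebalancing stochastic loads across processing elements at a synchronization step, the sketch below applies greedy longest-processing-time apportioning to predicted task costs. The cost values are placeholders for the feature-driven ANN estimates the dissertation describes; this is a baseline heuristic, not the dissertation's learning-based schemes.

```python
# Greedy longest-processing-time (LPT) apportioning: assign each predicted
# task cost to the currently least-loaded processing element.
import heapq

def apportion(predicted_costs, num_elements):
    """Map task index -> processing element, balancing predicted load."""
    heap = [(0.0, pe) for pe in range(num_elements)]  # (load, element id)
    heapq.heapify(heap)
    assignment = {}
    # Place the largest predicted costs first for a tighter balance.
    for task in sorted(range(len(predicted_costs)),
                       key=lambda t: predicted_costs[t], reverse=True):
        load, pe = heapq.heappop(heap)
        assignment[task] = pe
        heapq.heappush(heap, (load + predicted_costs[task], pe))
    return assignment

# Predicted per-task costs would come from the learned models; these numbers
# are made up for illustration.
print(apportion([9.0, 3.5, 7.2, 1.1, 4.8, 6.3], num_elements=2))
```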
Item Open Access
Harnessing spatiotemporal data characteristics to facilitate large-scale analytics over voluminous, high-dimensional observational datasets (Colorado State University. Libraries, 2021)
Rammer, Daniel P., author; Pallickara, Shrideep, advisor; Pallickara, Sangmi Lee, advisor; Ghosh, Sudipto, committee member; Breidt, Jay, committee member
Spatiotemporal data volumes have increased exponentially alongside a need to extract knowledge from them. We propose a methodology, encompassing a suite of algorithmic and systems innovations, to accomplish spatiotemporal data analysis at scale. Our methodology partitions and distributes data to reconcile the competing pulls of dispersion and load balancing. The dispersion schemes are complemented by distributed data structures that organize metadata in support of expressive query evaluations and high-throughput data retrievals. Targeted, sequential disk block accesses and data sketching techniques are leveraged for effective retrievals. We facilitate seamless integration into data processing frameworks and analytical engines by building compliance with the Hadoop Distributed File System. A refinement of our methodology supports memory residency and dynamic materialization of data (or subsets thereof) as DataFrames, Datasets, and Resilient Distributed Datasets. These refinements are backed by speculative prefetching schemes that manage speed differentials across the data storage hierarchy. We extend the data-centric view of our methodology to the orchestration of deep learning workloads while preserving accuracy and ensuring faster completion times. Finally, we assess the suitability of our methodology using diverse high-dimensional datasets, myriad model fitting algorithms (including ensemble methods and deep neural networks), and multiple data processing frameworks such as Hadoop, Spark, TensorFlow, and PyTorch.

Item Open Access
Horizontal scaling of video conferencing applications in virtualized environments (Colorado State University. Libraries, 2016)
Luo, Mante, author; Pallickara, Shrideep, advisor; Papadopoulos, Christos, committee member; Turk, Daniel, committee member
Video conferencing is one of the most widely used services in the world. However, it usually requires dedicated hardware and expensive licenses. Cloud computing has helped many companies achieve lower operating costs, and many applications, including video conferencing, are being transitioned into the cloud. However, most video-conferencing applications do not support horizontal scaling as a built-in feature, which is essential to embracing the advantages of virtualized environments. The objective of this thesis is to explore horizontal scaling of video conferencing applications. We explore these ideas in the context of Jitsi, an open-source video-conferencing application. The thesis develops a methodology for horizontal scaling in the Amazon EC2 cloud with the objective of preserving quality-of-service metrics such as per-packet latency (primarily), loss rates, jitter, and the number of participants per session. We build predictive models to inform our horizontal scaling decisions. Proactive scaling allows us to preserve several quality-of-service metrics for video conferencing. Scaling in the EC2 environment is fast and cost-effective, with the added benefit of high availability, which helps us support a large number of users consistently without much downtime.
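A minimal sketch of the proactive, prediction-driven scaling decision described above: a fitted model estimates per-packet latency from offered load, and instances are added before an SLA is breached. The linear model form, features, training data, and SLA threshold are all assumptions for illustration, not the thesis's fitted models.

```python
# Proactive scaling sketch: predict per-packet latency from offered load and
# scale out before a latency SLA is violated. Numbers are illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: (participants, sessions) -> latency in ms.
X = np.array([[10, 2], [40, 5], [80, 9], [120, 14], [160, 18]])
y = np.array([8.0, 14.0, 25.0, 41.0, 62.0])
model = LinearRegression().fit(X, y)

LATENCY_SLA_MS = 50.0  # assumed per-packet latency target

def instances_needed(participants, sessions, current_instances):
    """Scale out while the predicted per-instance latency exceeds the SLA."""
    n = current_instances
    while model.predict([[participants / n, sessions / n]])[0] > LATENCY_SLA_MS:
        n += 1
    return n

print(instances_needed(participants=300, sessions=30, current_instances=2))
```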
Item Open Access
Identification and characterization of super-spreaders from voluminous epidemiology data (Colorado State University. Libraries, 2016)
Shah, Harshil, author; Pallickara, Shrideep, advisor; Pallickara, Sangmi, advisor; Breidt, F. Jay, committee member
Planning for large-scale epidemiological outbreaks often involves executing compute-intensive disease spread simulations. To capture the probabilities of various outcomes, these simulations are executed several times over a collection of representative input scenarios, producing voluminous data. The resulting datasets contain valuable insights, including sequences of events, such as super-spreading events, that lead to extreme outbreaks. However, discovering and leveraging such information is also computationally expensive. In this thesis, we propose a distributed approach for analyzing voluminous epidemiology data to locate and classify the super-spreaders in a disease network. Our methodology constructs analytical models using features extracted from the epidemiology data. The analytical models are amenable to interpretation, and disease planners can use them to inform the identification of super-spreaders that have a disproportionate effect on epidemiological outcomes, enabling effective allocation of limited resources such as vaccinations and field personnel.

Item Open Access
Implications of storage subsystem interactions on processing efficiency in data intensive computing (Colorado State University. Libraries, 2015)
Koneru, Hanisha, author; Pallickara, Shrideep, advisor; Pallickara, Sangmi, committee member; Arabi, Mazdak, committee member
Processing frameworks such as MapReduce allow development of programs that operate on voluminous on-disk data. These frameworks typically include support for multiple file/storage subsystems. This decoupling of processing frameworks from the underlying storage subsystem provides a great deal of flexibility in application development. However, as we demonstrate, this flexibility often exacts a price: performance. Given the data volumes involved, storage subsystems (such as HDFS, MongoDB, and HBase) disperse datasets over a collection of machines. Storage subsystems manage complexity relating to preservation of consistency, redundancy, failure recovery, throughput, and load balancing. Preserving these properties involves message exchanges between distributed subsystem components, updates to in-memory data structures, data movements, and coordination as datasets are staged and system conditions change. Storage subsystems prioritize these properties differently, leading to vastly different network, disk, memory, and CPU footprints for staging and accessing the same dataset. This thesis proposes a methodology for comparing storage subsystems and identifying the one best suited to the processing being performed on a dataset. We profile the network I/O, disk I/O, memory, and CPU costs introduced by a storage subsystem during data staging, data processing, and generation of results. We perform this analysis with different storage subsystems and with applications that have different disk-I/O-to-CPU-processing ratios.
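A skeletal version of the kind of per-phase resource profiling described above, using the psutil library to snapshot CPU, memory, disk-I/O, and network counters around a workload phase. The phase structure and reporting are simplifications for illustration, not the thesis's instrumentation.

```python
# Skeletal resource-footprint profiler: snapshot system counters before and
# after a phase (staging, processing, result generation) and report deltas.
# Note: psutil.disk_io_counters() can return None on some platforms.
import time
import psutil

def profile_phase(name, phase_fn):
    disk0, net0 = psutil.disk_io_counters(), psutil.net_io_counters()
    psutil.cpu_percent(interval=None)        # reset the CPU sampling window
    t0 = time.time()
    phase_fn()
    cpu = psutil.cpu_percent(interval=None)  # mean CPU % over the phase
    disk1, net1 = psutil.disk_io_counters(), psutil.net_io_counters()
    print(f"{name}: {time.time() - t0:.1f}s cpu={cpu:.0f}% "
          f"disk_read={disk1.read_bytes - disk0.read_bytes}B "
          f"disk_write={disk1.write_bytes - disk0.write_bytes}B "
          f"net_sent={net1.bytes_sent - net0.bytes_sent}B "
          f"mem={psutil.virtual_memory().percent:.0f}%")

# Usage: wrap each phase of interest, e.g. the data-staging step.
profile_phase("staging", lambda: time.sleep(1))  # placeholder workload
```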
Item Open Access
Leveraging ensembles: balancing timeliness and accuracy for model training over voluminous datasets (Colorado State University. Libraries, 2020)
Budgaga, Walid, author; Pallickara, Shrideep, advisor; Pallickara, Sangmi Lee, advisor; Ben-Hur, Asa, committee member; Breidt, F. Jay, committee member
As data volumes increase, there is a pressing need to make sense of the data in a timely fashion. Voluminous datasets are often high dimensional, with individual data points representing a vector of features. Data scientists fit models to the data, using all features or a subset thereof, and then use these models to inform their understanding of phenomena or make predictions. The performance of these analytical models is assessed based on their accuracy and their ability to generalize to unseen data. Several existing frameworks can be used for drawing insights from voluminous datasets. However, these frameworks have some shortcomings, including limited scalability, limited applicability beyond a target domain, prolonged training times, poor resource utilization, and insufficient support for combining diverse model fitting algorithms. In this dissertation, we describe our methodology for scalable supervised learning over voluminous datasets. The methodology explores the impact of partitioning the feature space and building models over these partitioned subsets of the data, and the effects of doing so on training times and accuracy. Using our methodology, a practitioner can harness a mix of learning methods to build diverse models over the partitioned data. Rather than building a single, all-encompassing model, we construct an ensemble of models trained independently over different portions of the dataset. In particular, we rely on concurrent and independent learning from different portions of the data space to overcome the issues relating to resource utilization and completion times associated with distributed training of a single model over the entire dataset. Our empirical benchmarks are performed using datasets from diverse domains, including epidemiology, music, and weather. These benchmarks demonstrate the suitability of our methodology for reducing training times while preserving accuracy, in contrast to the results obtained from a complex model trained on the entire dataset. In particular, our methodology utilizes resources effectively, amortizing I/O and CPU costs across a distributed environment while ensuring a significant reduction in network traffic during training.

Item Open Access
Leveraging stream processing engines in support of physiological data processing (Colorado State University. Libraries, 2018)
Mishra, Sitakanta, author; Pallickara, Shrideep, advisor; Pallickara, Sangmi, committee member; Venkatachalam, Chandra, committee member
Over the last decade, there has been an exponential growth in the unbounded streaming data generated by sensing devices in different settings, including the Internet of Things. Several frameworks have been developed to facilitate effective monitoring, processing, and analysis of the continuous flow of streams generated in such settings. Real-time data collected from patient monitoring systems, wearable devices, and the like can take advantage of stream processing engines in distributed computing environments to provide better care and services to both individuals and medical practitioners. This thesis proposes a methodology for monitoring multiple users using stream data processing pipelines. We have designed data processing pipelines using the two dominant stream processing frameworks – Storm and Spark. We used the University of Queensland's Vital Sign Dataset in our assessments. Our assessments contrast these systems based on processing latencies, throughput, and the number of concurrent users that can be supported in a given pipeline.
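A framework-agnostic sketch of the kind of latency/throughput measurement such assessments involve: each record carries an ingest timestamp, and the consumer reports end-to-end latency percentiles and sustained throughput. Storm or Spark stages would sit where process() does; the record fields and threshold checks below are assumptions for illustration.

```python
# Framework-agnostic measurement sketch: timestamp each record at ingest,
# measure end-to-end latency at the end of the pipeline, and report
# percentiles and throughput.
import time
import statistics

def process(record):
    # Stand-in for a vital-sign pipeline stage (e.g., simple alert checks).
    return record["hr"] > 120 or record["spo2"] < 90

records = [{"hr": 70 + (i % 60), "spo2": 95, "ingest_ts": time.perf_counter()}
           for i in range(100_000)]

latencies = []
start = time.perf_counter()
for record in records:
    process(record)
    latencies.append(time.perf_counter() - record["ingest_ts"])
elapsed = time.perf_counter() - start

print(f"throughput: {len(records) / elapsed:,.0f} records/s")
print(f"median latency: {statistics.median(latencies) * 1e3:.3f} ms")
print(f"p99 latency: {statistics.quantiles(latencies, n=100)[98] * 1e3:.3f} ms")
```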
Item Open Access
Low-latency, query-driven analytics over voluminous multidimensional, spatiotemporal datasets (Colorado State University. Libraries, 2017)
Malensek, Matthew, author; Pallickara, Shrideep, advisor; Pallickara, Sangmi Lee, advisor; Bohm, A. P. Willem, committee member; Draper, Bruce, committee member; Breidt, F. Jay, committee member
Ubiquitous data collection from sources such as remote sensing equipment, networked observational devices, location-based services, and sales tracking has led to the accumulation of voluminous datasets; IDC projects that by 2020 we will generate 40 zettabytes of data per year, while Gartner and ABI estimate that 20-35 billion new devices will be connected to the Internet in the same time frame. The storage and processing requirements of these datasets far exceed the capabilities of modern computing hardware, which has led to the development of distributed storage frameworks that can scale out by assimilating more computing resources as necessary. While challenging in its own right, storing and managing voluminous datasets is only the precursor to a broader field of study: extracting knowledge, insights, and relationships from the underlying datasets. The basic building block of this knowledge discovery process is the analytic query, encompassing both query instrumentation and evaluation. This dissertation is centered around query-driven exploratory and predictive analytics over voluminous, multidimensional datasets. Both of these types of analysis represent a higher-level abstraction over classical query models; rather than indexing every discrete value for subsequent retrieval, our framework autonomously learns the relationships and interactions between dimensions in the dataset (including time series and geospatial aspects) and makes the information readily available to users. This functionality includes statistical synopses, correlation analysis, hypothesis testing, probabilistic structures, and predictive models that not only enable the discovery of nuanced relationships between dimensions, but also allow future events and trends to be predicted. This requires specialized data structures and partitioning algorithms, along with adaptive reductions in the search space and management of the inherent trade-off between timeliness and accuracy. The algorithms presented in this dissertation were evaluated empirically on real-world geospatial time-series datasets in a production environment, and are broadly applicable across other storage frameworks.
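Statistical synopses of the kind mentioned above can be maintained online rather than by indexing every discrete value; a textbook example is Welford's single-pass mean/variance update, sketched per dimension below. This is a generic illustration of the idea, not the dissertation's data structures.

```python
# Welford's single-pass algorithm: a per-dimension statistical synopsis that
# is updated as observations stream in, without retaining the raw data.
class RunningStats:
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

# One synopsis per dimension (e.g., temperature) per spatial partition.
stats = RunningStats()
for reading in [21.5, 22.1, 19.8, 20.4, 23.0]:
    stats.update(reading)
print(f"n={stats.n} mean={stats.mean:.2f} var={stats.variance:.2f}")
```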
Item Open Access
Near real-time processing of voluminous, high-velocity data streams for continuous sensing environments (Colorado State University. Libraries, 2020)
Hewa Raga Munige, Thilina, author; Pallickara, Shrideep, advisor; Chandrasekar, V., committee member; Ghosh, Sudipto, committee member; Pallickara, Sangmi, committee member
Recent advancements in miniaturization, falling costs, networking enhancements, and battery technologies have contributed to a proliferation of networked sensing devices. Arrays of coordinated sensing devices are deployed in continuous sensing environments (CSEs) where the phenomena of interest are monitored. Observations sensed by devices in a CSE setting are encapsulated as multidimensional data streams that must subsequently be processed. The vast number of sensing devices, the high rates at which data are generated, and the high resolutions at which these measurements are performed contribute to the voluminous, high-velocity data streams that are now increasingly pervasive. These data streams must be processed in near real time to power user-facing applications such as visualization dashboards and monitoring systems, as well as various stages of data ingestion pipelines such as ETL pipelines. This dissertation focuses on facilitating efficient ingestion and near real-time processing of voluminous, high-velocity data streams originating in CSEs. Challenges in ingesting and processing such streams include energy and bandwidth constraints at the data sources, data transfer and processing costs, underutilized resources, and preserving the performance of stream processing applications in the presence of variable workloads and system conditions. Toward this end, we explore design principles for building a high-performance, adaptive stream processing engine that addresses the processing challenges unique to CSE data streams. Further, we demonstrate how our holistic methodology, based on space-efficient representations of data streams obtained through a controlled trade-off of accuracy, can substantially alleviate stream ingestion challenges while improving stream processing performance. We evaluate the efficacy of our methodology using real-world streaming datasets in a large-scale setup and contrast it with state-of-the-art developments in the field.

Item Open Access
On the evaluation of exact-match and range queries over multidimensional data in distributed hash tables (Colorado State University. Libraries, 2012)
Malensek, Matthew, author; Pallickara, Shrideep, advisor; Draper, Bruce, committee member; Randall, David, committee member
The quantity and precision of the geospatial and time-series observational data being collected have increased alongside the steady expansion of processing and storage capabilities in modern computing hardware. The storage requirements for this information vastly exceed the capabilities of a single computer, and are primarily met in a distributed manner. However, distributed solutions often impose strict constraints on retrieval semantics. In this thesis, we investigate the factors that influence storage and retrieval operations on large datasets in a cloud setting, and propose a lightweight data partitioning and indexing scheme to facilitate these operations. Our solution provides expressive retrieval support through range-based and exact-match queries and can be applied over massive quantities of multidimensional data. We provide benchmarks to illustrate the relative advantage of using our solution over a general-purpose cloud storage engine in a distributed network of heterogeneous computing resources.
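The tension this thesis addresses can be seen in miniature: plain hashing supports exact-match lookups but scatters contiguous key ranges across nodes, while an order-preserving scheme keeps ranges co-located and range queries cheap. The node count and key format below are illustrative assumptions, not the thesis's partitioning scheme.

```python
# Contrast of hash placement (good for exact-match lookups) with an
# order-preserving placement (keeps key ranges on few nodes, enabling
# efficient range queries). Key format is an illustrative assumption.
import hashlib

NODES = [f"node{i}" for i in range(8)]

def hash_placement(key):
    digest = hashlib.md5(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

def order_preserving_placement(key):
    # Partition a zero-padded, sortable key space into contiguous slices.
    return NODES[int(key[:2]) * len(NODES) // 100]

keys = [f"{day:02d}-observation" for day in range(10, 14)]  # a small range
print({k: hash_placement(k) for k in keys})              # scattered nodes
print({k: order_preserving_placement(k) for k in keys})  # co-located keys
```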
Item Open Access
On the role of topology in autonomously coping with failures in content dissemination systems (Colorado State University. Libraries, 2014)
Stern, Ryan, author; Pallickara, Shrideep, advisor; Strout, Michelle, committee member; Turk, Daniel, committee member
Content dissemination systems provide a substrate that allows large numbers of entities to communicate with each other. These entities could be processes, sensors, or networked instruments that produce and consume data streams. To ensure scaling, the content dissemination substrate comprises a large number of distributed nodes. As the number of participating nodes increases, the likelihood of failures also increases. These failures can occur for any number of reasons, including faulty hardware, programmer or user error, power failure, and network outages. Node failures can result in partitions, with the original set of connected nodes disintegrating into smaller, disjoint subsets. Brewer's CAP theorem limits the choices for a partitioned system: availability or consistency, but not both. It is therefore desirable to make partitions less likely in the first place. This thesis explores how the nodes comprising a content dissemination system can be organized into topologies with the objective of improved partition tolerance. The topologies we consider are based on random, regular, power-law, and Watts-Strogatz small-world graphs. Connections within these topologies can account for network proximity and are suitable for real-time communications. We explore specific attributes of a topology that contribute to its partition resiliency, such as clustering coefficients, the distribution of random links, and preferential attachment. Metrics we use to profile the suitability of different topologies include communication path lengths, migration of workloads, and the impact on system throughput. This research will allow designers to choose topologies or configure metrics to achieve performance objectives and a desired degree of partition tolerance.

Item Open Access
On the support for heterogeneous languages in cloud runtimes (Colorado State University. Libraries, 2010)
Ericson, Kathleen, author; Pallickara, Shrideep, advisor; Bohm, Anton Pedro Willem, 1948-, committee member; Randall, David A. (David Allan), 1948-, committee member
Cloud runtimes are an effective method of distributing computations, but often have little support for computations written in diverse languages. We have extended the Granules cloud runtime with a bridge framework that allows computations to be written in a number of languages. Granules computations are dynamic and can be characterized as long-running with intermittent CPU bursts, allowing state to build during successive rounds of execution. Our goal is to develop a framework that supports real-time processing in long-running computations that maintain state across multiple runs of the computation. Due to the nature of Granules computations, we need the bridges to be bidirectional: both Granules and the bridged computation should be able to steer the program flow as needed. In order to conserve resources and maintain communications during heavy loads, the framework needs to allow communication over multiple channels and be able to switch the bridging mechanism in a transparent manner. Different communication methods should be available to a computation at all times, without requiring rewrites of the original computation. Our current implementation supports bridging in C, C++, C#, Python, and R. We have also designed a diagnostics system, which gathers information on system state and is able to modify the underlying bridge framework in response to system load. This diagnostics system is capable of transparently initiating a switch of communication methods, which allows the system to free up limited resources as necessary.
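A toy bidirectional bridge in the spirit described above: newline-delimited JSON flows both ways over a child process's stdin/stdout, so either side can steer the exchange. This is a generic illustration of one possible channel, not the Granules bridge protocol or its channel-switching machinery.

```python
# Toy bidirectional bridge: exchange newline-delimited JSON with a child
# process over its stdin/stdout pipes. Generic illustration only.
import json
import subprocess
import sys

CHILD = r'''
import json, sys
for line in sys.stdin:                      # the bridged computation
    msg = json.loads(line)
    reply = {"square": msg["value"] ** 2, "steer": msg["value"] > 10}
    print(json.dumps(reply), flush=True)    # child can steer via its reply
'''

proc = subprocess.Popen([sys.executable, "-c", CHILD],
                        stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                        text=True)

for value in (3, 12):
    proc.stdin.write(json.dumps({"value": value}) + "\n")
    proc.stdin.flush()
    print(json.loads(proc.stdout.readline()))  # runtime reacts to the reply

proc.stdin.close()
proc.wait()
```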
Item Open Access
Preservation of low latency service request processing in dockerized microservice architectures (Colorado State University. Libraries, 2016)
Sudalaikkan, Leo Vigneshwaran, author; Pallickara, Shrideep, advisor; Pallickara, Sangmi Lee, committee member; Vijayasarathy, Leo, committee member
Organizations are increasingly transitioning from monolithic architectures to microservice-based architectures. Software built as microservices can be broken into multiple components that are easily deployable and scalable while providing good utilization of resources. A popular approach to building microservices is through containers. Docker is an open-source technology for building, deploying, and executing distributed applications within containers, which are referred to as pods in Docker orchestrator terminology. The objective of this thesis is the dynamic and targeted scaling of the pods comprising an application to ensure low-latency servicing of requests. Our methodology targets the identification of impending latency-constraint violations and performs targeted scaling maneuvers to alleviate load at a particular pod. Empirical benchmarks demonstrate the suitability of our approach.

Item Open Access
Real time stream processing for Internet of things and sensing environments (Colorado State University. Libraries, 2015)
Hewa Raga Munige, Thilina, author; Pallickara, Shrideep, advisor; Pallickara, Sangmi, committee member; Turk, Daniel, committee member
Improvements in the miniaturization and networking capabilities of sensors have contributed to the proliferation of Internet of Things (IoT) and continuous sensing environments. Data streams generated in such settings must be processed in real time, keeping pace with generation rates. Challenges in accomplishing this include high data arrival rates, buffer overflows, context switches during processing, and object creation overheads. We propose a holistic framework that addresses the CPU, memory, network, and kernel issues involved in stream processing. Our prototype, Neptune, builds on the Granules cloud runtime and leverages its support for scheduling packets and for publish/subscribe, peer-to-peer, and point-to-point communications. The framework maximizes bandwidth utilization in the presence of small messages via buffering and dynamic compaction of packets based on their entropy. Our use of thread pools and batched processing reduces context switches and improves effective CPU utilization. The framework alleviates the memory pressure that can lead to swapping, page faults, and thrashing through efficient reuse of objects. To cope with buffer overflows, we rely on flow control and on throttling the preceding stages of a processing pipeline. Our correctness criteria include deadlock/livelock avoidance and ordered, exactly-once processing. Our benchmarks demonstrate the suitability of the Granules/Neptune combination, and we contrast our performance with that of Apache Storm, the dominant stream-processing framework developed by Twitter. At a single node, we achieve a processing rate of ~2 million stream packets per second. In a distributed cluster setup, we achieve a processing rate of ~100 million stream packets per second with near-optimal bandwidth utilization.
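The entropy-based compaction decision mentioned above can be illustrated simply: estimate the Shannon entropy of a buffered payload and compress it only when the bytes look redundant. The cutoff value below is an assumption for illustration; Neptune's actual policy is not reproduced here.

```python
# Entropy-guided compaction sketch: compress a buffered payload only when
# its estimated Shannon entropy suggests the bytes are compressible. The
# 6.5 bits/byte cutoff is an illustrative assumption.
import math
import os
import zlib
from collections import Counter

def shannon_entropy(payload: bytes) -> float:
    """Bits per byte; values near 8.0 indicate incompressible-looking data."""
    counts = Counter(payload)
    total = len(payload)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def maybe_compact(payload: bytes, cutoff: float = 6.5) -> bytes:
    if shannon_entropy(payload) < cutoff:
        return b"\x01" + zlib.compress(payload)   # tag byte: compacted
    return b"\x00" + payload                      # tag byte: sent as-is

redundant = b"temperature=21.4;" * 100   # low entropy -> compacted
random_ish = os.urandom(1700)            # high entropy -> sent as-is
print(len(maybe_compact(redundant)), len(maybe_compact(random_ish)))
```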