April 16, 2018
Title: Performance Prediction and Tuning for Large-Scale Data Analytics Systems
PhD Candidate: Nhan Nguyen
Major Advisor: Dr. Mohammad Maifi Hasan Khan
Associate Advisors: Dr. Swapna Gokhale, Dr. Bing Wang, Dr. Song Han
Date/Time: Monday, April 16th, 2018 at 2:00 PM
Location: Laurel Hall 111
Cloud-based solutions are increasingly being used to implement large-scale dynamic data-driven application systems such as smart grid monitoring, remote surveillance, and Internet of Things (IoT) applications. These systems are often characterized by multiple layers of software, where the data analytic layer (e.g., Apache Spark) runs on top of a storage layer responsible for handling and storing streaming raw data in replicated cloud-database systems (e.g., Apache Cassandra). Despite the advantages offered by these cloud-based solutions (e.g., fault tolerance, scalability), unpredictable load conditions, inter-layer dependencies, and a high degree of configurability make it difficult to diagnose and address suboptimal performance while minimizing resource requirements. To address this, we investigate scalable solutions for automated performance tuning and resource allocation across multiple layers in cloud settings as follows.
First, given the multi-layer architecture of cloud-based solutions, we focus on performance at the storage layer. Specifically, we design and implement a proactive resource allocation framework that predicts future workloads in real time and redistributes sensor streams to different servers as needed to minimize both data loss and resource requirements. To further reduce resource requirements, we design a model-driven middleware service that leverages a priori constructed models to predict and mitigate possible overload conditions by splitting streams into multiple substreams.
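The core control loop of such a framework can be illustrated with a toy sketch: forecast each server's near-term load (here with a simple moving average; the thesis's actual prediction model is not specified in this abstract) and migrate sensor streams away from servers predicted to overload. All class names, thresholds, and the forecasting method below are illustrative assumptions, not the framework's implementation.

```python
from collections import deque

class StreamBalancer:
    """Illustrative sketch of proactive stream redistribution.
    Forecasts per-server load with a moving average and migrates
    streams off servers whose predicted load exceeds capacity."""

    def __init__(self, capacity, window=3):
        self.capacity = capacity   # max sustainable load per server (events/sec)
        self.window = window       # number of recent samples used for the forecast
        self.history = {}          # server -> recent load samples
        self.placement = {}        # stream -> server currently handling it

    def observe(self, server, load):
        """Record a load measurement for a server."""
        self.history.setdefault(server, deque(maxlen=self.window)).append(load)

    def predict(self, server):
        """Moving-average forecast of the server's next-interval load."""
        samples = self.history.get(server)
        return sum(samples) / len(samples) if samples else 0.0

    def rebalance(self):
        """Move streams off predicted-overloaded servers; return migrations."""
        migrations = []
        for stream, server in list(self.placement.items()):
            if self.predict(server) > self.capacity:
                # Redirect the stream to the least-loaded known server.
                target = min(self.history, key=self.predict)
                if target != server:
                    self.placement[stream] = target
                    migrations.append((stream, server, target))
        return migrations
```

A real deployment would replace the moving average with the framework's trained workload predictor and account for migration cost, but the proactive structure (predict, then redistribute before overload occurs) is the same.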
Next, to improve the performance of the data analytic layer, we investigate the challenge of implementing fault-tolerant parallel data analytic tasks by leveraging the services provided by the underlying cloud storage layer (e.g., data replication, node failure detection). To evaluate our work, we design the data structures and data models needed for effective parallelization of the motif mining algorithm, which enables the data analysis system to recover from arbitrary node failures by redistributing computational tasks in real time, reducing overall execution time.
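The recovery idea leveraged here can be sketched in miniature: because the storage layer replicates each data partition on several nodes, a task assigned to a failed node can be reassigned to a surviving replica of its data. The function below is a hypothetical sketch of that reassignment step, not the thesis's parallel motif mining implementation.

```python
def reassign_tasks(assignment, replicas, failed_node):
    """Illustrative sketch of replica-aware failure recovery.

    assignment: task -> node currently executing it
    replicas:   task -> list of nodes holding a replica of the task's data
    Returns a new assignment in which every task on the failed node is
    moved to a surviving node that already holds its data (preserving
    data locality, so no data transfer is needed before recomputation).
    """
    recovered = {}
    for task, node in assignment.items():
        if node != failed_node:
            recovered[task] = node
            continue
        survivors = [n for n in replicas[task] if n != failed_node]
        if not survivors:
            raise RuntimeError(f"no surviving replica for task {task!r}")
        recovered[task] = survivors[0]
    return recovered
```

In a system like Cassandra, the replica placement would come from the cluster's replication strategy and failures would be reported by its failure detector; this sketch only shows the redistribution logic itself.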
Finally, we focus on configuration tuning in cloud settings to improve system performance without allocating additional hardware resources. Specifically, we investigate a scalable framework that synthesizes measurements from multiple runs and uses machine learning methods to build performance influence models for different configuration settings. We then apply the constructed performance models to tune the performance of different applications.
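As a minimal sketch of a performance influence model, one can fit a linear model that maps configuration options to measured runtime via ordinary least squares; the learned coefficients then quantify each option's performance influence. The abstract does not name the thesis's actual learning method, so the linear model, dataset, and function names below are assumptions for illustration only.

```python
def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))  # pivot row
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c and M[c][c] != 0:
                f = M[r][c] / M[c][c]
                M[r] = [mr - f * mc for mr, mc in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_influence_model(configs, runtimes):
    """Least-squares fit of runtime as a linear function of config options.

    configs:  list of configuration vectors (e.g., 0/1 option flags)
    runtimes: measured runtime for each configuration
    Returns [intercept, w_1, ..., w_k]; each w_i estimates option i's
    influence on runtime, which can then guide tuning decisions.
    """
    X = [[1.0] + list(row) for row in configs]  # prepend intercept column
    n = len(X[0])
    # Normal equations: (X^T X) w = X^T y
    XtX = [[sum(x[i] * x[j] for x in X) for j in range(n)] for i in range(n)]
    Xty = [sum(x[i] * y for x, y in zip(X, runtimes)) for i in range(n)]
    return solve_linear(XtX, Xty)
```

Given measurements from multiple runs under different settings, the fitted coefficients rank the options by influence; the tuner can then favor settings whose coefficients predict lower runtime. Practical frameworks typically add interaction terms and regularization, omitted here for brevity.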