December 7, 2017 –

**Title:** Hierarchical Structures for High Dimensional Data Analysis

**Major Advisor: **Dr. Donald Sheehy

**Associate Advisors:** Dr. Thomas Peters, Dr. Sanguthevar Rajasekaran

**Date/Time:** Thursday, December 7, 2017 at 10:00am

**Location:** Homer Babbidge Library Class of 1947 Conference Room

**Abstract:**

The volume of data is not the only problem in modern data analysis, data complexity is often more challenging. In many areas such as computational biology, topological data analysis, and machine learning, the data resides in high dimensional spaces which may not even be Euclidean. Therefore, processing such massive and complex data and extracting some useful information is a big challenge. Here, the input is a set of objects and a metric that measures the distance between pairs of objects.

In this proposal, we first consider the problem of preprocessing and organizing such complex data into a hierarchical data structure that allows efficient nearest neighbor and range queries. There have been many data structures for general metric spaces, but almost all of them have construction time that can be quadratic in terms of the number of points. There are only two data structures with O(n log n) construction time but both have very complex algorithms and analyses, and they cannot be implemented efficiently. Here, we present a simple randomized incremental algorithm that builds a metric data structure in O(n log n) time in expectation. Thus, we achieve the best of both worlds, simple implementation with theoretically optimal performance.

Furthermore, we consider the close relationship between our metric data structure and orderings on the points used in applications such as k-center clustering. We give linear time algorithms to go back and forth between these orderings and our metric data structure.

In the last part of this proposal, we use metric data structures to extract topological features of a data set, such as the number of connected components, holes, and voids. We give a linear time algorithm for constructing a (1 + ε)-approximation to the so-called Vietoris-Rips filtration of a metric space, a fundamental tool in topological data analysis.