
Collaborating Authors

 Phillips, Jeff M.


Robust High-Dimensional Mean Estimation With Low Data Size, an Empirical Study

arXiv.org Machine Learning

Robust statistics aims to compute quantities that represent data even when a fraction of it may be arbitrarily corrupted. The most essential statistic is the mean, and in recent years there has been a flurry of theoretical advancement for efficiently estimating the mean in high dimensions on corrupted data. While several algorithms have been proposed that achieve near-optimal error, they all rely on large data size requirements as a function of dimension. In this paper, we perform extensive experiments over various mean estimation techniques where the data size might not meet this requirement due to the high-dimensional setting. For data with inliers generated from a Gaussian with known covariance, we find experimentally that several robust mean estimation techniques can practically improve upon the sample mean, with the quantum entropy scaling approach from Dong et al.
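
As a quick illustration of the setting (a minimal sketch, not any of the surveyed estimators), the snippet below corrupts a fraction of Gaussian samples and compares the sample mean against a simple robust baseline, the coordinate-wise median; the dimension, corruption level, and outlier placement are arbitrary choices for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, eps = 100, 500, 0.1                 # dimension, data size, corruption fraction
true_mean = np.zeros(d)

X = rng.normal(loc=true_mean, scale=1.0, size=(n, d))   # inliers ~ N(0, I)
X[: int(eps * n)] = 10.0                                 # adversarially placed outliers

sample_mean = X.mean(axis=0)
coord_median = np.median(X, axis=0)        # a simple robust baseline

print("sample mean error:  ", np.linalg.norm(sample_mean - true_mean))
print("coord-median error: ", np.linalg.norm(coord_median - true_mean))
```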


No Dimensional Sampling Coresets for Classification

arXiv.org Artificial Intelligence

We refine and generalize what is known about coresets for classification problems via the sensitivity sampling framework. Such coresets seek the smallest possible subsets of input data, so one can optimize a loss function on the coreset and ensure approximation guarantees with respect to the original data. Our analysis provides the first no dimensional coresets, so the size does not depend on the dimension. Moreover, our results are general: they apply to distributional input, can use i.i.d. samples (and so provide sample complexity bounds), and work for a variety of loss functions. A key tool we develop is a Rademacher complexity version of the main sensitivity sampling approach, which may be of independent interest.
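
The following sketch illustrates the general sensitivity sampling recipe on a weighted logistic loss; the surrogate importance score is invented purely for illustration and carries none of the paper's guarantees.

```python
import numpy as np

def sensitivity_sample(X, y, m, rng):
    """Sample m rows with probability proportional to a surrogate score; return importance weights."""
    n = X.shape[0]
    norms = np.linalg.norm(X, axis=1)
    s = 1.0 / n + norms / norms.sum()       # crude surrogate sensitivity, illustration only
    p = s / s.sum()
    idx = rng.choice(n, size=m, replace=True, p=p)
    w = 1.0 / (m * p[idx])                  # importance weights
    return X[idx], y[idx], w

def logistic_loss(theta, X, y, w=None):
    """(Weighted) mean logistic loss; labels y are in {-1, +1}."""
    return np.average(np.log1p(np.exp(-y * (X @ theta))), weights=w)

rng = np.random.default_rng(1)
X = rng.normal(size=(10000, 20))
y = np.where(rng.random(10000) < 0.5, -1.0, 1.0)
theta = rng.normal(size=20)

Xc, yc, w = sensitivity_sample(X, y, m=500, rng=rng)
print(logistic_loss(theta, X, y), logistic_loss(theta, Xc, yc, w))   # should be close
```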


Linear Distance Metric Learning with Noisy Labels

arXiv.org Artificial Intelligence

In linear distance metric learning, we are given data in one Euclidean metric space and the goal is to find an appropriate linear map to another Euclidean metric space which respects certain distance conditions as much as possible. In this paper, we formalize a simple and elegant method which reduces to a general continuous convex loss optimization problem, and for different noise models we derive the corresponding loss functions. We show that even if the data is noisy, the ground truth linear metric can be learned to any precision provided access to enough samples, and we provide a corresponding sample complexity bound. Moreover, we present an effective way to truncate the learned model to a low-rank model that can provably maintain the accuracy in the loss function and in the parameters - the first results of this type. Several experimental observations on synthetic and real data sets support and inform our theoretical results.
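
A minimal sketch of the general idea, assuming a squared-error loss on noiseless target distances (the paper's noise models and derived losses are not reproduced here): learn a PSD matrix $M$ by projected gradient descent, then read off a linear map $L$ with $M = LL^\top$.

```python
import numpy as np

def project_psd(M):
    """Project a symmetric matrix onto the PSD cone."""
    vals, vecs = np.linalg.eigh(M)
    return (vecs * np.clip(vals, 0.0, None)) @ vecs.T

def learn_metric(diffs, targets, d, steps=2000, lr=1e-3):
    """Fit M so that (u-v)^T M (u-v) matches target squared distances (diffs = u - v)."""
    M = np.eye(d)
    for _ in range(steps):
        dist2 = np.einsum("nd,de,ne->n", diffs, M, diffs)
        resid = dist2 - targets
        grad = 2 * np.einsum("n,nd,ne->de", resid, diffs, diffs) / len(targets)
        M = project_psd(M - lr * grad)
    return M

rng = np.random.default_rng(2)
d = 5
L_true = rng.normal(size=(d, d))
diffs = rng.normal(size=(2000, d))                                    # pairwise difference vectors
targets = np.einsum("nd,nd->n", diffs @ L_true.T, diffs @ L_true.T)   # noiseless targets here

M = learn_metric(diffs, targets, d)
L_learned = np.linalg.cholesky(M + 1e-9 * np.eye(d))                  # a linear map with M = L L^T
```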


Sketching Multidimensional Time Series for Fast Discord Mining

arXiv.org Artificial Intelligence

Time series discords are a useful primitive for time series anomaly detection, and the matrix profile is capable of capturing discords effectively. Many research efforts aim to improve the scalability of discord discovery with respect to the length of the time series. However, there is surprisingly little work focused on reducing the time complexity of matrix profile computation associated with the dimensionality of a multidimensional time series. In this work, we propose a sketch for discord mining among multi-dimensional time series. After an initial pre-processing of the sketch, which is as fast as reading the data, discord mining has a runtime independent of the dimensionality of the original data. On several real-world examples from water treatment and transportation, the proposed algorithm improves the throughput by at least an order of magnitude (50X) and has only minimal impact on the quality of the approximated solution. Additionally, the proposed method can handle the dynamic addition or deletion of dimensions with inconsequential overhead. This allows a data analyst to consider "what-if" scenarios in real time while exploring the data.
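
To make the sketch-then-mine idea concrete (illustrative only; neither the projection nor the discord search below is the paper's construction), one can project a $D$-dimensional series to a single random combination in one pass and then run a naive matrix-profile-style discord search whose cost no longer depends on $D$.

```python
import numpy as np

def znorm(x):
    return (x - x.mean()) / (x.std() + 1e-8)

def top_discord(ts, m):
    """Index and score of the length-m subsequence with the largest nearest-neighbor distance."""
    n = len(ts) - m + 1
    subs = np.stack([znorm(ts[i:i + m]) for i in range(n)])
    best_idx, best_dist = -1, -1.0
    for i in range(n):
        d = np.linalg.norm(subs - subs[i], axis=1)
        d[max(0, i - m // 2): i + m // 2] = np.inf    # exclusion zone around trivial matches
        nn = d.min()
        if nn > best_dist:
            best_idx, best_dist = i, nn
    return best_idx, best_dist

rng = np.random.default_rng(3)
t = np.arange(2000)
base = np.sin(2 * np.pi * t / 100)                    # shared periodic behavior
T = base[:, None] + 0.1 * rng.normal(size=(2000, 8))  # an 8-dimensional time series
T[900:950] = 0.1 * rng.normal(size=(50, 8))           # anomaly: the pattern drops out
sketch = T @ rng.normal(size=8)                       # one random projection: a 1-d sketch
print(top_discord(sketch, m=100))                     # discord typically found near index 900
```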


An Efficient Content-based Time Series Retrieval System

arXiv.org Artificial Intelligence

A Content-based Time Series Retrieval (CTSR) system is an information retrieval system that lets users interact with time series originating from multiple domains, such as finance, healthcare, and manufacturing. For example, users seeking to learn more about the source of a time series can submit the time series as a query to the CTSR system and retrieve a list of relevant time series with associated metadata. By analyzing the retrieved metadata, users can gather more information about the source of the time series. Because the CTSR system is required to work with time series data from diverse domains, it needs a high-capacity model to effectively measure the similarity between different time series. On top of that, the model within the CTSR system has to compute the similarity scores efficiently, as users interact with the system in real time. In this paper, we propose an effective and efficient CTSR model that outperforms alternative models while still providing reasonable inference runtimes. To demonstrate the capability of the proposed method in solving business problems, we compare it against alternative models using our in-house transaction data. Our findings reveal that the proposed model is the most suitable solution for our transaction data problem.
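
As a point of reference for what a CTSR query looks like (a baseline sketch only, not the proposed model), a catalog can be ranked by a simple z-normalized distance to the query and the metadata of the top hits returned.

```python
import numpy as np

def znorm(x):
    return (x - x.mean()) / (x.std() + 1e-8)

def retrieve(query, catalog, metadata, k=3):
    """Return the metadata and distance of the k catalog series closest to the query."""
    q = znorm(np.asarray(query, dtype=float))
    scores = [np.linalg.norm(q - znorm(np.asarray(s, dtype=float))) for s in catalog]
    order = np.argsort(scores)[:k]
    return [(metadata[i], scores[i]) for i in order]

rng = np.random.default_rng(4)
catalog = [np.sin(np.linspace(0, f, 128)) + 0.1 * rng.normal(size=128) for f in (5, 10, 20)]
metadata = ["domain: finance", "domain: healthcare", "domain: manufacturing"]
print(retrieve(np.sin(np.linspace(0, 10, 128)), catalog, metadata, k=2))
```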


For Kernel Range Spaces a Constant Number of Queries Are Sufficient

arXiv.org Artificial Intelligence

We introduce the notion of an $\varepsilon$-cover for a kernel range space. A kernel range space concerns a set of points $X \subset \mathbb{R}^d$ and the space of all queries by a fixed kernel (e.g., a Gaussian kernel $K(p,\cdot) = \exp(-\|p-\cdot\|^2)$). For a point set $X$ of size $n$, a query returns a vector of values $R_p \in \mathbb{R}^n$, where the $i$th coordinate $(R_p)_i = K(p,x_i)$ for $x_i \in X$. An $\varepsilon$-cover is a subset of points $Q \subset \mathbb{R}^d$ such that for any $p \in \mathbb{R}^d$ we have $\frac{1}{n} \|R_p - R_q\|_1\leq \varepsilon$ for some $q \in Q$. This is a smooth analog of Haussler's notion of $\varepsilon$-covers for combinatorial range spaces (e.g., defined by subsets of points within a ball query), where the resulting vectors $R_p$ are in $\{0,1\}^n$ instead of $[0,1]^n$. The kernel versions of these range spaces show up in data analysis tasks where the coordinates may be uncertain or imprecise, and hence one wishes to add some flexibility to the notion of inside and outside of a query range. Our main result is that, unlike for combinatorial range spaces, the size of kernel $\varepsilon$-covers is independent of the input size $n$ and dimension $d$. We obtain a bound of $(1/\varepsilon)^{\tilde O(1/\varepsilon^2)}$, where $\tilde{O}(f(1/\varepsilon))$ hides log factors in $(1/\varepsilon)$ that can depend on the kernel. This implies that by relaxing the notion of boundaries in range queries, the curse of dimensionality eventually disappears, and it may help explain the success of machine learning in very high dimensions. We also complement this result with a lower bound of almost $(1/\varepsilon)^{\Omega(1/\varepsilon)}$, showing that the exponential dependence on $1/\varepsilon$ is necessary.
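
The definitions above translate directly into a few lines of code; the sketch below (with arbitrary point sets, an arbitrary candidate $Q$, and the Gaussian kernel from the abstract) computes $R_p$ and checks the $\varepsilon$-cover condition for a single query.

```python
import numpy as np

def R(p, X):
    """Query vector: (R_p)_i = K(p, x_i) for the Gaussian kernel K(p, x) = exp(-||p - x||^2)."""
    return np.exp(-np.linalg.norm(X - p, axis=1) ** 2)

def covered(p, Q, X, eps):
    """True if some q in Q satisfies (1/n) * ||R_p - R_q||_1 <= eps."""
    n = len(X)
    return any(np.abs(R(p, X) - R(q, X)).sum() / n <= eps for q in Q)

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 3))     # point set in R^3
Q = rng.normal(size=(50, 3))      # a candidate cover (illustrative only, not a constructed cover)
p = rng.normal(size=3)
print(covered(p, Q, X, eps=0.2))
```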


Practical and Configurable Network Traffic Classification Using Probabilistic Machine Learning

arXiv.org Artificial Intelligence

Network traffic classification that is widely applicable and highly accurate is valuable for many network security and management tasks. A flexible and easily configurable classification framework is ideal, as it can be customized for use in a wide variety of networks. In this paper, we propose a highly configurable and flexible machine learning traffic classification method that relies only on statistics of sequences of packets to distinguish known, or approved, traffic from unknown traffic. Our method is based on likelihood estimation, provides a measure of certainty for classification decisions, and can classify traffic at adjustable certainty levels. Our classification method can also be applied in different classification scenarios, each prioritizing a different classification goal. We demonstrate how our classification scheme and all of its configurations perform well on real-world traffic from a high-performance computing network environment.
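
A hedged sketch of the likelihood-with-threshold idea (the features, the per-class Gaussian model, and the cutoff below are placeholders, not the paper's feature set or model): fit a density per approved class on flow statistics and label a flow unknown unless some class explains it above a configurable likelihood cutoff.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_classes(flows_by_class):
    """Fit a Gaussian (mean, covariance) per approved traffic class."""
    return {c: (np.mean(F, axis=0), np.cov(F, rowvar=False) + 1e-6 * np.eye(F.shape[1]))
            for c, F in flows_by_class.items()}

def classify(x, models, log_lik_cutoff):
    """Return the best-scoring class, or 'unknown' if no class clears the cutoff."""
    scores = {c: multivariate_normal.logpdf(x, mean=mu, cov=S) for c, (mu, S) in models.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= log_lik_cutoff else "unknown"

rng = np.random.default_rng(6)
# Toy "statistics of packet sequences": e.g., mean packet size, mean inter-arrival time.
flows = {"ssh": rng.normal([200, 5], 1.0, size=(500, 2)),
         "http": rng.normal([900, 1], 1.0, size=(500, 2))}
models = fit_classes(flows)
print(classify(np.array([201, 5]), models, log_lik_cutoff=-10.0))
print(classify(np.array([500, 50]), models, log_lik_cutoff=-10.0))   # likely "unknown"
```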


The Kernel Spatial Scan Statistic

arXiv.org Machine Learning

Kulldorff's (1997) seminal paper on spatial scan statistics (SSS) has led to many methods considering different regions of interest, different statistical models, and different approximations, while also having numerous applications in epidemiology, environmental monitoring, and homeland security. SSS provides a way to rigorously test for the existence of an anomaly and provide statistical guarantees as to how "anomalous" that anomaly is. However, these methods rely on defining specific regions where the spatial information a point contributes is limited to a binary 0 or 1, indicating either inside or outside the region, while in reality anomalies tend to follow smooth distributions with density decaying further from an epicenter. In this work, we propose a method that addresses this shortcoming through a continuous scan statistic that generalizes SSS by allowing the point contribution to be defined by a kernel. We provide extensive experimental and theoretical results showing that our methods can be computed efficiently while providing high statistical power for detecting anomalous regions.
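
The shift from binary regions to kernel weights can be sketched as follows; the discrepancy function here is a simple linear surrogate chosen for illustration, not the paper's statistic, and the candidate epicenters and bandwidth are arbitrary.

```python
import numpy as np

def kernel_scan(points, labels, centers, bandwidth):
    """labels: 1 for 'case' points, 0 for baseline; returns the highest-scoring epicenter."""
    best = (None, -np.inf)
    n_case, n_base = labels.sum(), (1 - labels).sum()
    for c in centers:
        # Kernel weight replaces the 0/1 inside/outside indicator of a fixed region.
        w = np.exp(-np.linalg.norm(points - c, axis=1) ** 2 / bandwidth ** 2)
        score = abs(w[labels == 1].sum() / n_case - w[labels == 0].sum() / n_base)
        if score > best[1]:
            best = (c, score)
    return best

rng = np.random.default_rng(7)
pts = rng.uniform(0, 10, size=(1000, 2))
p_case = 0.1 + 0.5 * np.exp(-np.linalg.norm(pts - [7, 7], axis=1) ** 2)   # planted hotspot
labels = (rng.uniform(size=1000) < p_case).astype(int)
center, score = kernel_scan(pts, labels, centers=pts[rng.choice(1000, 50)], bandwidth=1.5)
print(center, score)    # best center is typically near (7, 7)
```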


Relative Error RKHS Embeddings for Gaussian Kernels

arXiv.org Machine Learning

We show how to obliviously embed into the reproducing kernel Hilbert space associated with Gaussian kernels, so that distance in this space (the kernel distance) has only $(1+\varepsilon)$-relative error. This holds only when comparing point sets at a kernel distance of at least $\alpha$; this parameter appears only as a poly-logarithmic factor in the dimension of an intermediate embedding, but not in the final embedding. The main insight is to modify the well-traveled random Fourier features to be slightly biased and have higher variance, but so that they can be defined as a convolution over the function space. This result provides the first guaranteed algorithmic results for LSH of the kernel distance on point sets and low-dimensional shapes and distributions, and for relative error bounds on the kernel two-sample test.
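
For context, here are the standard (unmodified) random Fourier features for the kernel $K(x,y) = \exp(-\|x-y\|^2)$; the paper's biased, higher-variance modification and its relative-error guarantee are not reproduced in this sketch, which only achieves the usual additive error.

```python
import numpy as np

def rff(X, D, rng):
    """Map rows of X to D random Fourier features approximating K(x, y) = exp(-||x - y||^2)."""
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0), size=(d, D))   # spectral density of this kernel is N(0, 2I)
    b = rng.uniform(0.0, 2 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

rng = np.random.default_rng(8)
X = rng.normal(scale=0.3, size=(5, 2))                # nearby low-dimensional points
Z = rff(X, D=4000, rng=rng)

exact = np.exp(-np.linalg.norm(X[0] - X[1]) ** 2)
print(exact, Z[0] @ Z[1])                             # agree up to additive error ~ 1/sqrt(D)
```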


Absolute Orientation for Word Embedding Alignment

arXiv.org Machine Learning

We propose a new technique to align word embeddings which are derived from different source datasets or created using different mechanisms (e.g., GloVe or word2vec). We design a simple, closed-form solution to find the optimal rotation, and optionally scaling, which minimizes the root mean squared error or maximizes the average cosine similarity between two embeddings of the same vocabulary into a space of the same dimension. Our methods extend approaches known as Absolute Orientation, which are popular for aligning objects in three dimensions. We extend them to arbitrary dimensions, show that a simple scaling solution can be derived independently of the rotation, and show that it optimizes cosine similarity. We then demonstrate how to evaluate the similarity of embeddings from different sources or mechanisms, and that certain properties like synonyms and analogies are preserved across the embeddings and can be enhanced by simply aligning and averaging ensembles of embeddings.
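
The closed-form core of Absolute Orientation is the orthogonal Procrustes solution; the sketch below (with illustrative variable names, and without the paper's full averaging pipeline) recovers an orthogonal map and an optional scale via one SVD.

```python
import numpy as np

def align(A, B, scale=False):
    """Orthogonal map R (and scale s) minimizing ||s * A @ R - B||_F for row-aligned A, B."""
    U, S, Vt = np.linalg.svd(A.T @ B)
    R = U @ Vt
    s = S.sum() / (A ** 2).sum() if scale else 1.0
    return R, s

rng = np.random.default_rng(9)
A = rng.normal(size=(1000, 50))                        # embedding 1 (vocabulary x dimension)
R_true = np.linalg.qr(rng.normal(size=(50, 50)))[0]    # hidden orthogonal transform
B = 2.0 * A @ R_true + 0.01 * rng.normal(size=A.shape) # embedding 2: rotated, scaled, noisy copy
R, s = align(A, B, scale=True)
print(np.linalg.norm(s * A @ R - B))                   # small residual after alignment
```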