Gap statistics
Adaptively Robust and Sparse K-means Clustering
Li, Hao, Sugasawa, Shonosuke, Katayama, Shota
While K-means is a standard clustering algorithm, it can be compromised by the presence of outliers and high-dimensional noisy variables. This paper proposes adaptively robust and sparse K-means clustering (ARSK) to address these practical limitations of the standard K-means algorithm. For robustness, we introduce a redundant error component for each observation and penalize this additional parameter with a group-sparse penalty. To accommodate the impact of high-dimensional noisy variables, the objective function is modified by incorporating weights and a penalty controlling the sparsity of the weight vector. The tuning parameters controlling robustness and sparsity are selected by Gap statistics. Through simulation experiments and real data analysis, we demonstrate that the proposed method outperforms existing algorithms at simultaneously identifying clusters free of outliers and the informative variables.
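The Gap-statistic tuning mentioned in the abstract can be illustrated with the classical gap statistic for choosing the number of clusters k: compare the within-cluster dispersion of the data against that of uniform reference data. The sketch below is a simplified analogue in plain NumPy, assuming a uniform-box reference distribution and a basic Lloyd's k-means of my own writing; it is not the ARSK selection procedure itself.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns labels and within-cluster dispersion W."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    W = sum(((X[labels == j] - centers[j]) ** 2).sum() for j in range(k))
    return labels, W

def gap_statistic(X, k_max=6, B=10, seed=0):
    """Gap(k) = E[log W_k(reference)] - log W_k(data), reference = uniform box."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(0), X.max(0)
    gaps = []
    for k in range(1, k_max + 1):
        _, W = kmeans(X, k)
        ref = [np.log(kmeans(rng.uniform(lo, hi, X.shape), k)[1]) for _ in range(B)]
        gaps.append(np.mean(ref) - np.log(W))
    return np.array(gaps)  # one value per k = 1..k_max
```

In practice one picks the k maximizing the gap (or the smallest k within one standard error of the maximum); ARSK applies the same idea to its robustness and sparsity tuning parameters rather than to k.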
Asymptotics for the $k$-means
Clustering is one of the most important unsupervised learning techniques for understanding underlying data structures. The goal is to partition a data set into subsets, called clusters, such that observations within a subset are maximally homogeneous and observations between subsets are maximally heterogeneous. Clustering is usually carried out by specifying a similarity or dissimilarity measure between observations. Examples include the k-means [17, 19, 29, 37], the k-medians [3], the k-modes [5], and the generalized k-means [2, 31, 45], as well as many of their modifications [21, 24, 42]. Among these, the k-means has been considered one of the most straightforward and popular methods since it was proposed sixty years ago [23, 36]. Although it is well known, the investigation of its theoretical properties still lags far behind, leading to difficulties in developing more precise k-means methods in practice. The goal of the present research is to propose a new concept, called clustering consistency, for the asymptotics of the k-means, together with a clustering method that improves on the existing k-means methods adopted by many software packages, including those used in R and Python.
Data-Driven Learning of the Number of States in Multi-State Autoregressive Models
Ding, Jie, Noshad, Mohammad, Tarokh, Vahid
In this work, we consider the class of multi-state autoregressive processes that can be used to model non-stationary time series of interest. To capture the different autoregressive (AR) states underlying an observed time series, it is crucial to select the appropriate number of states. We propose a new model selection technique based on the Gap statistics, which uses a null reference distribution on the stable AR filters to check whether adding a new AR state significantly improves the performance of the model. To that end, we define a new distance measure between AR filters based on the mean squared prediction error (MSPE) and propose an efficient method to generate random stable filters that are uniformly distributed in the coefficient space. Numerical results are provided to evaluate the performance of the proposed approach.
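An MSPE-based distance between AR filters, as described in the abstract, can be sketched by Monte Carlo: simulate a process under one filter and measure the excess one-step prediction error incurred by predicting with the other. This is one plausible reading of such a distance, with helper names of my own choosing; the paper's exact definition may differ.

```python
import numpy as np

def simulate_ar(coeffs, n=5000, burn=500, seed=0):
    """Simulate x_t = sum_i coeffs[i] * x_{t-1-i} + e_t with e_t ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    p = len(coeffs)
    x = np.zeros(n + burn)
    e = rng.standard_normal(n + burn)
    for t in range(p, n + burn):
        x[t] = np.dot(coeffs, x[t - p:t][::-1]) + e[t]
    return x[burn:]  # drop burn-in so the sample is near-stationary

def mspe(x, coeffs):
    """Empirical one-step mean squared prediction error of `coeffs` on series x."""
    p = len(coeffs)
    preds = np.array([np.dot(coeffs, x[t - p:t][::-1]) for t in range(p, len(x))])
    return np.mean((x[p:] - preds) ** 2)

def ar_distance(a, b, n=5000):
    """Excess MSPE of using filter b to predict a process generated by filter a."""
    x = simulate_ar(a, n)
    return mspe(x, b) - mspe(x, a)
```

By construction the distance is zero when the two filters coincide and grows as the candidate filter predicts the generating process more poorly, which is the property the model selection step needs.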
Learning the Number of Autoregressive Mixtures in Time Series Using the Gap Statistics
Ding, Jie, Noshad, Mohammad, Tarokh, Vahid
Using a proper model to characterize a time series is crucial for making accurate predictions. In this work we use a time-varying autoregressive (TVAR) process to describe a non-stationary time series, modeling it as a mixture of multiple stable autoregressive (AR) processes. We introduce a new model selection technique based on Gap statistics to learn the appropriate number of AR filters needed to model a time series. We define a new distance measure between stable AR filters and draw a reference curve used to measure how much adding a new AR filter improves the performance of the model; we then choose the number of AR filters whose gap from the reference curve is largest. To that end, we propose a new method to generate uniform random stable AR filters in the root domain. Numerical results are provided demonstrating the performance of the proposed approach.
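Generating random stable AR filters in the root domain can be sketched as follows: sample the characteristic roots inside the unit disk (complex roots in conjugate pairs so coefficients stay real) and expand the resulting polynomial. This is an illustrative scheme assuming roots drawn uniformly on the disk, not necessarily the authors' exact sampling law.

```python
import numpy as np

def random_stable_ar(p, rng):
    """Draw a stable AR(p) filter by sampling its characteristic roots in the
    open unit disk; complex roots come in conjugate pairs for real coefficients.
    (Sketch under assumed sampling choices, not the paper's exact method.)"""
    roots = []
    while len(roots) < p:
        if p - len(roots) >= 2 and rng.random() < 0.5:
            # conjugate pair, uniform over the unit disk (sqrt for area-uniformity)
            z = np.sqrt(rng.random()) * np.exp(2j * np.pi * rng.random())
            roots += [z, np.conj(z)]
        else:
            roots.append(rng.uniform(-1.0, 1.0))  # real root
    c = np.poly(roots)       # monic: z^p + c[1] z^{p-1} + ... + c[p]
    return -np.real(c[1:])   # AR coefficients a_1..a_p of z^p - a_1 z^{p-1} - ...

rng = np.random.default_rng(0)
a = random_stable_ar(3, rng)
# stability check: all roots of z^p - a_1 z^{p-1} - ... - a_p lie inside the unit circle
assert np.all(np.abs(np.roots(np.r_[1.0, -a])) < 1)
```

Sampling in the root domain makes stability automatic, whereas sampling coefficients directly and rejecting unstable draws becomes wasteful as the order p grows.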