Goto

Collaborating Authors

 Clustering


MaxMin Linear Initialization for Fuzzy C-Means

arXiv.org Machine Learning

Clustering is an extensive research area in data science. The aim of clustering is to discover groups and to identify interesting patterns in datasets. Crisp (hard) clustering considers that each data point belongs to one and only one cluster. However, it is inadequate as some data points may belong to several clusters, as is the case in text categorization. Thus, we need more flexible clustering. Fuzzy clustering methods, where each data point can belong to several clusters, are an interesting alternative. Yet, seeding iterative fuzzy algorithms to achieve high quality clustering is an issue. In this paper, we propose a new linear and efficient initialization algorithm MaxMin Linear to deal with this problem. Then, we validate our theoretical results through extensive experiments on a variety of numerical real-world and artificial datasets. We also test several validity indices, including a new validity index that we propose, Transformed Standardized Fuzzy Difference (TSFD).


Fusion Subspace Clustering: Full and Incomplete Data

arXiv.org Machine Learning

Inferring low-dimensional structures that explain high-dimensional data has become a cornerstone of discovery in virtually all fields of science. Principal component analysis (PCA), which identifies the low-dimensional linear subspace that best explains a dataset, is arguably the most prominent technique for this purpose. However, in many applications -- computer vision, image processing, bioinformatics, linguistics, networks analysis, and more [1-10] -- data is often composed of a mixture of several classes, each of which can be explained with a different subspace. Clustering and inferring subspaces that explain data is an important unsupervised learning problem that has received tremendous 1 attention in recent years, producing theory and algorithms to handle outliers, noisy measurements, privacy concerns, and data constraints, among other difficulties [11-22]. However, one major challenge in contemporary problems is that data is often incomplete. For example, in image inpainting, the values of some pixels are missing due to faulty sensors and image contamination [23]; in computer vision features are often missing due to occlusions and tracking algorithms malfunctions [24]; in recommender systems each user only rates a limited number of items [25]; in a network, most nodes communicate in subsets, producing only a handful of all the possible measurements [7].


Using Feature Grouping as a Stochastic Regularizer for High-Dimensional Noisy Data

arXiv.org Machine Learning

The use of complex models --with many parameters-- is challenging with high-dimensional small-sample problems: indeed, they face rapid overfitting. Such situations are common when data collection is expensive, as in neuroscience, biology, or geology. Dedicated regularization can be crafted to tame overfit, typically via structured penalties. But rich penalties require mathematical expertise and entail large computational costs. Stochastic regularizers such as dropout are easier to implement: they prevent overfitting by random perturbations. Used inside a stochastic optimizer, they come with little additional cost. We propose a structured stochastic regularization that relies on feature grouping. Using a fast clustering algorithm, we define a family of groups of features that capture feature covariations. We then randomly select these groups inside a stochastic gradient descent loop. This procedure acts as a structured regularizer for high-dimensional correlated data without additional computational cost and it has a denoising effect. We demonstrate the performance of our approach for logistic regression both on a sample-limited face image dataset with varying additive noise and on a typical high-dimensional learning problem, brain image classification.


K-medoids Clustering of Data Sequences with Composite Distributions

arXiv.org Machine Learning

This paper studies clustering of data sequences using the k-medoids algorithm. All the data sequences are assumed to be generated from \emph{unknown} continuous distributions, which form clusters with each cluster containing a composite set of closely located distributions (based on a certain distance metric between distributions). The maximum intra-cluster distance is assumed to be smaller than the minimum inter-cluster distance, and both values are assumed to be known. The goal is to group the data sequences together if their underlying generative distributions (which are unknown) belong to one cluster. Distribution distance metrics based k-medoids algorithms are proposed for known and unknown number of distribution clusters. Upper bounds on the error probability and convergence results in the large sample regime are also provided. It is shown that the error probability decays exponentially fast as the number of samples in each data sequence goes to infinity. The error exponent has a simple form regardless of the distance metric applied when certain conditions are satisfied. In particular, the error exponent is characterized when either the Kolmogrov-Smirnov distance or the maximum mean discrepancy are used as the distance metric. Simulation results are provided to validate the analysis.


A Group-Theoretic Approach to Abstraction: Hierarchical, Interpretable, and Task-Free Clustering

arXiv.org Machine Learning

Abstraction plays a key role in concept learning and knowledge discovery. While pervasive in both human and artificial intelligence, it remains mysterious how concepts are abstracted in the first place. We study the nature of abstraction through a group-theoretic approach, formalizing it as a hierarchical, interpretable, and task-free clustering problem. This clustering framework is data-free, feature-free, similarity-free, and globally hierarchical---the four key features that distinguish it from common clustering models. Beyond a theoretical foundation for abstraction, we also present a top-down and a bottom-up approach to establish an algorithmic foundation for practical abstraction-generating methods. Lastly, using both a theoretical explanation and a real-world application, we show that the coupling of our abstraction framework with statistics realizes Shannon's information lattice and even further, brings learning into the picture. This gives a first step towards a principled and cognitive way of automatic concept learning and knowledge discovery.


Call Detail Records Driven Anomaly Detection and Traffic Prediction in Mobile Cellular Networks

arXiv.org Artificial Intelligence

Mobile networks possess information about the users as well as the network. Such information is useful for making the network end-to-end visible and intelligent. Big data analytics can efficiently analyze user and network information, unearth meaningful insights with the help of machine learning tools. Utilizing big data analytics and machine learning, this work contributes in three ways. First, we utilize the call detail records (CDR) data to detect anomalies in the network. For authentication and verification of anomalies, we use k-means clustering, an unsupervised machine learning algorithm. Through effective detection of anomalies, we can proceed to suitable design for resource distribution as well as fault detection and avoidance. Second, we prepare anomaly-free data by removing anomalous activities and train a neural network model. By passing anomaly and anomaly-free data through this model, we observe the effect of anomalous activities in training of the model and also observe mean square error of anomaly and anomaly free data. Lastly, we use an autoregressive integrated moving average (ARIMA) model to predict future traffic for a user. Through simple visualization, we show that anomaly free data better generalizes the learning models and performs better on prediction task.


Understanding V2V Driving Scenarios through Traffic Primitives

arXiv.org Machine Learning

Semantically understanding complex drivers' encountering behavior, wherein two or multiple vehicles are spatially close to each other, does potentially benefit autonomous car's decision-making design. This paper presents a framework of analyzing various encountering behaviors through decomposing driving encounter data into small building blocks, called driving primitives, using nonparametric Bayesian learning (NPBL) approaches, which offers a flexible way to gain an insight into the complex driving encounters without any prerequisite knowledge. The effectiveness of our proposed primitive-based framework is validated based on 976 naturalistic driving encounters, from which more than 4000 driving primitives are learned using NPBL - a sticky HDP-HMM, combined a hidden Markov model (HMM) with a hierarchical Dirichlet process (HDP). After that, a dynamic time warping method integrated with k-means clustering is then developed to cluster all these extracted driving primitives into groups. Experimental results find that there exist 20 kinds of driving primitives capable of representing the basic components of driving encounters in our database. This primitive-based analysis methodology potentially reveals underlying information of vehicle-vehicle encounters for self-driving applications.


Selective Clustering Annotated using Modes of Projections

arXiv.org Machine Learning

Selective clustering annotated using modes of projections (SCAMP) is a new clustering algorithm for data in $\mathbb{R}^p$. SCAMP is motivated from the point of view of non-parametric mixture modeling. Rather than maximizing a classification likelihood to determine cluster assignments, SCAMP casts clustering as a search and selection problem. One consequence of this problem formulation is that the number of clusters is $\textbf{not}$ a SCAMP tuning parameter. The search phase of SCAMP consists of finding sub-collections of the data matrix, called candidate clusters, that obey shape constraints along each coordinate projection. An extension of the dip test of Hartigan and Hartigan (1985) is developed to assist the search. Selection occurs by scoring each candidate cluster with a preference function that quantifies prior belief about the mixture composition. Clustering proceeds by selecting candidates to maximize their total preference score. SCAMP concludes by annotating each selected cluster with labels that describe how cluster-level statistics compare to certain dataset-level quantities. SCAMP can be run multiple times on a single data matrix. Comparison of annotations obtained across iterations provides a measure of clustering uncertainty. Simulation studies and applications to real data are considered. A C++ implementation with R interface is $\href{https://github.com/RGLab/scamp}{available\ online}$.


Semantically Meaningful View Selection

arXiv.org Artificial Intelligence

An understanding of the nature of objects could help robots to solve both high-level abstract tasks and improve performance at lower-level concrete tasks. Although deep learning has facilitated progress in image understanding, a robot's performance in problems like object recognition often depends on the angle from which the object is observed. Traditionally, robot sorting tasks rely on a fixed top-down view of an object. By changing its viewing angle, a robot can select a more semantically informative view leading to better performance for object recognition. In this paper, we introduce the problem of semantic view selection, which seeks to find good camera poses to gain semantic knowledge about an observed object. We propose a conceptual formulation of the problem, together with a solvable relaxation based on clustering. We then present a new image dataset consisting of around 10k images representing various views of 144 objects under different poses. Finally we use this dataset to propose a first solution to the problem by training a neural network to predict a "semantic score" from a top view image and camera pose. The views predicted to have higher scores are then shown to provide better clustering results than fixed top-down views.


Decentralized Task Allocation in Multi-Robot Systems via Bipartite Graph Matching Augmented with Fuzzy Clustering

arXiv.org Artificial Intelligence

Robotic systems, working together as a team, are becoming valuable players in different real-world applications, from disaster response to warehouse fulfillment services. Centralized solutions for coordinating multi-robot teams often suffer from poor scalability and vulnerability to communication disruptions. This paper develops a decentralized multi-agent task allocation (Dec-MATA) algorithm for multi-robot applications. The task planning problem is posed as a maximum-weighted matching of a bipartite graph, the solution of which using the blossom algorithm allows each robot to autonomously identify the optimal sequence of tasks it should undertake. The graph weights are determined based on a soft clustering process, which also plays a problem decomposition role seeking to reduce the complexity of the individual-agents' task assignment problems. To evaluate the new Dec-MATA algorithm, a series of case studies (of varying complexity) are performed, with tasks being distributed randomly over an observable 2D environment. A centralized approach, based on a state-of-the-art MILP formulation of the multi-Traveling Salesman problem is used for comparative analysis. While getting within 7-28% of the optimal cost obtained by the centralized algorithm, the Dec-MATA algorithm is found to be 1-3 orders of magnitude faster and minimally sensitive to task-to-robot ratios, unlike the centralized algorithm.