Goto

Collaborating Authors

 Clustering


Interpret Principal Component Analysis (PCA)

#artificialintelligence

Data can tell us stories. That's what I've been told anyway. As a Data Scientist working for Fortune 300 clients, I deal with tons of data daily, I can tell you that data can tell us stories. You can apply a regression, classification or a clustering algorithm on the data, but feature selection and engineering can be a daunting task. A lot of times, I have seen data scientists take an automated approach to feature selection such as Recursive Feature Elimination (RFE) or leverage Feature Importance algorithms using Random Forest or XGBoost.


A Global Constraint for the Exact Cover Problem: Application to Conceptual Clustering

Journal of Artificial Intelligence Research

We introduce the exactCover global constraint dedicated to the exact cover problem, the goal of which is to select subsets such that each element of a given set belongs to exactly one selected subset. This NP-complete problem occurs in many applications, and we more particularly focus on a conceptual clustering application. We introduce three propagation algorithms for exactCover, called Basic, DL, and DL+: Basic ensures the same level of consistency as arc consistency on a classical decomposition of exactCover into binary constraints, without using any specific data structure; DL ensures the same level of consistency as Basic but uses Dancing Links to efficiently maintain the relation between elements and subsets; and DL+ is a stronger propagator which exploits an extra property to filter more values than DL. We also consider the case where the number of selected subsets is constrained to be equal to a given integer variable k, and we show that this may be achieved either by combining exactCover with existing constraints, or by designing a specific propagator that integrates algorithms designed for the NValues constraint. These different propagators are experimentally evaluated on conceptual clustering problems, and they are compared with state-of-the-art declarative approaches. In particular, we show that our global constraint is competitive with recent ILP and CP models for mono-criterion problems, and it has better scale-up properties for multi-criteria problems.


Autoencoders

arXiv.org Machine Learning

An autoencoder is a specific type of a neural network, which is mainlydesigned to encode the input into a compressed and meaningful representation, andthen decode it back such that the reconstructed input is similar as possible to theoriginal one. This chapter surveys the different types of autoencoders that are mainlyused today. It also describes various applications and use-cases of autoencoders.



An API Oriented Open-source Python Framework for Unsupervised Learning on Graphs

arXiv.org Machine Learning

We present Karate Club a Python framework combining more than 30 state-of-the-art graph mining algorithms which can solve unsupervised machine learning tasks. The primary goal of the package is to make community detection, node and whole graph embedding available to a wide audience of machine learning researchers and practitioners. We designed Karate Club with an emphasis on a consistent application interface, scalability, ease of use, sensible out of the box model behaviour, standardized dataset ingestion, and output generation. This paper discusses the design principles behind this framework with practical examples. We show Karate Club's efficiency with respect to learning performance on a wide range of real world clustering problems, classification tasks and support evidence with regards to its competitive speed.


Sets Clustering

arXiv.org Machine Learning

The input to the \emph{sets-$k$-means} problem is an integer $k\geq 1$ and a set $\mathcal{P}=\{P_1,\cdots,P_n\}$ of sets in $\mathbb{R}^d$. The goal is to compute a set $C$ of $k$ centers (points) in $\mathbb{R}^d$ that minimizes the sum $\sum_{P\in \mathcal{P}} \min_{p\in P, c\in C}\left\| p-c \right\|^2$ of squared distances to these sets. An \emph{$\varepsilon$-core-set} for this problem is a weighted subset of $\mathcal{P}$ that approximates this sum up to $1\pm\varepsilon$ factor, for \emph{every} set $C$ of $k$ centers in $\mathbb{R}^d$. We prove that such a core-set of $O(\log^2{n})$ sets always exists, and can be computed in $O(n\log{n})$ time, for every input $\mathcal{P}$ and every fixed $d,k\geq 1$ and $\varepsilon \in (0,1)$. The result easily generalized for any metric space, distances to the power of $z>0$, and M-estimators that handle outliers. Applying an inefficient but optimal algorithm on this coreset allows us to obtain the first PTAS ($1+\varepsilon$ approximation) for the sets-$k$-means problem that takes time near linear in $n$. This is the first result even for sets-mean on the plane ($k=1$, $d=2$). Open source code and experimental results for document classification and facility locations are also provided.


Deep Inverse Feature Learning: A Representation Learning of Error

arXiv.org Machine Learning

This paper introduces a novel perspective about error in machine learning and proposes inverse feature learning (IFL) as a representation learning approach that learns a set of high-level features based on the representation of error for classification or clustering purposes. The proposed perspective about error representation is fundamentally different from current learning methods, where in classification approaches they interpret the error as a function of the differences between the true labels and the predicted ones or in clustering approaches, in which the clustering objective functions such as compactness are used. Inverse feature learning method operates based on a deep clustering approach to obtain a qualitative form of the representation of error as features. The performance of the proposed IFL method is evaluated by applying the learned features along with the original features, or just using the learned features in different classification and clustering techniques for several data sets. The experimental results show that the proposed method leads to promising results in classification and especially in clustering. In classification, the proposed features along with the primary features improve the results of most of the classification methods on several popular data sets. In clustering, the performance of different clustering methods is considerably improved on different data sets. There are interesting results that show some few features of the representation of error capture highly informative aspects of primary features. We hope this paper helps to utilize the error representation learning in different feature learning domains.


Collaborative Learning of Semi-Supervised Clustering and Classification for Labeling Uncurated Data

arXiv.org Machine Learning

Domain-specific image collections present potential value in various areas of science and business but are often not curated nor have any way to readily extract relevant content. To employ contemporary supervised image analysis methods on such image data, they must first be cleaned and organized, and then manually labeled for the nomenclature employed in the specific domain, which is a time consuming and expensive endeavor. To address this issue, we designed and implemented the Plud system. Plud provides an iterative semi-supervised workflow to minimize the effort spent by an expert and handles realistic large collections of images. We believe it can support labeling datasets regardless of their size and type. Plud is an iterative sequence of unsupervised clustering, human assistance, and supervised classification. With each iteration 1) the labeled dataset grows, 2) the generality of the classification method and its accuracy increases, and 3) manual effort is reduced. We evaluated the effectiveness of our system, by applying it on over a million images documenting human decomposition. In our experiment comparing manual labeling with labeling conducted with the support of Plud, we found that it reduces the time needed to label data and produces highly accurate models for this new domain.


Nearly Optimal Risk Bounds for Kernel K-Means

arXiv.org Machine Learning

In this paper, we study the statistical properties of the kernel $k$-means and obtain a nearly optimal excess risk bound, substantially improving the state-of-art bounds in the existing clustering risk analyses. We further analyze the statistical effect of computational approximations of the Nystr\"{o}m kernel $k$-means, and demonstrate that it achieves the same statistical accuracy as the exact kernel $k$-means considering only $\sqrt{nk}$ Nystr\"{o}m landmark points. To the best of our knowledge, such sharp excess risk bounds for kernel (or approximate kernel) $k$-means have never been seen before.


Inverse Feature Learning: Feature learning based on Representation Learning of Error

arXiv.org Machine Learning

This paper proposes inverse feature learning as a novel supervised feature learning technique that learns a set of high-level features for classification based on an error representation approach. The key contribution of this method is to learn the representation of error as high-level features, while current representation learning methods interpret error by loss functions which are obtained as a function of differences between the true labels and the predicted ones. One advantage of such learning method is that the learned features for each class are independent of learned features for other classes; therefore, this method can learn simultaneously meaning that it can learn new classes without retraining. Error representation learning can also help with generalization and reduce the chance of over-fitting by adding a set of impactful features to the original data set which capture the relationships between each instance and different classes through an error generation and analysis process. This method can be particularly effective in data sets, where the instances of each class have diverse feature representations or the ones with imbalanced classes. The experimental results show that the proposed method results in significantly better performance compared to the state-of-the-art classification techniques for several popular data sets. We hope this paper can open a new path to utilize the proposed perspective of error representation learning in different feature learning domains.