AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

Coresets for Clustering with Fairness Constraints

Huang, Lingxiao, Jiang, Shaofeng H. -C., Vishnoi, Nisheeth K.

arXiv.org Machine LearningJun-20-2019

In a recent work, Chierichetti et al. studied the following "fair" variants of classical clustering problems such as $k$-means and $k$-median: given a set of $n$ data points in $\mathbb{R}^d$ and a binary type associated to each data point, the goal is to cluster the points while ensuring that the proportion of each type in each cluster is roughly the same as its underlying proportion. Subsequent work has focused on either extending this setting to when each data point has multiple, non-disjoint sensitive types such as race and gender, or to address the problem that the clustering algorithms in the above work do not scale well. The main contribution of this paper is an approach to clustering with fairness constraints that involve multiple, non-disjoint types, that is {\em also scalable}. Our approach is based on novel constructions of coresets: for the $k$-median objective, we construct an $\varepsilon$-coreset of size $O(\Gamma k^2 \varepsilon^{-d})$ where $\Gamma$ is the number of distinct collections of groups that a point may belong to, and for the $k$-means objective, we show how to construct an $\varepsilon$-coreset of size $O(\Gamma k^3\varepsilon^{-d-1})$. The former result is the first known coreset construction for the fair clustering problem with the $k$-median objective, and the latter result removes the dependence on the size of the full dataset as in Schmidt et al. and generalizes it to multiple, non-disjoint types. Plugging our coresets into existing algorithms for fair clustering such as Backurs et al. results in the fastest algorithms for several cases. Empirically, we assess our approach over the Adult and Bank dataset, and show that the coreset sizes are much smaller than the full dataset; applying coresets indeed accelerates the running time of computing the fair clustering objective while ensuring that the resulting objective difference is small.

artificial intelligence, fairness constraint, machine learning, (2 more...)

arXiv.org Machine Learning

1906.08484

Genre: Research Report (0.69)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.73)

Add feedback

Constrained Bilinear Factorization Multi-view Subspace Clustering

Zheng, Qinghai, Zhu, Jihua, Tian, Zhiqiang, Li, Zhongyu, Pang, Shanmin, Jia, Xiuyi

arXiv.org Machine LearningJun-19-2019

Multi-view clustering is an important and fundamental problem. Many multi-view subspace clustering methods have been proposed and achieved success in real-world applications, most of which assume that all views share a same coefficient matrix. However, the underlying information of multiview data are not exploited effectively under this assumption, since the coefficient matrices of different views should have the same clustering properties rather than be the same among multiple views. To this end, a novel Constrained Bilinear Factorization Multi-view Subspace Clustering (CBF-MSC) method is proposed in this paper. Specifically, the bilinear factorization with an orthonormality constraint and a low-rank constraint is employed for all coefficient matrices to make all coefficient matrices have the same trace-norm instead of being equivalent, so as to explore the consensus information of multi-view data more effectively. Finally, an algorithm based on the Augmented Lagrangian Multiplier (ALM) scheme with alternating direction minimization is designed to optimize the objective function. Comprehensive experiments tested on six benchmark datasets validate the effectiveness and competitiveness of the proposed approach compared with several state-of-the-art approaches.

artificial intelligence, machine learning, matrix, (13 more...)

arXiv.org Machine Learning

1906.08107

Country:

Asia > China > Jiangsu Province > Nanjing (0.04)
Asia > Middle East > Jordan (0.04)
Asia > China > Shaanxi Province > Xi'an (0.04)

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.49)

Add feedback

Clustering with Fairness Constraints: A Flexible and Scalable Approach

Ziko, Imtiaz Masud, Granger, Eric, Yuan, Jing, Ayed, Ismail Ben

arXiv.org Machine LearningJun-19-2019

This study investigates a general variational formulation of fair clustering, which can integrate fairness constraints with a large class of clustering objectives. Unlike the existing methods, our formulation can impose any desired (target) demographic proportions within each cluster. Furthermore, it enables to control the trade-off between fairness and the clustering objective. We derive an auxiliary function (tight upper bound) of our KL-based fairness penalty via its concave-convex decomposition and Lipschitz-gradient property. Our upper bound can be optimized jointly with various clustering objectives, including both prototype-based such as K-means and graph-based such as Normalized Cut. Interestingly, at each iteration, our general fair-clustering algorithm performs an independent update for each assignment variable, while guaranteeing convergence. Therefore, it can be easily distributed for large-scale data sets. Such scalability is important as it enables to explore different trade-off levels between fairness and clustering objectives. Unlike existing fairness-constrained spectral clustering, our formulation does not need storing an affinity matrix and computing its eigenvalue decomposition. Moreover, unlike existing prototype-based methods, our experiments reveal that fairness does not come at a significant cost of the clustering objective. In fact, several of our tests showed that our fairness penalty helped to avoid weak local minima of the clustering objective (i.e., with fairness, we obtained better clustering objectives). We demonstrate the flexibility and scalability of our algorithm with comprehensive evaluations over both synthetic and real world data sets, many of which are much larger than those used in recent fair-clustering methods.

artificial intelligence, machine learning, objective, (15 more...)

arXiv.org Machine Learning

1906.08207

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

On The Radon--Nikodym Spectral Approach With Optimal Clustering

Malyshkin, Vladislav Gennadievich

arXiv.org Machine LearningJun-19-2019

Problems of interpolation, classification, and clustering are considered. In the tenets of Radon--Nikodym approach $\langle f(\mathbf{x})\psi^2 \rangle / \langle\psi^2\rangle$, where the $\psi(\mathbf{x})$ is a linear function on input attributes, all the answers are obtained from a generalized eigenproblem $|f|\psi^{[i]}\rangle = \lambda^{[i]} |\psi^{[i]}\rangle$. The solution to the interpolation problem is a regular Radon-Nikodym derivative. The solution to the classification problem requires prior and posterior probabilities that are obtained using the Lebesgue quadrature[1] technique. Whereas in a Bayesian approach new observations change only outcome probabilities, in the Radon-Nikodym approach not only outcome probabilities but also the probability space $|\psi^{[i]}\rangle$ change with new observations. This is a remarkable feature of the approach: both the probabilities and the probability space are constructed from the data. The Lebesgue quadrature technique can be also applied to the optimal clustering problem. The problem is solved by constructing a Gaussian quadrature on the Lebesgue measure. A distinguishing feature of the Radon-Nikodym approach is the knowledge of the invariant group: all the answers are invariant relatively any non-degenerated linear transform of input vector $\mathbf{x}$ components. A software product implementing the algorithms of interpolation, classification, and optimal clustering is available from the authors.

artificial intelligence, machine learning, quadrature, (18 more...)

arXiv.org Machine Learning

1906.0046

Country: North America > United States > Wisconsin (0.15)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.66)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.34)

Add feedback

Supervised Hierarchical Clustering with Exponential Linkage

Yadav, Nishant, Kobren, Ari, Monath, Nicholas, McCallum, Andrew

arXiv.org Machine LearningJun-18-2019

In supervised clustering, standard techniques for learning a pairwise dissimilarity function often suffer from a discrepancy between the training and clustering objectives, leading to poor cluster quality. Rectifying this discrepancy necessitates matching the procedure for training the dissimilarity function to the clustering algorithm. In this paper, we introduce a method for training the dissimilarity function in a way that is tightly coupled with hierarchical clustering, in particular single linkage. However, the appropriate clustering algorithm for a given dataset is often unknown. Thus we introduce an approach to supervised hierarchical clustering that smoothly interpolates between single, average, and complete linkage, and we give a training procedure that simultaneously learns a linkage function and a dissimilarity function. We accomplish this with a novel Exponential Linkage function that has a learnable parameter that controls the interpolation. In experiments on four datasets, our joint training procedure consistently matches or outperforms the next best training procedure/linkage function pair and gives up to 8 points improvement in dendrogram purity over discrepant pairs.

artificial intelligence, data mining, machine learning, (14 more...)

arXiv.org Machine Learning

1906.07859

Country:

Asia > Afghanistan > Parwan Province > Charikar (0.05)
North America > United States > Massachusetts > Hampshire County > Amherst (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)
(2 more...)

Genre:

Research Report > New Finding (0.46)
Research Report > Experimental Study (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

From Clustering to Cluster Explanations via Neural Networks

Kauffmann, Jacob, Esders, Malte, Montavon, Grégoire, Samek, Wojciech, Müller, Klaus-Robert

arXiv.org Machine LearningJun-18-2019

A wealth of algorithms have been developed to extract natural cluster structure in data. Identifying this structure is desirable but not always sufficient: We may also want to understand why the data points have been assigned to a given cluster. Clustering algorithms do not offer a systematic answer to this simple question. Hence we propose a new framework that can, for the first time, explain cluster assignments in terms of input features in a comprehensive manner. It is based on the novel theoretical insight that clustering models can be rewritten as neural networks, or 'neuralized'. Predictions of the obtained networks can then be quickly and accurately attributed to the input features. Several showcases demonstrate the ability of our method to assess the quality of learned clusters and to extract novel insights from the analyzed data and representations.

artificial intelligence, data mining, machine learning, (17 more...)

arXiv.org Machine Learning

1906.07633

Country:

Europe (0.68)
North America > United States (0.67)

Genre: Research Report (0.82)

Industry:

Health & Medicine > Therapeutic Area (0.46)
Health & Medicine > Pharmaceuticals & Biotechnology (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Active Learning by Greedy Split and Label Exploration

Herbst, Alyssa, Huang, Bert

arXiv.org Machine LearningJun-17-2019

Annotating large unlabeled datasets can be a major bottleneck for machine learning applications. We introduce a scheme for inferring labels of unlabeled data at a fraction of the cost of labeling the entire dataset. We refer to the scheme as greedy split and label exploration (GSAL). GSAL greedily queries an oracle (or human labeler) and partitions a dataset to find data subsets that have mostly the same label. GSAL can then infer labels by majority vote of the known labels in each subset. GSAL makes the decision to split or label from a subset by maximizing a lower bound on the expected number of correctly labeled examples. GSAL improves upon existing hierarchical labeling schemes by using supervised models to partition the data, therefore avoiding reliance on unsupervised clustering methods that may not accurately group data by label. We design GSAL with strategies to avoid bias that could be introduced through this adaptive partitioning. We evaluate GSAL on labeling of three datasets and find that it outperforms existing strategies for adaptive labeling.

artificial intelligence, dataset, machine learning, (17 more...)

arXiv.org Machine Learning

1906.07046

Country:

North America > United States > Virginia > Montgomery County > Blacksburg (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Dataset shift quantification for credit card fraud detection

Lucas, Yvan, Portier, Pierre-Edouard, Laporte, Léa, Calabretto, Sylvie, He-Guelton, Liyun, Oblé, Frederic, Granitzer, Michael

arXiv.org Artificial IntelligenceJun-17-2019

Machine learning and data mining techniques have been used extensively in order to detect credit card frauds. However purchase behaviour and fraudster strategies may change over time. This phenomenon is named dataset shift or concept drift in the domain of fraud detection. In this paper, we present a method to quantify day-by-day the dataset shift in our face-to-face credit card transactions dataset (card holder located in the shop) . In practice, we classify the days against each other and measure the efficiency of the classification. The more efficient the classification, the more different the buying behaviour between two days, and vice versa. Therefore, we obtain a distance matrix characterizing the dataset shift. After an agglomerative clustering of the distance matrix, we observe that the dataset shift pattern matches the calendar events for this time period (holidays, week-ends, etc). We then incorporate this dataset shift knowledge in the credit card fraud detection task as a new feature. This leads to a small improvement of the detection.

artificial intelligence, machine learning, transaction, (13 more...)

arXiv.org Artificial Intelligence

1906.06977

Genre: Research Report (0.50)

Industry:

Law Enforcement & Public Safety > Fraud (1.00)
Banking & Finance > Credit (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.69)

Add feedback

Topological Data Analysis of Time Series Data for B2B Customer Relationship Management

Rivera-Castro, Rodrigo, Pilyugina, Polina, Pletnev, Alexander, Maksimov, Ivan, Wyz, Wanyi, Burnaev, Evgeny

arXiv.org Machine LearningJun-17-2019

Topological Data Analysis (TDA) is a recent approach to analyze data sets from the perspective of their topological structure. Its use for time series data has been limited to the field of financial time series primarily and as a method for feature generation in machine learning applications. In this work, TDA is presented as a technique to gain additional understanding of the customers' loyalty for business-to-business customer relationship management. Increasing loyalty and strengthening relationships with key accounts remain an active topic of discussion both for researchers and managers. Using two public and two proprietary data sets of commercial data, this research shows that the technique enables analysts to better understand their customer base and identify prospective opportunities. In addition, the approach can be used as a clustering method to increase the accuracy of a predictive model for loyalty scoring. This work thus seeks to introduce TDA as a viable tool for data analysis to the quantitate marketing practitioner.

artificial intelligence, data mining, machine learning, (13 more...)

arXiv.org Machine Learning

1906.03956

Country:

Africa > Central African Republic > Ombella-M'Poko > Bimbo (0.06)
North America > United States > New Jersey > Hudson County > Hoboken (0.04)
Europe > Spain (0.04)
Asia > Vietnam (0.04)

Genre: Research Report (0.64)

Industry: Marketing (0.36)

Technology:

Information Technology > Enterprise Applications > Customer Relationship Management (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.67)

Add feedback

K-Means Explained

#artificialintelligenceJun-16-2019, 17:37:47 GMT

K-Means, a method of vector quantization that is popular for cluster analysis in data mining, is about choosing the number of clusters, selecting the centroids (not necessarily from the dataset) at random K points, assigning each data point to the closest centroid (forming K clusters), computing and placing the new centroids of each cluster, reassigning each data point to the new closest centroid, and keep repeating the last step until no reassignment takes place. WCSS (Within-Cluster-Sum-of-Squares) is calculated to allow choosing the appropriate number of clusters: the minimal WCSS (decreased to a limit) is chosen as the right number of clusters. Once the number of clusters is chosen, centroids are to be selected, and data points to be assigned to the closet centroids. Afterwards, new centroids are being chosen in the middle of each cluster, and data points are being reassigned to the corresponding cluster. P.S.: k-means is used to prevent choosing wrong initial values, centroids leading to clusters not being the most appropriate.

artificial intelligence, centroid, machine learning, (5 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback