Clustering
Coresets for Clustering with Fairness Constraints
Huang, Lingxiao, Jiang, Shaofeng H. -C., Vishnoi, Nisheeth K.
In a recent work, Chierichetti et al. studied the following "fair" variants of classical clustering problems such as $k$-means and $k$-median: given a set of $n$ data points in $\mathbb{R}^d$ and a binary type associated to each data point, the goal is to cluster the points while ensuring that the proportion of each type in each cluster is roughly the same as its underlying proportion. Subsequent work has focused on either extending this setting to when each data point has multiple, non-disjoint sensitive types such as race and gender, or to address the problem that the clustering algorithms in the above work do not scale well. The main contribution of this paper is an approach to clustering with fairness constraints that involve multiple, non-disjoint types, that is {\em also scalable}. Our approach is based on novel constructions of coresets: for the $k$-median objective, we construct an $\varepsilon$-coreset of size $O(\Gamma k^2 \varepsilon^{-d})$ where $\Gamma$ is the number of distinct collections of groups that a point may belong to, and for the $k$-means objective, we show how to construct an $\varepsilon$-coreset of size $O(\Gamma k^3\varepsilon^{-d-1})$. The former result is the first known coreset construction for the fair clustering problem with the $k$-median objective, and the latter result removes the dependence on the size of the full dataset as in Schmidt et al. and generalizes it to multiple, non-disjoint types. Plugging our coresets into existing algorithms for fair clustering such as Backurs et al. results in the fastest algorithms for several cases. Empirically, we assess our approach over the Adult and Bank dataset, and show that the coreset sizes are much smaller than the full dataset; applying coresets indeed accelerates the running time of computing the fair clustering objective while ensuring that the resulting objective difference is small.
Constrained Bilinear Factorization Multi-view Subspace Clustering
Zheng, Qinghai, Zhu, Jihua, Tian, Zhiqiang, Li, Zhongyu, Pang, Shanmin, Jia, Xiuyi
Multi-view clustering is an important and fundamental problem. Many multi-view subspace clustering methods have been proposed and achieved success in real-world applications, most of which assume that all views share a same coefficient matrix. However, the underlying information of multiview data are not exploited effectively under this assumption, since the coefficient matrices of different views should have the same clustering properties rather than be the same among multiple views. To this end, a novel Constrained Bilinear Factorization Multi-view Subspace Clustering (CBF-MSC) method is proposed in this paper. Specifically, the bilinear factorization with an orthonormality constraint and a low-rank constraint is employed for all coefficient matrices to make all coefficient matrices have the same trace-norm instead of being equivalent, so as to explore the consensus information of multi-view data more effectively. Finally, an algorithm based on the Augmented Lagrangian Multiplier (ALM) scheme with alternating direction minimization is designed to optimize the objective function. Comprehensive experiments tested on six benchmark datasets validate the effectiveness and competitiveness of the proposed approach compared with several state-of-the-art approaches.
Clustering with Fairness Constraints: A Flexible and Scalable Approach
Ziko, Imtiaz Masud, Granger, Eric, Yuan, Jing, Ayed, Ismail Ben
This study investigates a general variational formulation of fair clustering, which can integrate fairness constraints with a large class of clustering objectives. Unlike the existing methods, our formulation can impose any desired (target) demographic proportions within each cluster. Furthermore, it enables to control the trade-off between fairness and the clustering objective. We derive an auxiliary function (tight upper bound) of our KL-based fairness penalty via its concave-convex decomposition and Lipschitz-gradient property. Our upper bound can be optimized jointly with various clustering objectives, including both prototype-based such as K-means and graph-based such as Normalized Cut. Interestingly, at each iteration, our general fair-clustering algorithm performs an independent update for each assignment variable, while guaranteeing convergence. Therefore, it can be easily distributed for large-scale data sets. Such scalability is important as it enables to explore different trade-off levels between fairness and clustering objectives. Unlike existing fairness-constrained spectral clustering, our formulation does not need storing an affinity matrix and computing its eigenvalue decomposition. Moreover, unlike existing prototype-based methods, our experiments reveal that fairness does not come at a significant cost of the clustering objective. In fact, several of our tests showed that our fairness penalty helped to avoid weak local minima of the clustering objective (i.e., with fairness, we obtained better clustering objectives). We demonstrate the flexibility and scalability of our algorithm with comprehensive evaluations over both synthetic and real world data sets, many of which are much larger than those used in recent fair-clustering methods.
On The Radon--Nikodym Spectral Approach With Optimal Clustering
Malyshkin, Vladislav Gennadievich
Problems of interpolation, classification, and clustering are considered. In the tenets of Radon--Nikodym approach $\langle f(\mathbf{x})\psi^2 \rangle / \langle\psi^2\rangle$, where the $\psi(\mathbf{x})$ is a linear function on input attributes, all the answers are obtained from a generalized eigenproblem $|f|\psi^{[i]}\rangle = \lambda^{[i]} |\psi^{[i]}\rangle$. The solution to the interpolation problem is a regular Radon-Nikodym derivative. The solution to the classification problem requires prior and posterior probabilities that are obtained using the Lebesgue quadrature[1] technique. Whereas in a Bayesian approach new observations change only outcome probabilities, in the Radon-Nikodym approach not only outcome probabilities but also the probability space $|\psi^{[i]}\rangle$ change with new observations. This is a remarkable feature of the approach: both the probabilities and the probability space are constructed from the data. The Lebesgue quadrature technique can be also applied to the optimal clustering problem. The problem is solved by constructing a Gaussian quadrature on the Lebesgue measure. A distinguishing feature of the Radon-Nikodym approach is the knowledge of the invariant group: all the answers are invariant relatively any non-degenerated linear transform of input vector $\mathbf{x}$ components. A software product implementing the algorithms of interpolation, classification, and optimal clustering is available from the authors.
Supervised Hierarchical Clustering with Exponential Linkage
Yadav, Nishant, Kobren, Ari, Monath, Nicholas, McCallum, Andrew
In supervised clustering, standard techniques for learning a pairwise dissimilarity function often suffer from a discrepancy between the training and clustering objectives, leading to poor cluster quality. Rectifying this discrepancy necessitates matching the procedure for training the dissimilarity function to the clustering algorithm. In this paper, we introduce a method for training the dissimilarity function in a way that is tightly coupled with hierarchical clustering, in particular single linkage. However, the appropriate clustering algorithm for a given dataset is often unknown. Thus we introduce an approach to supervised hierarchical clustering that smoothly interpolates between single, average, and complete linkage, and we give a training procedure that simultaneously learns a linkage function and a dissimilarity function. We accomplish this with a novel Exponential Linkage function that has a learnable parameter that controls the interpolation. In experiments on four datasets, our joint training procedure consistently matches or outperforms the next best training procedure/linkage function pair and gives up to 8 points improvement in dendrogram purity over discrepant pairs.
From Clustering to Cluster Explanations via Neural Networks
Kauffmann, Jacob, Esders, Malte, Montavon, Grรฉgoire, Samek, Wojciech, Mรผller, Klaus-Robert
A wealth of algorithms have been developed to extract natural cluster structure in data. Identifying this structure is desirable but not always sufficient: We may also want to understand why the data points have been assigned to a given cluster. Clustering algorithms do not offer a systematic answer to this simple question. Hence we propose a new framework that can, for the first time, explain cluster assignments in terms of input features in a comprehensive manner. It is based on the novel theoretical insight that clustering models can be rewritten as neural networks, or 'neuralized'. Predictions of the obtained networks can then be quickly and accurately attributed to the input features. Several showcases demonstrate the ability of our method to assess the quality of learned clusters and to extract novel insights from the analyzed data and representations.
Active Learning by Greedy Split and Label Exploration
Annotating large unlabeled datasets can be a major bottleneck for machine learning applications. We introduce a scheme for inferring labels of unlabeled data at a fraction of the cost of labeling the entire dataset. We refer to the scheme as greedy split and label exploration (GSAL). GSAL greedily queries an oracle (or human labeler) and partitions a dataset to find data subsets that have mostly the same label. GSAL can then infer labels by majority vote of the known labels in each subset. GSAL makes the decision to split or label from a subset by maximizing a lower bound on the expected number of correctly labeled examples. GSAL improves upon existing hierarchical labeling schemes by using supervised models to partition the data, therefore avoiding reliance on unsupervised clustering methods that may not accurately group data by label. We design GSAL with strategies to avoid bias that could be introduced through this adaptive partitioning. We evaluate GSAL on labeling of three datasets and find that it outperforms existing strategies for adaptive labeling.
Dataset shift quantification for credit card fraud detection
Lucas, Yvan, Portier, Pierre-Edouard, Laporte, Lรฉa, Calabretto, Sylvie, He-Guelton, Liyun, Oblรฉ, Frederic, Granitzer, Michael
Machine learning and data mining techniques have been used extensively in order to detect credit card frauds. However purchase behaviour and fraudster strategies may change over time. This phenomenon is named dataset shift or concept drift in the domain of fraud detection. In this paper, we present a method to quantify day-by-day the dataset shift in our face-to-face credit card transactions dataset (card holder located in the shop) . In practice, we classify the days against each other and measure the efficiency of the classification. The more efficient the classification, the more different the buying behaviour between two days, and vice versa. Therefore, we obtain a distance matrix characterizing the dataset shift. After an agglomerative clustering of the distance matrix, we observe that the dataset shift pattern matches the calendar events for this time period (holidays, week-ends, etc). We then incorporate this dataset shift knowledge in the credit card fraud detection task as a new feature. This leads to a small improvement of the detection.
Topological Data Analysis of Time Series Data for B2B Customer Relationship Management
Rivera-Castro, Rodrigo, Pilyugina, Polina, Pletnev, Alexander, Maksimov, Ivan, Wyz, Wanyi, Burnaev, Evgeny
Topological Data Analysis (TDA) is a recent approach to analyze data sets from the perspective of their topological structure. Its use for time series data has been limited to the field of financial time series primarily and as a method for feature generation in machine learning applications. In this work, TDA is presented as a technique to gain additional understanding of the customers' loyalty for business-to-business customer relationship management. Increasing loyalty and strengthening relationships with key accounts remain an active topic of discussion both for researchers and managers. Using two public and two proprietary data sets of commercial data, this research shows that the technique enables analysts to better understand their customer base and identify prospective opportunities. In addition, the approach can be used as a clustering method to increase the accuracy of a predictive model for loyalty scoring. This work thus seeks to introduce TDA as a viable tool for data analysis to the quantitate marketing practitioner.
K-Means Explained
K-Means, a method of vector quantization that is popular for cluster analysis in data mining, is about choosing the number of clusters, selecting the centroids (not necessarily from the dataset) at random K points, assigning each data point to the closest centroid (forming K clusters), computing and placing the new centroids of each cluster, reassigning each data point to the new closest centroid, and keep repeating the last step until no reassignment takes place. WCSS (Within-Cluster-Sum-of-Squares) is calculated to allow choosing the appropriate number of clusters: the minimal WCSS (decreased to a limit) is chosen as the right number of clusters. Once the number of clusters is chosen, centroids are to be selected, and data points to be assigned to the closet centroids. Afterwards, new centroids are being chosen in the middle of each cluster, and data points are being reassigned to the corresponding cluster. P.S.: k-means is used to prevent choosing wrong initial values, centroids leading to clusters not being the most appropriate.