Goto

Collaborating Authors

 Clustering


Coresets for Clustering with Fairness Constraints

Neural Information Processing Systems

In a recent work, \cite{chierichetti2017fair} studied the following fair'' variants of classical clustering problems such as k-means and k-median: given a set of n data points in R d and a binary type associated to each data point, the goal is to cluster the points while ensuring that the proportion of each type in each cluster is roughly the same as its underlying proportion. Subsequent work has focused on either extending this setting to when each data point has multiple, non-disjoint sensitive types such as race and gender \cite{bera2019fair}, or to address the problem that the clustering algorithms in the above work do not scale well. The main contribution of this paper is an approach to clustering with fairness constraints that involve {\em multiple, non-disjoint} attributes, that is {\em also scalable}. Our approach is based on novel constructions of coresets: for the k-median objective, we construct an \eps-coreset of size O(\Gamma k 2 \eps {-d}) where \Gamma is the number of distinct collections of groups that a point may belong to, and for the k-means objective, we show how to construct an \eps-coreset of size O(\Gamma k 3\eps {-d-1}). The former result is the first known coreset construction for the fair clustering problem with the k-median objective, and the latter result removes the dependence on the size of the full dataset as in \cite{schmidt2018fair} and generalizes it to multiple, non-disjoint attributes.


Attracting and Dispersing: A Simple Approach for Source-free Domain Adaptation

Neural Information Processing Systems

We propose a simple but effective source-free domain adaptation (SFDA) method. Treating SFDA as an unsupervised clustering problem and following the intuition that local neighbors in feature space should have more similar predictions than other features, we propose to optimize an objective of prediction consistency. This objective encourages local neighborhood features in feature space to have similar predictions while features farther away in feature space have dissimilar predictions, leading to efficient feature clustering and cluster assignment simultaneously. For efficient training, we seek to optimize an upper-bound of the objective resulting in two simple terms. Furthermore, we relate popular existing methods in domain adaptation, source-free domain adaptation and contrastive learning via the perspective of discriminability and diversity.


Sliding Window Algorithms for k-Clustering Problems

Neural Information Processing Systems

The sliding window model of computation captures scenarios in which data is arriving continuously, but only the latest w elements should be used for analysis. The goal is to design algorithms that update the solution efficiently with each arrival rather than recomputing it from scratch. In this work, we focus on k -clustering problems such as k -means and k -median. In this setting, we provide simple and practical algorithms that offer stronger performance guarantees than previous results. Empirically, we show that our methods store only a small fraction of the data, are orders of magnitude faster, and find solutions with costs only slightly higher than those returned by algorithms with access to the full dataset.


Hierarchical Clustering: O(1) -Approximation for Well-Clustered Graphs

Neural Information Processing Systems

Hierarchical clustering studies a recursive partition of a data set into clusters of successively smaller size, and is a fundamental problem in data analysis. In this work we study the cost function for hierarchical clustering introduced by Dasgupta, and present two polynomial-time approximation algorithms: Our first result is an O(1) -approximation algorithm for graphs of high conductance. Our simple construction bypasses complicated recursive routines of finding sparse cuts known in the literature. Our second and main result is an O(1) -approximation algorithm for a wide family of graphs that exhibit a well-defined structure of clusters. This result generalises the previous state-of-the-art, which holds only for graphs generated from stochastic models.


Label consistency in overfitted generalized k -means

Neural Information Processing Systems

We provide theoretical guarantees for label consistency in generalized k -means problems, with an emphasis on the overfitted case where the number of clusters used by the algorithm is more than the ground truth. We provide conditions under which the estimated labels are close to a refinement of the true cluster labels. We consider both exact and approximate recovery of the labels. Our results hold for any constant-factor approximation to the k -means problem. The results are also model-free and only based on bounds on the maximum or average distance of the data points to the true cluster centers.


k-Means Clustering of Lines for Big Data

Neural Information Processing Systems

The input to the \emph{ k -mean for lines} problem is a set L of n lines in \mathbb{R} d, and the goal is to compute a set of k centers (points) in \mathbb{R} d that minimizes the sum of squared distances over every line in L and its nearest center. This is a straightforward generalization of the k -mean problem where the input is a set of n points instead of lines. We suggest the first PTAS that computes a (1 \epsilon) -approximation to this problem in time O(n \log n) for any constant approximation error \epsilon \in (0, 1), and constant integers k, d \geq 1 . This is by proving that there is always a weighted subset (called coreset) of dk {O(k)}\log (n)/\epsilon 2 lines in L that approximates the sum of squared distances from L to \emph{any} given set of k points. Using traditional merge-and-reduce technique, this coreset implies results for a streaming set (possibly infinite) of lines to M machines in one pass (e.g.


Machine Learning for Missing Value Imputation

arXiv.org Artificial Intelligence

In recent times, a considerable number of research studies have been carried out to address the issue of Missing Value Imputation (MVI). MVI aims to provide a primary solution for datasets that have one or more missing attribute values. The advancements in Artificial Intelligence (AI) drive the development of new and improved machine learning (ML) algorithms and methods. The advancements in ML have opened up significant opportunities for effectively imputing these missing values. The main objective of this article is to conduct a comprehensive and rigorous review, as well as analysis, of the state-of-the-art ML applications in MVI methods. This analysis seeks to enhance researchers' understanding of the subject and facilitate the development of robust and impactful interventions in data preprocessing for Data Analytics. The review is performed following the Preferred Reporting Items for Systematic Reviews and Meta-Analysis (PRISMA) technique. More than 100 articles published between 2014 and 2023 are critically reviewed, considering the methods and findings. Furthermore, the latest literature is examined to scrutinize the trends in MVI methods and their evaluation. The accomplishments and limitations of the existing literature are discussed in detail. The survey concludes by identifying the current gaps in research and providing suggestions for future research directions and emerging trends in related fields of interest.


From Logits to Hierarchies: Hierarchical Clustering made Simple

arXiv.org Artificial Intelligence

The structure of many real-world datasets is intrinsically hierarchical, making the modeling of such hierarchies a critical objective in both unsupervised and supervised machine learning. Recently, novel approaches for hierarchical clustering with deep architectures have been proposed. In this work, we take a critical perspective on this line of research and demonstrate that many approaches exhibit major limitations when applied to realistic datasets, partly due to their high computational complexity. In particular, we show that a lightweight procedure implemented on top of pre-trained non-hierarchical clustering models outperforms models designed specifically for hierarchical clustering. Our proposed approach is computationally efficient and applicable to any pre-trained clustering model that outputs logits, without requiring any fine-tuning. To highlight the generality of our findings, we illustrate how our method can also be applied in a supervised setup, recovering meaningful hierarchies from a pre-trained ImageNet classifier.


Linguistically-Informed Multilingual Instruction Tuning: Is There an Optimal Set of Languages to Tune?

arXiv.org Artificial Intelligence

Multilingual language models often perform unevenly across different languages due to limited generalization capabilities for some languages. This issue is significant because of the growing interest in making universal language models that work well for all languages. Instruction tuning with multilingual instruction-response pairs has been used to improve model performance across various languages. However, this approach is challenged by high computational costs, a lack of quality tuning data for all languages, and the "curse of multilinguality" -- the performance drop per language after adding many languages. Recent studies have found that working with datasets with few languages and a smaller number of instances can be beneficial. Yet, there exists no systematic investigation into how choosing different languages affects multilingual instruction tuning. Our study proposes a method to select languages for instruction tuning in a linguistically informed way, aiming to boost model performance across languages and tasks. We use a simple algorithm to choose diverse languages and test their effectiveness on various benchmarks and open-ended questions. Our results show that this careful selection generally leads to better outcomes than choosing languages at random. We suggest a new and simple way of enhancing multilingual models by selecting diverse languages based on linguistic features that could help develop better multilingual systems and guide dataset creation efforts. All resources, including the code for language selection and multilingual instruction tuning, are made available in our official repository at https://github.com/GGLAB-KU/ling-informed-mit enabling reproducibility and further research in this area.


Strongly local p-norm-cut algorithms for semi-supervised learning and local graph clustering

Neural Information Processing Systems

Graph based semi-supervised learning is the problem of learning a labeling function for the graph nodes given a few example nodes, often called seeds, usually under the assumption that the graph's edges indicate similarity of labels. This is closely related to the local graph clustering or community detection problem of finding a cluster or community of nodes around a given seed. For this problem, we propose a novel generalization of random walk, diffusion, or smooth function methods in the literature to a convex p-norm cut function. The need for our p-norm methods is that, in our study of existing methods, we find those principled methods based on eigenvector, spectral, random walk, or linear system often have difficulty capturing the correct boundary of a target label or target cluster. In contrast, 1-norm or maxflow-mincut based methods capture the boundary, but cannot grow from small seed set; hybrid procedures that use both have many hard to set parameters.