Clustering
Robust Finite Mixture Regression for Heterogeneous Targets
Liang, Jian, Chen, Kun, Lin, Ming, Zhang, Changshui, Wang, Fei
Finite Mixture Regression (FMR) refers to the mixture modeling scheme which learns multiple regression models from the training data set. Each of them is in charge of a subset. FMR is an effective scheme for handling sample heterogeneity, where a single regression model is not enough for capturing the complexities of the conditional distribution of the observed samples given the features. In this paper, we propose an FMR model that 1) finds sample clusters and jointly models multiple incomplete mixed-type targets simultaneously, 2) achieves shared feature selection among tasks and cluster components, and 3) detects anomaly tasks or clustered structure among tasks, and accommodates outlier samples. We provide non-asymptotic oracle performance bounds for our model under a high-dimensional learning framework. The proposed model is evaluated on both synthetic and real-world data sets. The results show that our model can achieve state-of-the-art performance.
Representativity Fairness in Clustering
P, Deepak, Abraham, Savitha Sam
Incorporating fairness constructs into machine learning algorithms is a topic of much societal importance and recent interest. Clustering, a fundamental task in unsupervised learning that manifests across a number of web data scenarios, has also been subject of attention within fair ML research. In this paper, we develop a novel notion of fairness in clustering, called representativity fairness. Representativity fairness is motivated by the need to alleviate disparity across objects' proximity to their assigned cluster representatives, to aid fairer decision making. We illustrate the importance of representativity fairness in real-world decision making scenarios involving clustering and provide ways of quantifying objects' representativity and fairness over it. We develop a new clustering formulation, RFKM, that targets to optimize for representativity fairness along with clustering quality. Inspired by the $K$-Means framework, RFKM incorporates novel loss terms to formulate an objective function. The RFKM objective and optimization approach guides it towards clustering configurations that yield higher representativity fairness. Through an empirical evaluation over a variety of public datasets, we establish the effectiveness of our method. We illustrate that we are able to significantly improve representativity fairness at only marginal impact to clustering quality.
Near-Optimal Comparison Based Clustering
Perrot, Michaël, Esser, Pascal Mattia, Ghoshdastidar, Debarghya
The goal of clustering is to group similar objects into meaningful partitions. This process is well understood when an explicit similarity measure between the objects is given. However, far less is known when this information is not readily available and, instead, one only observes ordinal comparisons such as "object i is more similar to j than to k." In this paper, we tackle this problem using a two-step procedure: we estimate a pairwise similarity matrix from the comparisons before using a clustering method based on semi-definite programming (SDP). We theoretically show that our approach can exactly recover a planted clustering using a near-optimal number of passive comparisons. We empirically validate our theoretical findings and demonstrate the good behaviour of our method on real data.
HAMLET: A Hierarchical Agent-based Machine Learning Platform
Esmaeili, Ahmad, Gallagher, John C., Springer, John A., Matson, Eric T.
Hierarchical Multi-Agent Systems provide a convenient and relevant way to analyze, model, and simulate complex systems in which a large number of entities are interacting at different levels of abstraction. In this paper, we introduce HAMLET (Hierarchical Agent-based Machine LEarning plaTform), a platform based on hierarchical multi-agent systems, to facilitate the research and democratization of machine learning entities distributed geographically or locally. This is carried out by firstly modeling the machine learning solutions as a hypergraph and then autonomously setting up a multi-level structure composed of heterogeneous agents based on their innate capabilities and learned skills. HAMLET aids the design and management of machine learning systems and provides analytical capabilities for the research communities to assess the existing and/or new algorithms/datasets through flexible and customizable queries. The proposed platform does not assume restrictions on the type of machine learning algorithms/datasets and is theoretically proven to be sound and complete with polynomial computational requirements. Additionally, it is examined empirically on 120 training and four generalized batch testing tasks performed on 24 machine learning algorithms and 9 standard datasets. The experimental results provided not only establish confidence in the platform's consistency and correctness but also demonstrates its testing and analytical capacity.
Learning Binary Trees via Sparse Relaxation
Zantedeschi, Valentina, Kusner, Matt J., Niculae, Vlad
One of the most classical problems in machine learning is how to learn binary trees that split data into useful partitions. From classification/regression via decision trees to hierarchical clustering, binary trees are useful because they (a) are often easy to visualize; (b) make computationally-efficient predictions; and (c) allow for flexible partitioning. Because of this there has been extensive research on how to learn such trees that generally fall into one of three categories: 1. greedy node-by-node optimization; 2. probabilistic relaxations for differentiability; 3. mixed-integer programs (MIP). Each of these have downsides: greedy can myopically choose poor splits, probabilistic relaxations do not have principled ways to prune trees, MIP methods can be slow on large problems and may not generalize. In this work we derive a novel sparse relaxation for binary tree learning. By deriving a new MIP and sparsely relaxing it, our approach is able to learn tree splits and tree pruning using argmin differentiation. We demonstrate how our approach is easily visualizable and is competitive with current tree-based approaches in classification/regression and hierarchical clustering. Source code is available at http://github.com/vzantedeschi/LatentTrees .
A Critique of Self-Expressive Deep Subspace Clustering
Haeffele, Benjamin D., You, Chong, Vidal, René
Subspace clustering is an unsupervised clustering technique designed to cluster data that is supported on a union of linear subspaces, with each subspace defining a cluster with dimension lower than the ambient space. Many existing formulations for this problem are based on exploiting the self-expressive property of linear subspaces, where any point within a subspace can be represented as linear combination of other points within the subspace. To extend this approach to data supported on a union of non-linear manifolds, numerous studies have proposed learning an appropriate kernel embedding of the original data using a neural network, which is regularized by a self-expressive loss function on the data in the embedded space to encourage a union of linear subspaces prior on the data in the embedded space. Here we show that there are a number of potential flaws with this approach which have not been adequately addressed in prior work. In particular, we show the model formulation is often ill-posed in multiple ways, which can lead to a degenerate embedding of the data, which need not correspond to a union of subspaces at all. We validate our theoretical results experimentally and additionally repeat prior experiments reported in the literature, where we conclude that a significant portion of the previously claimed performance benefits can be attributed to an ad-hoc post processing step rather than the clustering model.
LOGAN: Local Group Bias Detection by Clustering
Machine learning techniques have been widely used in natural language processing (NLP). However, as revealed by many recent studies, machine learning models often inherit and amplify the societal biases in data. Various metrics have been proposed to quantify biases in model predictions. In particular, several of them evaluate disparity in model performance between protected groups and advantaged groups in the test corpus. However, we argue that evaluating bias at the corpus level is not enough for understanding how biases are embedded in a model. In fact, a model with similar aggregated performance between different groups on the entire data may behave differently on instances in a local region. To analyze and detect such local bias, we propose LOGAN, a new bias detection technique based on clustering. Experiments on toxicity classification and object classification tasks show that LOGAN identifies bias in a local region and allows us to better analyze the biases in model predictions.
Ensemble Machine Learning Methods for Modeling COVID19 Deaths
Bathwal, R., Chitta, P., Tirumala, K., Varadarajan, V.
Using a hybrid of machine learning and epidemiological approaches, we propose a novel data-driven approach in predicting US COVID-19 deaths at a county level. The model gives a more complete description of the daily death distribution, outputting quantile-estimates instead of mean deaths, where the model's objective is to minimize the pinball loss on deaths reported by the New York Times coronavirus county dataset. The resulting quantile estimates accurately forecast deaths at an individual-county level for a variable-length forecast period, and the approach generalizes well across different forecast period lengths. We won the Caltech-run modeling competition out of 50+ teams, and our aggregate is competitive with the best COVID-19 modeling systems (on root mean squared error).
"Drunk Man" Saves Our Lives: Route Planning by a Biased Random Walk Mode
Hu, Xinyi, Miao, Quchen, Zhao, Zexuan
Based on the hurricane struking Puerto Rico in 2017, we developed a transportable disaster response system "DroneGo" featuring a drone fleet capable of delivering medical package and videoing roads. Assuming equal weight for both mission, we take the capability of carrying out the former missions as a constraint and a starting point from which reconnaissance routes are built. The feasibility of fitting packages into cargo bay 1 or 2 is tested by genetic algorithm. In scenario where drones carry packages to and unloaded back, from specification of drones and loading weight can we derive the maximum reachable distance of each drone loaded. A k-means clustering algorithm is used for partitioning destinations and deriving centroids as locations of bases.
EGMM: an Evidential Version of the Gaussian Mixture Model for Clustering
Jiao, Lianmeng, Denoeux, Thierry, Liu, Zhun-ga, Pan, Quan
The Gaussian mixture model (GMM) provides a convenient yet principled framework for clustering, with properties suitable for statistical inference. In this paper, we propose a new model-based clustering algorithm, called EGMM (evidential GMM), in the theoretical framework of belief functions to better characterize cluster-membership uncertainty. With a mass function representing the cluster membership of each object, the evidential Gaussian mixture distribution composed of the components over the powerset of the desired clusters is proposed to model the entire dataset. The parameters in EGMM are estimated by a specially designed Expectation-Maximization (EM) algorithm. A validity index allowing automatic determination of the proper number of clusters is also provided. The proposed EGMM is as convenient as the classical GMM, but can generate a more informative evidential partition for the considered dataset. Experiments with synthetic and real datasets demonstrate the good performance of the proposed method as compared with some other prototype-based and model-based clustering techniques.