Clustering
Review for NeurIPS paper: Strongly local p-norm-cut algorithms for semi-supervised learning and local graph clustering
Summary and Contributions: This paper introduces a new class of algorithms for local graph clustering, based on Lp objectives (for p 2) rather than the standard L2 (used in reference works like Anderson-Chang-Lang), or L_infinity (which corresponds to minimum cut). The authors make a clear point that the p 2 regime corresponds to the primal objective which optimizes in the space of flows, whose dual problem corresponds to optimizing an Lq norm (1 q 2) in the space of cuts. For the purpose of this paper, they stick to the cut problem, since it is more naturally connected to the algorithm described here. Essentially, the presented algorithm is a generalization of ACL for Lq norms. It performs a sequence of steps, in each step, it picks a node with excess residual and performs "push" operation.
Review for NeurIPS paper: Strongly local p-norm-cut algorithms for semi-supervised learning and local graph clustering
The authors propose a new algorithm for local graph clustering in general Lp norms. The paper introduces new theoretical results and some interesting new tool as the Cheeger inequality specialized to Lp/Lq. The main limitation of the paper is in the additional assumption made in the paper that are not well-motivated.
Machine Learning-Driven Convergence Analysis in Multijurisdictional Compliance Using BERT and K-Means Clustering
Sonani, Raj, Prayas, Lohalekar
Digital data continues to grow, there has been a shift towards using effective regulatory mechanisms to safeguard personal information. The CCPA of California and the General Data Protection Regulation (GDPR) of the European Union are two of the most important privacy laws. The regulation is intended to safeguard consumer privacy, but it varies greatly in scope, definitions, and methods of enforcement. This paper presents a fresh approach to adaptive compliance, using machine learning and emphasizing natural language processing (NLP) as the primary focus of comparison between the GDPR and CCPA. Using NLP, this study compares various regulations to identify areas where they overlap or diverge. This includes the "right to be forgotten" provision in the GDPR and the "opt-out of sale" provision under CCPA. International companies can learn valuable lessons from this report, as it outlines strategies for better enforcement of laws across different nations. Additionally, the paper discusses the challenges of utilizing NLP in legal literature and proposes methods to enhance the model-ability of machine learning models for studying regulations. The study's objective is to "bridge the gap between legal knowledge and technical expertise" by developing regulatory compliance strategies that are more efficient in operation and more effective in data protection.
Guaranteed Recovery of Unambiguous Clusters
Mazooji, Kayvon, Shomorony, Ilan
Clustering is often a challenging problem because of the inherent ambiguity in what the "correct" clustering should be. Even when the number of clusters $K$ is known, this ambiguity often still exists, particularly when there is variation in density among different clusters, and clusters have multiple relatively separated regions of high density. In this paper we propose an information-theoretic characterization of when a $K$-clustering is ambiguous, and design an algorithm that recovers the clustering whenever it is unambiguous. This characterization formalizes the situation when two high density regions within a cluster are separable enough that they look more like two distinct clusters than two truly distinct clusters in the clustering. The algorithm first identifies $K$ partial clusters (or "seeds") using a density-based approach, and then adds unclustered points to the initial $K$ partial clusters in a greedy manner to form a complete clustering. We implement and test a version of the algorithm that is modified to effectively handle overlapping clusters, and observe that it requires little parameter selection and displays improved performance on many datasets compared to widely used algorithms for non-convex cluster recovery.
On Learning Representations for Tabular Data Distillation
Kang, Inwon, Ram, Parikshit, Zhou, Yi, Samulowitz, Horst, Seneviratne, Oshani
Dataset distillation generates a small set of information-rich instances from a large dataset, resulting in reduced storage requirements, privacy or copyright risks, and computational costs for downstream modeling, though much of the research has focused on the image data modality. We study tabular data distillation, which brings in novel challenges such as the inherent feature heterogeneity and the common use of non-differentiable learning models (such as decision tree ensembles and nearest-neighbor predictors). To mitigate these challenges, we present $\texttt{TDColER}$, a tabular data distillation framework via column embeddings-based representation learning. To evaluate this framework, we also present a tabular data distillation benchmark, ${{\sf \small TDBench}}$. Based on an elaborate evaluation on ${{\sf \small TDBench}}$, resulting in 226,890 distilled datasets and 548,880 models trained on them, we demonstrate that $\texttt{TDColER}$ is able to boost the distilled data quality of off-the-shelf distillation schemes by 0.5-143% across 7 different tabular learning models.
Consistent spectral clustering in sparse tensor block models
Vรคlimaa, Ian, Leskelรค, Lasse
High-order clustering aims to classify objects in multiway datasets that are prevalent in various fields such as bioinformatics, social network analysis, and recommendation systems. These tasks often involve data that is sparse and high-dimensional, presenting significant statistical and computational challenges. This paper introduces a tensor block model specifically designed for sparse integer-valued data tensors. We propose a simple spectral clustering algorithm augmented with a trimming step to mitigate noise fluctuations, and identify a density threshold that ensures the algorithm's consistency. Our approach models sparsity using a sub-Poisson noise concentration framework, accommodating heavier than sub-Gaussian tails. Remarkably, this natural class of tensor block models is closed under aggregation across arbitrary modes. Consequently, we obtain a comprehensive framework for evaluating the tradeoff between signal loss and noise reduction during data aggregation. The analysis is based on a novel concentration bound for sparse random Gram matrices. The theoretical findings are illustrated through simulation experiments.
GPT-HTree: A Decision Tree Framework Integrating Hierarchical Clustering and Large Language Models for Explainable Classification
Pei, Te, Alican, Fuat, Yin, Aaron Ontoyin, Ihlamur, Yigit
Decision trees are fundamental tools in machine learning (ML), prized for their interpretability and simplicity in classification tasks. By providing clear decision paths, they enable users to understand and trust the reasoning behind predictions. However, their effectiveness diminishes when applied to heterogeneous datasets comprising entities with varying characteristics. Uniform decision paths often fail to account for the nuanced differences among diverse segments, leading to oversimplified or misleading classifications. Unsupervised clustering methods, on the other hand, excel in discovering latent structures within complex datasets. These methods, including hierarchical clustering, k-means, and DBSCAN, are powerful tools for segmenting populations into meaningful clusters without requiring predefined labels. While they are effective for uncovering hidden patterns, their primary drawback is a lack of explainability. Clusters produced by unsupervised methods often lack intuitive descriptions or actionable insights, making it difficult to interpret their relevance or apply them in practical decision-making scenarios.
Learning under Commission and Omission Event Outliers
Zhang, Yuecheng, Fang, Guanhua, Yu, Wen
Event stream is an important data format in real life. The events are usually expected to follow some regular patterns over time. However, the patterns could be contaminated by unexpected absences or occurrences of events. In this paper, we adopt the temporal point process framework for learning event stream and we provide a simple-but-effective method to deal with both commission and omission event outliers. In particular, we introduce a novel weight function to dynamically adjust the importance of each observed event so that the final estimator could offer multiple statistical merits. We compare the proposed method with the vanilla one in the classification problems, where event streams can be clustered into different groups. Both theoretical and numerical results confirm the effectiveness of our new approach. To our knowledge, our method is the first one to provably handle both commission and omission outliers simultaneously.
Reviews: q-means: A quantum algorithm for unsupervised machine learning
Response to rebuttal: I think the author responses were done well. I think they have satisfactorily answered the questions that I had raised. The reason I am torn between a strong accept and an accept is: most of the techniques used in this paper have already appeared before in various quantum algorithms and are well-known in the quantum community. Having said that, I think putting together known techniques in a rigorous fashion and also practically implementing their algorithm on a quantum simulator is interesting, especially to a problem which is practically important. I think the final aspect might be of interest to a classical ML community; to see how quantum can provide polynomial speedups to relevant ML problems using a toolbag of interesting techniques.
Reviews: Learning Representations for Time Series Clustering
The submission proposes a model for time-series clustering. The model is a novel combination of several existing components: a) a deep recurrent auto-encoder using dilated RNNs, b) a spectral relaxation of the K-means objective and c) a self-supervision loss to discriminate time-series corrupted by random shuffling from the original ones. The model is evaluated on a common benchmark for time-series clustering and achieves superior performance to existing methods. Overall I feel positive about the proposed method as the quantitative results look promising and using the spectral relaxation of K-means for deep clustering is novel and original. Nevertheless I do have some concerns about the submission in its current form: 1.)