AITopics

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.63)
Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (0.40)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.40)

Neural Information Processing SystemsJan-23-2025, 09:09:30 GMT

Review for NeurIPS paper: Strongly local p-norm-cut algorithms for semi-supervised learning and local graph clustering

The authors propose a new algorithm for local graph clustering in general Lp norms. The paper introduces new theoretical results and some interesting new tool as the Cheeger inequality specialized to Lp/Lq. The main limitation of the paper is in the additional assumption made in the paper that are not well-motivated.

local p-norm-cut algorithm, neurips paper, semi-supervised learning and local graph

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.79)

Sonani, Raj, Prayas, Lohalekar

Machine Learning-Driven Convergence Analysis in Multijurisdictional Compliance Using BERT and K-Means Clustering

Digital data continues to grow, there has been a shift towards using effective regulatory mechanisms to safeguard personal information. The CCPA of California and the General Data Protection Regulation (GDPR) of the European Union are two of the most important privacy laws. The regulation is intended to safeguard consumer privacy, but it varies greatly in scope, definitions, and methods of enforcement. This paper presents a fresh approach to adaptive compliance, using machine learning and emphasizing natural language processing (NLP) as the primary focus of comparison between the GDPR and CCPA. Using NLP, this study compares various regulations to identify areas where they overlap or diverge. This includes the "right to be forgotten" provision in the GDPR and the "opt-out of sale" provision under CCPA. International companies can learn valuable lessons from this report, as it outlines strategies for better enforcement of laws across different nations. Additionally, the paper discusses the challenges of utilizing NLP in legal literature and proposes methods to enhance the model-ability of machine learning models for studying regulations. The study's objective is to "bridge the gap between legal knowledge and technical expertise" by developing regulatory compliance strategies that are more efficient in operation and more effective in data protection.

artificial intelligence, machine learning, natural language, (17 more...)

doi: 10.6084/m9.figshare.28259810

2502.10413

Country:

North America > United States > California (0.26)
North America > United States > New York (0.04)
Europe > Portugal (0.04)

Genre: Research Report (1.00)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.82)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.69)

Mazooji, Kayvon, Shomorony, Ilan

Guaranteed Recovery of Unambiguous Clusters

Clustering is often a challenging problem because of the inherent ambiguity in what the "correct" clustering should be. Even when the number of clusters $K$ is known, this ambiguity often still exists, particularly when there is variation in density among different clusters, and clusters have multiple relatively separated regions of high density. In this paper we propose an information-theoretic characterization of when a $K$-clustering is ambiguous, and design an algorithm that recovers the clustering whenever it is unambiguous. This characterization formalizes the situation when two high density regions within a cluster are separable enough that they look more like two distinct clusters than two truly distinct clusters in the clustering. The algorithm first identifies $K$ partial clusters (or "seeds") using a density-based approach, and then adds unclustered points to the initial $K$ partial clusters in a greedy manner to form a complete clustering. We implement and test a version of the algorithm that is modified to effectively handle overlapping clusters, and observe that it requires little parameter selection and displays improved performance on many datasets compared to widely used algorithms for non-convex cluster recovery.

algorithm, artificial intelligence, machine learning, (18 more...)

2501.13093

Country:

North America > United States > Illinois > Champaign County > Urbana (0.14)
North America > United States > Wisconsin (0.04)
North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Data Science (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.91)

On Learning Representations for Tabular Data Distillation

Kang, Inwon, Ram, Parikshit, Zhou, Yi, Samulowitz, Horst, Seneviratne, Oshani

Dataset distillation generates a small set of information-rich instances from a large dataset, resulting in reduced storage requirements, privacy or copyright risks, and computational costs for downstream modeling, though much of the research has focused on the image data modality. We study tabular data distillation, which brings in novel challenges such as the inherent feature heterogeneity and the common use of non-differentiable learning models (such as decision tree ensembles and nearest-neighbor predictors). To mitigate these challenges, we present $\texttt{TDColER}$, a tabular data distillation framework via column embeddings-based representation learning. To evaluate this framework, we also present a tabular data distillation benchmark, ${{\sf \small TDBench}}$. Based on an elaborate evaluation on ${{\sf \small TDBench}}$, resulting in 226,890 distilled datasets and 548,880 models trained on them, we demonstrate that $\texttt{TDColER}$ is able to boost the distilled data quality of off-the-shelf distillation schemes by 0.5-143% across 7 different tabular learning models.

data mining, distillation method, machine learning, (20 more...)

2501.13905

Country:

Europe > Austria > Vienna (0.14)
Oceania > Australia > Victoria > Melbourne (0.04)
North America > United States > New York > New York County > New York City (0.04)
(4 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Information Technology (0.46)
Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Data Science > Data Mining (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Välimaa, Ian, Leskelä, Lasse

Consistent spectral clustering in sparse tensor block models

High-order clustering aims to classify objects in multiway datasets that are prevalent in various fields such as bioinformatics, social network analysis, and recommendation systems. These tasks often involve data that is sparse and high-dimensional, presenting significant statistical and computational challenges. This paper introduces a tensor block model specifically designed for sparse integer-valued data tensors. We propose a simple spectral clustering algorithm augmented with a trimming step to mitigate noise fluctuations, and identify a density threshold that ensures the algorithm's consistency. Our approach models sparsity using a sub-Poisson noise concentration framework, accommodating heavier than sub-Gaussian tails. Remarkably, this natural class of tensor block models is closed under aggregation across arbitrary modes. Consequently, we obtain a comprehensive framework for evaluating the tradeoff between signal loss and noise reduction during data aggregation. The analysis is based on a novel concentration bound for sparse random Gram matrices. The theoretical findings are illustrated through simulation experiments.

artificial intelligence, machine learning, matrix, (19 more...)

2501.1382

Country:

North America > United States > Washington > King County > Seattle (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.64)

Industry: Information Technology (0.34)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.66)

GPT-HTree: A Decision Tree Framework Integrating Hierarchical Clustering and Large Language Models for Explainable Classification

Pei, Te, Alican, Fuat, Yin, Aaron Ontoyin, Ihlamur, Yigit

Decision trees are fundamental tools in machine learning (ML), prized for their interpretability and simplicity in classification tasks. By providing clear decision paths, they enable users to understand and trust the reasoning behind predictions. However, their effectiveness diminishes when applied to heterogeneous datasets comprising entities with varying characteristics. Uniform decision paths often fail to account for the nuanced differences among diverse segments, leading to oversimplified or misleading classifications. Unsupervised clustering methods, on the other hand, excel in discovering latent structures within complex datasets. These methods, including hierarchical clustering, k-means, and DBSCAN, are powerful tools for segmenting populations into meaningful clusters without requiring predefined labels. While they are effective for uncovering hidden patterns, their primary drawback is a lack of explainability. Clusters produced by unsupervised methods often lack intuitive descriptions or actionable insights, making it difficult to interpret their relevance or apply them in practical decision-making scenarios.

arxiv preprint arxiv, decision tree, language model, (13 more...)

2501.13743

Country:

North America > United States > New York (0.04)
North America > United States > California > San Francisco County > San Francisco (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)

Genre: Research Report (0.50)

Industry:

Information Technology (0.68)
Banking & Finance (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)

arXiv.org Machine LearningJan-23-2025

Learning under Commission and Omission Event Outliers

Zhang, Yuecheng, Fang, Guanhua, Yu, Wen

Event stream is an important data format in real life. The events are usually expected to follow some regular patterns over time. However, the patterns could be contaminated by unexpected absences or occurrences of events. In this paper, we adopt the temporal point process framework for learning event stream and we provide a simple-but-effective method to deal with both commission and omission event outliers. In particular, we introduce a novel weight function to dynamically adjust the importance of each observed event so that the final estimator could offer multiple statistical merits. We compare the proposed method with the vanilla one in the classification problems, where event streams can be clustered into different groups. Both theoretical and numerical results confirm the effectiveness of our new approach. To our knowledge, our method is the first one to provably handle both commission and omission outliers simultaneously.

artificial intelligence, data mining, machine learning, (18 more...)

arXiv.org Machine Learning

2501.13599

Country: Europe > Austria (0.27)

Genre: Research Report > New Finding (0.48)

Industry:

Health & Medicine (0.92)
Information Technology > Services (0.45)
Information Technology > Security & Privacy (0.45)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Neural Information Processing SystemsJan-21-2025, 22:33:33 GMT

Reviews: q-means: A quantum algorithm for unsupervised machine learning

Response to rebuttal: I think the author responses were done well. I think they have satisfactorily answered the questions that I had raised. The reason I am torn between a strong accept and an accept is: most of the techniques used in this paper have already appeared before in various quantum algorithms and are well-known in the quantum community. Having said that, I think putting together known techniques in a rigorous fashion and also practically implementing their algorithm on a quantum simulator is interesting, especially to a problem which is practically important. I think the final aspect might be of interest to a classical ML community; to see how quantum can provide polynomial speedups to relevant ML problems using a toolbag of interesting techniques.

algorithm, quantum algorithm, unsupervised machine, (10 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.36)

Neural Information Processing SystemsJan-21-2025, 19:55:41 GMT

Reviews: Learning Representations for Time Series Clustering

The submission proposes a model for time-series clustering. The model is a novel combination of several existing components: a) a deep recurrent auto-encoder using dilated RNNs, b) a spectral relaxation of the K-means objective and c) a self-supervision loss to discriminate time-series corrupted by random shuffling from the original ones. The model is evaluated on a common benchmark for time-series clustering and achieves superior performance to existing methods. Overall I feel positive about the proposed method as the quantitative results look promising and using the spectral relaxation of K-means for deep clustering is novel and original. Nevertheless I do have some concerns about the submission in its current form: 1.)

deep learning, learning representation, time sery clustering, (3 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.61)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.35)