AITopics

2410.11063

Country:

North America > United States > New York > New York County > New York City (0.14)
North America > United States > California > Alameda County > Berkeley (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
(26 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

MacLellan, Christopher J., Harpstead, Erik, Aleven, Vincent, Koedinger, Kenneth R.

TRESTLE: A Model of Concept Formation in Structured Domains

arXiv.org Artificial IntelligenceOct-14-2024

The literature on concept formation has demonstrated that humans are capable of learning concepts incrementally, with a variety of attribute types, and in both supervised and unsupervised settings. Many models of concept formation focus on a subset of these characteristics, but none account for all of them. In this paper, we present TRESTLE, an incremental account of probabilistic concept formation in structured domains that unifies prior concept learning models. TRESTLE works by creating a hierarchical categorization tree that can be used to predict missing attribute values and cluster sets of examples into conceptually meaningful groups. It updates its knowledge by partially matching novel structures and sorting them into its categorization tree. Finally, the system supports mixed-data representations, including nominal, numeric, relational, and component attributes. We evaluate TRESTLE's performance on a supervised learning task and an unsupervised clustering task. For both tasks, we compare it to a nonincremental model and to human participants. We find that this new categorization model is competitive with the nonincremental approach and more closely approximates human behavior on both tasks. These results serve as an initial demonstration of TRESTLE's capabilities and show that, by taking key characteristics of human learning into account, it can better model behavior than approaches that ignore them.

artificial intelligence, inductive learning, machine learning, (19 more...)

2410.10588

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
(7 more...)

Genre: Research Report > New Finding (0.68)

Industry:

Education (1.00)
Government > Regional Government > North America Government > United States Government (0.67)
Leisure & Entertainment > Games > Computer Games (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.66)

Nagayi, Mayimunah, Nyirenda, Clement

Enhancing Affinity Propagation for Improved Public Sentiment Insights

arXiv.org Artificial IntelligenceOct-12-2024

With the large amount of data generated every day, public sentiment is a key factor for various fields, including marketing, politics, and social research. Understanding the public sentiment about different topics can provide valuable insights. However, most traditional approaches for sentiment analysis often depend on supervised learning, which requires a significant amount of labeled data. This makes it both expensive and time-consuming to implement. This project introduces an approach using unsupervised learning techniques, particularly Affinity Propagation (AP) clustering, to analyze sentiment. AP clustering groups text data based on natural patterns, without needing predefined cluster numbers. The paper compares AP with K-means clustering, using TF-IDF Vectorization for text representation and Principal Component Analysis (PCA) for dimensionality reduction. To enhance performance, AP is combined with Agglomerative Hierarchical Clustering. This hybrid method refines clusters further, capturing both global and local sentiment structures more effectively. The effectiveness of these methods is evaluated using the Silhouette Score, Calinski-Harabasz Score, and Davies-Bouldin Index. Results show that AP with Agglomerative Hierarchical Clustering significantly outperforms K-means. This research contributes to Natural Language Processing (NLP) by proposing a scalable and efficient unsupervised learning framework for sentiment analysis, highlighting the significant societal impact of advanced AI techniques in analyzing public sentiment without the need for extensive labeled data.

artificial intelligence, machine learning, natural language, (16 more...)

2410.12862

Country:

North America > United States > Massachusetts > Suffolk County > Boston (0.04)
North America > United States > California > Los Angeles County > Los Angeles (0.04)
Asia > Middle East > Iraq > Kurdistan Region > Duhok Governorate > Zakho (0.04)
(2 more...)

Genre: Research Report > New Finding (0.88)

Industry: Information Technology > Services (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Leiber, Collin, Strauß, Niklas, Schubert, Matthias, Seidl, Thomas

Dying Clusters Is All You Need -- Deep Clustering With an Unknown Number of Clusters

arXiv.org Artificial IntelligenceOct-12-2024

Finding meaningful groups, i.e., clusters, in high-dimensional data such as images or texts without labeled data at hand is an important challenge in data mining. In recent years, deep clustering methods have achieved remarkable results in these tasks. However, most of these methods require the user to specify the number of clusters in advance. This is a major limitation since the number of clusters is typically unknown if labeled data is unavailable. Thus, an area of research has emerged that addresses this problem. Most of these approaches estimate the number of clusters separated from the clustering process. This results in a strong dependency of the clustering result on the quality of the initial embedding. Other approaches are tailored to specific clustering processes, making them hard to adapt to other scenarios. In this paper, we propose UNSEEN, a general framework that, starting from a given upper bound, is able to estimate the number of clusters. To the best of our knowledge, it is the first method that can be easily combined with various deep clustering algorithms. We demonstrate the applicability of our approach by combining UNSEEN with the popular deep clustering algorithms DCN, DEC, and DKM and verify its effectiveness through an extensive experimental evaluation on several image and tabular datasets. Moreover, we perform numerous ablations to analyze our approach and show the importance of its components. The code is available at: https://github.com/collinleiber/UNSEEN

artificial intelligence, machine learning, unseen, (17 more...)

2410.09491

Country:

Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
Asia > China > Beijing > Beijing (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
(19 more...)

Genre: Research Report > New Finding (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Nkurunziza, Jean Pacifique, Nahayo, Fulgence, Gillis, Nicolas

Orthogonal Nonnegative Matrix Factorization with the Kullback-Leibler divergence

arXiv.org Machine LearningOct-12-2024

Orthogonal nonnegative matrix factorization (ONMF) has become a standard approach for clustering. As far as we know, most works on ONMF rely on the Frobenius norm to assess the quality of the approximation. This paper presents a new model and algorithm for ONMF that minimizes the Kullback-Leibler (KL) divergence. As opposed to the Frobenius norm which assumes Gaussian noise, the KL divergence is the maximum likelihood estimator for Poisson-distributed data, which can model better sparse vectors of word counts in document data sets and photo counting processes in imaging. We develop an algorithm based on alternating optimization, KL-ONMF, and show that it performs favorably with the Frobenius-norm based ONMF for document classification and hyperspectral image unmixing.

divergence, kl-onmf, matrix factorization, (13 more...)

arXiv.org Machine Learning

2410.07786

Country:

Europe > Belgium (0.04)
Africa > Burundi > Bujumbura Mairie > Bujumbura (0.04)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Neural Information Processing SystemsOct-11-2024, 15:22:27 GMT

Recovering Unbalanced Communities in the Stochastic Block Model with Application to Clustering with a Faulty Oracle

The stochastic block model (SBM) is a fundamental model for studying graph clustering or community detection in networks. It has received great attention in the last decade and the balanced case, i.e., assuming all clusters have large size, has been well studied. However, our understanding of SBM with unbalanced communities (arguably, more relevant in practice) is still limited. In this paper, we provide a simple SVD-based algorithm for recovering the communities in the SBM with communities of varying sizes.We improve upon a result of Ailon, Chen and Xu [ICML 2013; JMLR 2015] by removing the assumption that there is a large interval such that the sizes of clusters do not fall in, and also remove the dependency of the size of the recoverable clusters on the number of underlying clusters. We further complement our theoretical improvements with experimental comparisons.Under the planted clique conjecture, the size of the clusters that can be recovered by our algorithm is nearly optimal (up to poly-logarithmic factors) when the probability parameters are constant. As a byproduct, we obtain an efficient clustering algorithm with sublinear query complexity in a faulty oracle model, which is capable of detecting all clusters larger than \tilde{\Omega}({\sqrt{n}}), even in the presence of \Omega(n) small clusters in the graph.

algorithm, stochastic block model, unbalanced community, (8 more...)

Technology:

Information Technology > Data Science > Data Mining (0.61)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.61)

Neural Information Processing SystemsOct-11-2024, 13:34:02 GMT

No Representation Rules Them All in Category Discovery

In this paper we tackle the problem of Generalized Category Discovery (GCD). Specifically, given a dataset with labelled and unlabelled images, the task is to cluster all images in the unlabelled subset, whether or not they belong to the labelled categories. Our first contribution is to recognise that most existing GCD benchmarks only contain labels for a single clustering of the data, making it difficult to ascertain whether models are leveraging the available labels to solve the GCD task, or simply solving an unsupervised clustering problem. As such, we present a synthetic dataset, named'Clevr-4', for category discovery. Clevr-4 contains four equally valid partitions of the data, i.e based on object'shape', 'texture' or'color' or'count'.

category discovery, clevr-4, dataset, (2 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.64)

Neural Information Processing SystemsOct-11-2024, 12:19:34 GMT

k -Means Clustering with Distance-Based Privacy

In this paper, we initiate the study of Euclidean clustering with Distance-based privacy. Distance-based privacy is motivated by the fact that it is often only needed to protect the privacy of exact, rather than approximate, locations. We provide constant-approximate algorithms for k -means and k -median clustering, with additive error depending only on the attacker's precision bound \rho, rather than the radius \Lambda of the space. In addition, we empirically demonstrate that our algorithm performs significantly better than previous differentially private clustering algorithms, as well as naive distance-based private clustering baselines.

algorithm, clustering, distance-based privacy

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.78)

Neural Information Processing SystemsOct-11-2024, 05:24:55 GMT

Improved Guarantees for k-means++ and k-means++ Parallel

In this paper, we study k-means and k-means, the two most popular algorithms for the classic k-means clustering problem. We provide novel analyses and show improved approximation and bi-criteria approximation guarantees for k-means and k-means . Our results give a better theoretical justification for why these algorithms perform extremely well in practice.

improved guarantee

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.80)

Neural Information Processing SystemsOct-11-2024, 02:11:18 GMT

From Trees to Continuous Embeddings and Back: Hyperbolic Hierarchical Clustering

Similarity-based Hierarchical Clustering (HC) is a classical unsupervised machine learning algorithm that has traditionally been solved with heuristic algorithms like Average-Linkage. Recently, Dasgupta reframed HC as a discrete optimization problem by introducing a global cost function measuring the quality of a given tree. In this work, we provide the first continuous relaxation of Dasgupta's discrete optimization problem with provable quality guarantees. The key idea of our method, HypHC, is showing a direct correspondence from discrete trees to continuous representations (via the hyperbolic embeddings of their leaf nodes) and back (via a decoding algorithm that maps leaf embeddings to a dendrogram), allowing us to search the space of discrete binary trees with continuous optimization. Building on analogies between trees and hyperbolic space, we derive a continuous analogue for the notion of lowest common ancestor, which leads to a continuous relaxation of Dasgupta's discrete objective.

continuous embedding and back, hierarchical clustering, hyperbolic hierarchical clustering, (6 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.64)