AITopics

2309.01301

Country:

Europe > Russia (0.14)
Asia > Russia (0.14)
Europe > Italy (0.04)
(25 more...)

Genre:

Research Report (1.00)
Workflow (0.68)

Industry:

Leisure & Entertainment > Sports (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Government > Regional Government > North America Government > United States Government (0.67)
(2 more...)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.86)

arXiv.org Machine Learning, Sep-30-2023

Normalised clustering accuracy: An asymmetric external cluster validity measure

Gagolewski, Marek

There is no, nor will there ever be, a single best clustering algorithm, but we would still like to be able to distinguish between methods that work well on certain task types and those that systematically underperform. Clustering algorithms are traditionally evaluated using either internal or external validity measures. Internal measures quantify different aspects of the obtained partitions, e.g., the average degree of cluster compactness or point separability. Yet their validity is questionable, because the clusterings they promote can sometimes be meaningless. External measures, on the other hand, compare the algorithms' outputs to reference ground-truth groupings provided by experts. In this paper, we argue that the commonly used classical partition similarity scores, such as the normalised mutual information, Fowlkes-Mallows, or adjusted Rand index, miss some desirable properties, e.g., they do not identify worst-case scenarios correctly or are not easily interpretable. This makes comparing clustering algorithms across many benchmark datasets difficult. To remedy these issues, we propose and analyse a new measure: a version of the optimal set-matching accuracy, which is normalised, monotonic, scale invariant, and corrected for the imbalancedness of cluster sizes (but neither symmetric nor adjusted for chance).
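The abstract above does not state the formula of the proposed measure, so the following is only a minimal sketch of one way an optimal set-matching accuracy can be normalised and corrected for cluster sizes. The function names, the per-reference-cluster averaging, and the rescaling by 1/k are assumptions made for illustration, not the paper's definition. The sketch assumes k >= 2 clusters labelled 0..k-1 and no empty reference cluster, and it uses brute-force search over matchings, which is only practical for small k.

```python
from itertools import permutations


def confusion_matrix(reference, predicted, k):
    """k x k counts: rows are reference clusters, columns are predicted ones."""
    counts = [[0] * k for _ in range(k)]
    for r, p in zip(reference, predicted):
        counts[r][p] += 1
    return counts


def normalised_set_matching_accuracy(reference, predicted, k):
    """Hypothetical score: 1 for a perfect match, 0 when all points share one label."""
    counts = confusion_matrix(reference, predicted, k)
    row_sums = [sum(row) for row in counts]
    best = 0.0
    # Try every one-to-one matching of reference clusters to predicted clusters.
    for sigma in permutations(range(k)):
        # Average share of each reference cluster recovered by its matched cluster;
        # dividing by the row sum gives small and large clusters equal weight.
        score = sum(counts[i][sigma[i]] / row_sums[i] for i in range(k)) / k
        best = max(best, score)
    # Rescale so that the uniform baseline 1/k maps to 0 and a perfect match to 1.
    return (best - 1.0 / k) / (1.0 - 1.0 / k)
```

Because each row is divided by the size of its reference cluster, swapping the two arguments generally changes the result, which is consistent with the abstract's remark that the measure is not symmetric.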

artificial intelligence, data mining, machine learning, (20 more...)

2209.02935

Country:

Europe > Poland > Masovia Province > Warsaw (0.04)
Oceania > Australia (0.04)
North America > United States > California > Orange County > Irvine (0.04)
Europe > Germany > Baden-Württemberg > Karlsruhe Region > Karlsruhe (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Andriantsiory, Dina Faneva, Coti, Camille, Geloun, Joseph Ben, Lebbah, Mustapha

Parallel Computation of Multi-Slice Clustering of Third-Order Tensors

arXiv.org Artificial Intelligence, Sep-29-2023

Machine learning approaches such as clustering methods must handle massive datasets, which presents a growing challenge. We devise parallel algorithms to compute the Multi-Slice Clustering (MSC) for 3rd-order tensors. The MSC method is based on spectral analysis of the tensor slices and works independently on each tensor mode. These features fit well in the parallel paradigm via a distributed memory system. We show that our parallel scheme outperforms sequential computing and allows the MSC method to scale.

algorithm, computation, dataset, (14 more...)

2309.17383

Country:

North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
North America > United States > New York (0.04)
North America > Canada > Quebec > Montreal (0.04)
(3 more...)

Genre: Research Report (0.41)

Industry: Health & Medicine (0.94)

Technology:

Information Technology > Architecture > Distributed Systems (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.89)

Venkataramanan, Aishwarya, Benbihi, Assia, Laviale, Martin, Pradalier, Cedric

Gaussian Latent Representations for Uncertainty Estimation using Mahalanobis Distance in Deep Classifiers

arXiv.org Artificial Intelligence, Sep-29-2023

Recent works show that the data distribution in a network's latent space is useful for estimating classification uncertainty and detecting out-of-distribution (OOD) samples. To obtain a well-regularized latent space that is conducive to uncertainty estimation, existing methods introduce significant changes to model architectures and training procedures. In this paper, we present a lightweight, fast, and high-performance regularization method for Mahalanobis distance-based uncertainty prediction that requires minimal changes to the network's architecture. To derive Gaussian latent representations favourable for Mahalanobis distance calculation, we introduce a self-supervised representation learning method that separates in-class representations into multiple Gaussians. Classes with non-Gaussian representations are automatically identified and dynamically clustered into multiple new classes that are approximately Gaussian. Evaluation on standard OOD benchmarks shows that our method achieves state-of-the-art results on OOD detection with minimal inference time, and is very competitive on predictive probability calibration. Finally, we show the applicability of our method to a real-life computer vision use case on microorganism classification.
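To illustrate the generic idea of Mahalanobis distance-based uncertainty that the abstract above builds on, here is a minimal sketch that fits one Gaussian per class in a latent space and scores a sample by its squared distance to the nearest class. It does not implement the paper's regularization or self-supervised method. The two-dimensional latent space, the function names, and the use of the minimum distance as the score are assumptions chosen to keep the example self-contained.

```python
def mean_and_covariance(points):
    """Sample mean and unbiased covariance (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance, using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy  # assumed non-zero (non-degenerate class)
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det


def ood_score(x, class_stats):
    """Distance to the closest class Gaussian; larger means more unusual."""
    return min(mahalanobis_sq(x, m, c) for m, c in class_stats.values())


# Hypothetical latent vectors for two classes.
latents = {
    "a": [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)],
    "b": [(10.0, 10.0), (12.0, 10.0), (10.0, 12.0), (12.0, 12.0)],
}
class_stats = {name: mean_and_covariance(pts) for name, pts in latents.items()}
```

A sample lying between the two classes, such as (6.0, 6.0), receives a higher score than one near a class mean, which is the behaviour a distance-based OOD detector relies on.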

detection, ood detection, representation, (16 more...)

2305.13849

Country:

Europe > France (0.04)
Oceania > New Zealand > North Island > Auckland Region > Auckland (0.04)
Europe > Czechia > Prague (0.04)

Genre: Research Report (0.64)

Industry: Information Technology (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
(3 more...)

arXiv.org Machine Learning, Sep-29-2023

Statistically Optimal K-means Clustering via Nonnegative Low-rank Semidefinite Programming

Zhuang, Yubo, Chen, Xiaohui, Yang, Yun, Zhang, Richard Y.

$K$-means clustering is a widely used machine learning method for identifying patterns in large datasets. Semidefinite programming (SDP) relaxations that enjoy strong statistical optimality guarantees have recently been proposed for the $K$-means optimization problem, but the prohibitive cost of implementing an SDP solver puts these guarantees out of reach for practical datasets. By contrast, nonnegative matrix factorization (NMF) is a simple clustering algorithm that is widely used by machine learning practitioners, but it lacks a solid statistical underpinning and rigorous guarantees. In this paper, we describe an NMF-like algorithm that works by solving a nonnegative low-rank restriction of the SDP-relaxed $K$-means formulation using a nonconvex Burer-Monteiro factorization approach. The resulting algorithm is just as simple and scalable as state-of-the-art NMF algorithms, while also enjoying the same strong statistical optimality guarantees as the SDP. In our experiments, we observe that our algorithm achieves substantially smaller mis-clustering errors than the existing state of the art.

algorithm, artificial intelligence, machine learning, (17 more...)

2305.18436

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > Illinois (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
(2 more...)

Genre: Research Report > New Finding (0.66)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

arXiv.org Artificial Intelligence, Sep-28-2023

Attribute Graph Clustering via Learnable Augmentation

Yang, Xihong, Liu, Yue, Liang, Ke, Zhou, Sihang, Liu, Xinwang, Zhu, En

Contrastive deep graph clustering (CDGC) utilizes contrastive learning to group nodes into different clusters. Better augmentation techniques improve the quality of the contrastive samples and are thus one of the key factors in improving performance. However, the augmented samples in existing methods are always predefined based on human experience and agnostic to the downstream clustering task, leading to high human resource costs and poor performance. To this end, we propose an Attribute Graph Clustering method via Learnable Augmentation (AGCLA), which introduces learnable augmentors that produce high-quality, suitable augmented samples for CDGC. Specifically, we design two learnable augmentors, one for attribute information and one for structure information. In addition, two refinement matrices, the high-confidence pseudo-label matrix and the cross-view sample similarity matrix, are generated to improve the reliability of the learned affinity matrix. During training, we notice that the optimization goals for the learnable augmentors and the contrastive learning networks differ. In other words, we should guarantee both the consistency of the embeddings and the diversity of the augmented samples. Thus, an adversarial learning mechanism is designed in our method. Moreover, a two-stage training strategy is used for the high-confidence refinement matrices. Extensive experimental results demonstrate the effectiveness of AGCLA on six benchmark datasets.

augmentation, graph, matrix, (15 more...)

2212.03559

Genre: Research Report > New Finding (0.48)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Beretta, Lorenzo, Cohen-Addad, Vincent, Lattanzi, Silvio, Parotsidis, Nikos

Multi-Swap $k$-Means++

arXiv.org Artificial Intelligence, Sep-28-2023

The $k$-means++ algorithm of Arthur and Vassilvitskii (SODA 2007) is often the algorithm of choice among practitioners for optimizing the popular $k$-means clustering objective and is known to give an $O(\log k)$-approximation in expectation. To obtain higher quality solutions, Lattanzi and Sohler (ICML 2019) proposed augmenting $k$-means++ with $O(k \log \log k)$ local search steps obtained through the $k$-means++ sampling distribution to yield a $c$-approximation to the $k$-means clustering problem, where $c$ is a large absolute constant. Here we generalize and extend their local search algorithm by considering larger and more sophisticated local search neighborhoods, allowing multiple centers to be swapped at the same time. Our algorithm achieves a $9 + \varepsilon$ approximation ratio, which is the best possible for local search. Importantly, we show that our approach yields substantial practical improvements: we observe significant quality improvements over the approach of Lattanzi and Sohler (ICML 2019) on several datasets.
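The two building blocks named in the abstract above, $k$-means++ seeding and a local search step driven by the same sampling distribution, can be sketched as follows. This is a simplified single-swap step in the spirit of the earlier local search approach, not the multi-swap algorithm of this paper. The function names, the tuple representation of points, and the rule of accepting the best cost-reducing swap are assumptions made for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cost(points, centers):
    """k-means objective: sum of squared distances to the nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def d2_sample(points, centers, rng):
    """Draw a point with probability proportional to its squared distance
    to the nearest current center (the k-means++ sampling distribution)."""
    weights = [min(sq_dist(p, c) for c in centers) for p in points]
    if sum(weights) == 0:  # every point already coincides with a center
        return rng.choice(points)
    return rng.choices(points, weights=weights, k=1)[0]


def kmeanspp_seed(points, k, rng):
    """Pick the first center uniformly, then the rest by D^2 sampling."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        centers.append(d2_sample(points, centers, rng))
    return centers


def local_search_step(points, centers, rng):
    """Sample one candidate and swap it for the center whose replacement
    lowers the cost the most; keep the old centers if no swap helps."""
    candidate = d2_sample(points, centers, rng)
    best, best_cost = centers, cost(points, centers)
    for i in range(len(centers)):
        trial = centers[:i] + [candidate] + centers[i + 1:]
        trial_cost = cost(points, trial)
        if trial_cost < best_cost:
            best, best_cost = trial, trial_cost
    return best
```

By construction, a step never increases the objective, so it can be repeated for a fixed budget of iterations after seeding. A multi-swap variant would consider exchanging several sampled candidates for several centers at once.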

algorithm, iteration, swap, (15 more...)

2309.16384

Country:

Asia > Afghanistan > Parwan Province > Charikar (0.04)
North America > United States > Maryland > Baltimore (0.04)
North America > United States > Arizona > Maricopa County > Phoenix (0.04)
(8 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.48)

Ismail-Fawaz, Ali, Fawaz, Hassan Ismail, Petitjean, François, Devanne, Maxime, Weber, Jonathan, Berretti, Stefano, Webb, Geoffrey I., Forestier, Germain

ShapeDBA: Generating Effective Time Series Prototypes using ShapeDTW Barycenter Averaging

arXiv.org Artificial Intelligence, Sep-28-2023

Time series data can be found in almost every domain, ranging from the medical field to manufacturing and wireless communication. Generating realistic and useful exemplars and prototypes is a fundamental data analysis task. In this paper, we investigate a novel approach to generating such exemplars and prototypes for time series data, based on a new form of time series average, the ShapeDTW Barycentric Average. Existing time series prototyping approaches, such as DTW Barycenter Averaging (DBA) and SoftDBA, rely on the Dynamic Time Warping (DTW) similarity measure. These approaches share a common problem: they generate out-of-distribution artifacts in their prototypes. This is mostly caused by the DTW variant used, which detects absolute similarities and is unable to detect neighborhood similarities. Our proposed method, ShapeDBA, uses the ShapeDTW variant of DTW, which overcomes this issue. We chose time series clustering, a popular form of time series analysis, to evaluate the outcome of ShapeDBA against the other prototyping approaches. Coupled with the k-means clustering algorithm and evaluated on a total of 123 datasets from the UCR archive, our proposed averaging approach achieves new state-of-the-art results in terms of Adjusted Rand Index.
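The contrast the abstract above draws between pointwise DTW and neighborhood-aware ShapeDTW can be sketched in a few lines. This sketch covers only the two distances, not the barycenter averaging itself. The raw-subsequence descriptor, the `reach` parameter, edge padding by repetition, and the squared-difference cost are assumptions chosen for illustration.

```python
def dtw(a, b, dist):
    """Dynamic-programming DTW cost between sequences a and b, where
    dist(x, y) is the cost of aligning element x with element y."""
    inf = float("inf")
    n, m = len(a), len(b)
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
            acc[i][j] = dist(a[i - 1], b[j - 1]) + step
    return acc[n][m]


def sq(x, y):
    """Squared difference between two scalar values."""
    return (x - y) ** 2


def sq_vec(u, v):
    """Squared Euclidean distance between two equal-length descriptors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def descriptors(series, reach):
    """For each point, the window of 2 * reach + 1 values centred on it,
    with the series padded at both ends by repeating the edge values."""
    padded = [series[0]] * reach + list(series) + [series[-1]] * reach
    return [tuple(padded[i:i + 2 * reach + 1]) for i in range(len(series))]


def shape_dtw(a, b, reach=1):
    """DTW computed on neighborhood descriptors instead of single values."""
    return dtw(descriptors(a, reach), descriptors(b, reach), sq_vec)
```

With `reach=0` the descriptors reduce to single values and `shape_dtw` coincides with plain `dtw`. With a larger reach, two points are aligned cheaply only when their surroundings also look alike.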

algorithm, shapedba, time series, (13 more...)

2309.16353

Country:

Europe > Italy > Tuscany > Florence (0.04)
Europe > France > Grand Est > Bas-Rhin > Strasbourg (0.04)

Genre:

Research Report > Promising Solution (0.54)
Research Report > Experimental Study (0.49)

Industry: Health & Medicine (0.48)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Melikechi, Omar, Dunson, David B.

Ellipsoid fitting with the Cayley transform

arXiv.org Machine Learning, Sep-27-2023

We introduce Cayley transform ellipsoid fitting (CTEF), an algorithm that uses the Cayley transform to fit ellipsoids to noisy data in any dimension. Unlike many ellipsoid fitting methods, CTEF is ellipsoid specific, meaning it always returns elliptic solutions, and can fit arbitrary ellipsoids. It also significantly outperforms other fitting methods when data are not uniformly distributed over the surface of an ellipsoid. Inspired by growing calls for interpretable and reproducible methods in machine learning, we apply CTEF to dimension reduction, data visualization, and clustering in the context of cell cycle and circadian rhythm data and several classical toy examples. Since CTEF captures global curvature, it extracts nonlinear features in data that other machine learning methods fail to identify. For example, on the clustering examples CTEF outperforms 10 popular algorithms.

artificial intelligence, ellipsoid, machine learning, (17 more...)

2304.1063

Country:

North America > United States > North Carolina > Durham County > Durham (0.04)
Asia > Middle East > Israel (0.04)
North America > United States > Indiana (0.04)
Europe > Italy > Tuscany > Florence (0.04)

Genre: Research Report > New Finding (0.93)

Industry: Health & Medicine > Therapeutic Area (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)

Lai, Songning, Li, Jiakang, Lu, Yonggang

A Comprehensive Review of Community Detection in Graphs

arXiv.org Artificial Intelligence, Sep-26-2023

The study of complex networks has significantly advanced our understanding of community structure, which is a crucial feature of real-world graphs. Detecting communities in graphs is a challenging problem with applications in sociology, biology, and computer science. Despite the efforts of an interdisciplinary community of scientists, a satisfactory solution to this problem has not yet been achieved. This review article delves into the topic of community detection in graphs, which plays a crucial role in understanding the organization and functioning of complex systems. We begin by introducing the concept of community structure, which refers to the arrangement of vertices into clusters with strong internal connections and weaker connections between clusters. Then, we provide a thorough exposition of various community detection methods, including a new method designed by us. Additionally, we explore real-world applications of community detection in diverse networks. In conclusion, this comprehensive review provides a deep understanding of community detection in graphs. It serves as a valuable resource for researchers and practitioners in multiple disciplines, offering insights into the challenges, methodologies, and applications of community detection in complex networks.

algorithm, community detection, graph, (12 more...)

2309.11798

Country:

North America > United States (1.00)
Asia > Singapore (0.04)
Asia > China > Gansu Province > Lanzhou (0.04)
(9 more...)

Genre:

Research Report (1.00)
Overview (1.00)

Industry:

Information Technology (0.95)
Health & Medicine > Therapeutic Area (0.93)
Government > Regional Government > North America Government > United States Government (0.68)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.93)
(3 more...)