AITopics | Clustering

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)
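The definition above leaves "similar" open and does not prescribe an algorithm. As a minimal sketch, assuming Euclidean distance as the similarity notion, the following pure-Python version of Lloyd's k-means algorithm groups points around cluster means. The function name `kmeans`, the toy data, and the choice of the first `k` points as initial centroids are illustrative assumptions, not part of the quoted definition.

```python
from math import dist


def kmeans(points, k, iterations=100):
    """Lloyd's algorithm on a list of equal-length numeric tuples."""
    # Assumption: the first k points are used as initial centroids.
    # This keeps the sketch deterministic; it is not a robust initialisation.
    centroids = [tuple(p) for p in points[:k]]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda j: dist(p, centroids[j])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # centroids stopped moving, so assignments are stable
        centroids = new_centroids
    return labels, centroids


points = [(0.0, 0.0), (5.0, 5.0), (0.0, 1.0), (5.0, 6.0), (1.0, 0.0), (6.0, 5.0)]
labels, centroids = kmeans(points, k=2)
print(labels)  # [0, 1, 0, 1, 0, 1]
```

Points near the origin end up in one cluster and points near (5, 5) in the other. Other similarity notions (density, graph connectivity) lead to different algorithms.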

Sharp-SSL: Selective high-dimensional axis-aligned random projections for semi-supervised learning

Wang, Tengyao, Dobriban, Edgar, Gataric, Milana, Samworth, Richard J.

arXiv.org Machine Learning, Apr-18-2023

We propose a new method for high-dimensional semi-supervised learning problems based on the careful aggregation of the results of a low-dimensional procedure applied to many axis-aligned random projections of the data. Our primary goal is to identify important variables for distinguishing between the classes; existing low-dimensional methods can then be applied for final class assignment. Motivated by a generalized Rayleigh quotient, we score projections according to the traces of the estimated whitened between-class covariance matrices on the projected data. This enables us to assign an importance weight to each variable for a given projection, and to select our signal variables by aggregating these weights over high-scoring projections. Our theory shows that the resulting Sharp-SSL algorithm is able to recover the signal coordinates with high probability when we aggregate over sufficiently many random projections and when the base procedure estimates the whitened between-class covariance matrix sufficiently well. The Gaussian EM algorithm is a natural choice as a base procedure, and we provide a new analysis of its performance in semi-supervised settings that controls the parameter estimation error in terms of the proportion of labeled data in the sample. Numerical results on both simulated data and a real colon tumor dataset support the excellent empirical performance of the method.

algorithm, artificial intelligence, machine learning, (18 more...)

2304.09154

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Wisconsin > Dane County > Madison (0.04)
North America > United States > Pennsylvania (0.04)
(4 more...)

Genre: Research Report > New Finding (0.92)

Industry:

Health & Medicine (0.67)
Education (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)

K-means Clustering Based Feature Consistency Alignment for Label-free Model Evaluation

Miao, Shuyu, Zheng, Lin, Liu, Jingjing, Jin, Hong

arXiv.org Artificial Intelligence, Apr-17-2023

Label-free model evaluation aims to predict model performance on various test sets without relying on ground truths. The main challenge of this task is the absence of labels in the test data, unlike in classical supervised model evaluation. This paper presents our solutions for the 1st DataCV Challenge of the Visual Dataset Understanding workshop at CVPR 2023. Firstly, we propose a novel method called K-means Clustering Based Feature Consistency Alignment (KCFCA), which is tailored to handle the distribution shifts of various datasets. KCFCA utilizes the K-means algorithm to cluster labeled training sets and unlabeled test sets, and then aligns the cluster centers with feature consistency. Secondly, we develop a dynamic regression model to capture the relationship between the shifts in distribution and model accuracy. Thirdly, we design an algorithm to discover the outlier model factors, eliminate the outlier models, and combine the strengths of multiple autoeval models. On the DataCV Challenge leaderboard, our approach secured 2nd place with an RMSE of 6.8526. Our method improved on the best baseline method by 36% (6.8526 vs. 10.7378). Furthermore, our method achieves comparatively robust and strong single-model performance on the validation dataset.

artificial intelligence, dataset, machine learning, (15 more...)

2304.09758

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
Asia > Indonesia > Bali (0.04)

Genre:

Research Report > Promising Solution (0.48)
Research Report > New Finding (0.47)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.89)

Exploring Unsupervised Learning Metrics - KDnuggets

#artificialintelligence, Apr-14-2023, 15:30:43 GMT

Unsupervised learning is a branch of machine learning where models learn patterns from the available data rather than being provided with actual labels. We let the algorithm come up with the answers. In unsupervised learning, there are two main techniques: clustering and dimensionality reduction. The clustering technique uses an algorithm to learn patterns that segment the data. In contrast, the dimensionality reduction technique tries to reduce the number of features while keeping as much of the original information intact as possible.
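The excerpt does not say which metrics the article covers. As one hedged illustration of how a clustering can be scored without labels, the sketch below computes the mean silhouette coefficient in pure Python: for each point, `a` is the mean distance to the other members of its own cluster, `b` is the smallest mean distance to any other cluster, and the score is `(b - a) / max(a, b)`. The function name and the toy data are assumptions made for this example.

```python
from math import dist


def silhouette_score(points, labels):
    """Mean silhouette coefficient using Euclidean distance."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette coefficient needs at least two clusters")

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # dist(p, p) is 0, so summing over the whole cluster and dividing
        # by (size - 1) gives the mean distance to the *other* members.
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # groups nearby points together
bad = silhouette_score(points, [0, 1, 0, 1])   # splits nearby points apart
print(good > bad)  # True
```

Scores close to 1 indicate compact, well-separated clusters; scores below 0 suggest points sit closer to another cluster than to their own.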

algorithm, dimensionality reduction, metric, (9 more...)

Country: Asia > Indonesia (0.05)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.52)

CiPR: An Efficient Framework with Cross-instance Positive Relations for Generalized Category Discovery

Hao, Shaozhe, Han, Kai, Wong, Kwan-Yee K.

arXiv.org Artificial Intelligence, Apr-14-2023

We tackle the issue of generalized category discovery (GCD). GCD considers the open-world problem of automatically clustering a partially labelled dataset, in which the unlabelled data contain instances from novel categories as well as from the labelled classes. In this paper, we address the GCD problem without a known category number in the unlabelled data. We propose a framework, named CiPR, to bootstrap the representation by exploiting Cross-instance Positive Relations, which are neglected in existing methods, for contrastive learning in the partially labelled data. First, to obtain reliable cross-instance relations to facilitate the representation learning, we introduce a semi-supervised hierarchical clustering algorithm, named selective neighbor clustering (SNC), which can produce a clustering hierarchy directly from the connected components in the graph constructed by selective neighbors. We also extend SNC to be capable of label assignment for the unlabelled instances with the given class number. Moreover, we present a method to estimate the unknown class number using SNC with a joint reference score considering clustering indexes of both labelled and unlabelled data. Finally, we thoroughly evaluate our framework on public generic image recognition datasets and challenging fine-grained datasets, establishing a new state of the art on all of them.

artificial intelligence, machine learning, unlabelled data, (15 more...)

2304.06928

Country: Asia > China > Hong Kong (0.04)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.88)

Uncovering the Inner Workings of STEGO for Safe Unsupervised Semantic Segmentation

Koenig, Alexander, Schambach, Maximilian, Otterbach, Johannes

arXiv.org Artificial Intelligence, Apr-14-2023

Self-supervised pre-training strategies have recently shown impressive results for training general-purpose feature extraction backbones in computer vision. In combination with the Vision Transformer architecture, the DINO self-distillation technique has interesting emerging properties, such as unsupervised clustering in the latent space and semantic correspondences of the produced features without using explicit human-annotated labels. The STEGO method for unsupervised semantic segmentation contrastively distills feature correspondences of a DINO-pre-trained Vision Transformer and recently set a new state of the art. However, the detailed workings of STEGO have yet to be disentangled, preventing its usage in safety-critical applications. This paper provides a deeper understanding of the STEGO architecture and training strategy by conducting studies that uncover the working mechanisms behind STEGO, reproduce and extend its experimental validation, and investigate the ability of STEGO to transfer to different datasets. Results demonstrate that the STEGO architecture can be interpreted as a semantics-preserving dimensionality reduction technique.

artificial intelligence, data mining, machine learning, (18 more...)

2304.07314

Country:

Europe > Germany > Brandenburg > Potsdam (0.06)
North America > Canada > Ontario > Toronto (0.04)
Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Detection and Estimation of Structural Breaks in High-Dimensional Functional Time Series

Li, Degui, Li, Runze, Shang, Han Lin

arXiv.org Machine Learning, Apr-14-2023

Modelling functional time series, that is, time series of random functions defined within a finite interval, has become one of the main frontiers of development in time series models. Various functional linear and nonlinear time series models have been proposed and extensively studied in the past two decades (e.g., Bosq, 2000; Hörmann and Kokoszka, 2010; Horváth and Kokoszka, 2012; Hörmann, Horváth and Reeder, 2013; Li, Robinson and Shang, 2020). These models, together with relevant methodologies, have been applied to various fields such as biology, demography, economics, environmental science and finance. However, the model frameworks and methodologies developed in the aforementioned literature rely heavily on the stationarity assumption, which is often rejected when testing functional time series data in practice. For example, Horváth, Kokoszka and Rice (2014) find evidence of nonstationarity for intraday price curves of some stocks collected in the US market; Aue, Rice and Sönmez (2018) reject the null hypothesis of stationarity for the temperature curves collected in Australia; and Li, Robinson and Shang (2023) reveal evidence of nonstationary features in the functional time series constructed from age- and sex-specific life-table death counts. It thus becomes imperative to test whether the collected functional time series are stationary. The primary interest of this paper is to test whether there exist structural breaks in the mean function over time and subsequently to estimate the locations of the breaks if they do exist. There has been increasing interest in detecting and estimating structural breaks in functional time series. Broadly speaking, there are two types of detection techniques.

artificial intelligence, functional time series, machine learning, (17 more...)

2304.07003

Country:

Oceania > Australia (0.24)
North America > United States > New York (0.04)
Europe > France (0.04)
(33 more...)

Genre: Research Report (0.81)

Industry:

Banking & Finance > Trading (1.00)
Information Technology (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Task Adaptive Feature Transformation for One-Shot Learning

Ziko, Imtiaz Masud, Lecue, Freddy, Ayed, Ismail Ben

arXiv.org Artificial Intelligence, Apr-13-2023

We introduce a simple non-linear embedding adaptation layer, which is fine-tuned on top of fixed pre-trained features for one-shot tasks, significantly improving transductive entropy-based inference for low-shot regimes. Our norm-induced transformation can be understood as a re-parametrization of the feature space that disentangles the representations of different classes in a task-specific manner. It focuses on the relevant feature dimensions while hindering the effects of non-relevant dimensions that may cause overfitting in a one-shot setting. We also provide an interpretation of our proposed feature transformation in the basic case of few-shot inference with K-means clustering. Furthermore, we give an interesting bound-optimization link between K-means and entropy minimization. This emphasizes why our feature transformation is useful in the context of entropy minimization. We report comprehensive experiments, which show consistent improvements over a variety of one-shot benchmarks, outperforming recent state-of-the-art methods.

artificial intelligence, machine learning, transformation, (15 more...)

2304.06832

Country: North America > Canada > Quebec > Montreal (0.04)

Genre: Research Report (0.70)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.35)

MLOps Spanning Whole Machine Learning Life Cycle: A Survey

Zhengxin, Fang, Yi, Yuan, Jingyu, Zhang, Yue, Liu, Yuechen, Mu, Qinghua, Lu, Xiwei, Xu, Jeff, Wang, Chen, Wang, Shuai, Zhang, Shiping, Chen

arXiv.org Artificial Intelligence, Apr-13-2023

Google AlphaGo's win has significantly motivated and sped up machine learning (ML) research and development, leading to tremendous technical advances in ML and wider adoption in various domains (e.g., Finance, Health, Defense, and Education). These advances have produced numerous new concepts and technologies, too many for people to keep up with and a source of confusion, especially for newcomers to the ML area. This paper aims to present a clear picture of the state of the art of existing ML technologies through a comprehensive survey. We lay out this survey by viewing ML as an MLOps (ML Operations) process, where the key concepts and activities are collected and elaborated with representative works and surveys. We hope that this paper can serve as a quick reference manual (a survey of surveys) that gives newcomers to ML (e.g., researchers, practitioners) an overview of the MLOps process, a good understanding of the key technologies used in each step of the ML process, and pointers to where to find more details.

evolutionary algorithm, machine learning life cycle, reinforcement learning, (16 more...)

2304.07296

Country:

North America > United States > New York > New York County > New York City (0.04)
Oceania > Australia > New South Wales > Sydney (0.04)
South America (0.04)
(7 more...)

Genre:

Workflow (1.00)
Research Report (1.00)
Overview (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine (1.00)
Education (1.00)
(4 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
(7 more...)

The growclusters Package for R

Powers, Randall, Martinez, Wendy, Savitsky, Terrance

arXiv.org Artificial Intelligence, Apr-12-2023

The growclusters package for R implements an enhanced version of k-means clustering that allows discovery of local clusterings or partitions for a collection of data sets that each draw their cluster means from a single, global partition. The package contains functions to estimate a partition structure for multivariate data. Estimation is performed under a penalized optimization derived from Bayesian non-parametric formulations. This paper describes some of the functions and capabilities of the growclusters package, including the creation of R Shiny applications designed to visually illustrate the operation and functionality of the growclusters package.
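The abstract does not spell out the penalized objective that growclusters optimizes. One published example of a k-means variant derived from a Bayesian non-parametric model is DP-means, which minimizes the within-cluster sum of squared distances plus a fixed penalty per cluster, so the number of clusters is chosen by the data. The Python sketch below shows that generic single-dataset algorithm only; it is not the growclusters implementation (which is in R and handles collections of data sets), and the names `dp_means` and `penalty` are assumptions made for this example.

```python
def dp_means(points, penalty, iterations=100):
    """DP-means style clustering: open a new cluster whenever a point's
    squared distance to every existing centroid exceeds `penalty`."""
    dim = len(points[0])
    # Start with a single cluster centred on the global mean.
    centroids = [tuple(sum(p[d] for p in points) / len(points) for d in range(dim))]
    labels = [0] * len(points)
    for _ in range(iterations):
        changed = False
        for i, p in enumerate(points):
            sq = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            j = min(range(len(sq)), key=sq.__getitem__)
            if sq[j] > penalty:
                # Too far from every centroid: this point seeds a new cluster.
                centroids.append(tuple(p))
                j = len(centroids) - 1
            if labels[i] != j:
                labels[i] = j
                changed = True
        # Drop clusters that ended up empty and renumber the rest from 0.
        used = sorted(set(labels))
        remap = {old: new for new, old in enumerate(used)}
        labels = [remap[lab] for lab in labels]
        # Move each surviving centroid to the mean of its members.
        centroids = []
        for j in range(len(used)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
        if not changed:
            break
    return labels, centroids


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels, centroids = dp_means(points, penalty=9.0)
print(labels)     # [0, 0, 1, 1]
print(centroids)  # [(0.0, 0.5), (10.0, 0.5)]

# A very large penalty makes new clusters too expensive, leaving one cluster.
one_labels, one_centroids = dp_means(points, penalty=1000.0)
```

The penalty plays the role that `k` plays in ordinary k-means: smaller values produce more clusters, larger values fewer.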

application, artificial intelligence, machine learning, (15 more...)

2304.06145

Country:

North America > United States > Virginia > Alexandria County > Alexandria (0.05)
Europe (0.05)

Genre: Research Report (0.65)

Industry: Government > Regional Government > North America Government > United States Government (0.49)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.70)

Vec2GC -- A Graph Based Clustering Method for Text Representations

Rao, Rajesh N, Chakraborty, Manojit

arXiv.org Artificial Intelligence, Apr-12-2023

NLP pipelines with limited or no labeled data rely on unsupervised methods for document processing. Unsupervised approaches typically depend on clustering of terms or documents. In this paper, we introduce a novel clustering algorithm, Vec2GC (Vector to Graph Communities), an end-to-end pipeline to cluster terms or documents for any given text corpus. Our method uses community detection on a weighted graph of the terms or documents, created using text representation learning. The Vec2GC clustering algorithm is a density-based approach that also supports hierarchical clustering.

artificial intelligence, data mining, machine learning, (15 more...)

2104.09439

Country:

Asia > India > Karnataka > Bengaluru (0.05)
North America > United States > New York > New York County > New York City (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
