Goto

Collaborating Authors

 clusterer


Rock the KASBA: Blazingly Fast and Accurate Time Series Clustering

arXiv.org Artificial Intelligence

Time series data has become increasingly prevalent across numerous domains, driving a growing demand for time series machine learning techniques. Among these, time series clustering (TSCL) stands out as one of the most popular machine learning tasks. TSCL serves as a powerful exploratory analysis tool and is also employed as a preprocessing step or subroutine for various tasks, including anomaly detection, segmentation, and classification. The most popular TSCL algorithms are either fast (in terms of run time) but perform poorly on benchmark problems, or perform well on benchmarks but scale poorly. We present a new TSCL algorithm, the $k$-means (K) accelerated (A) Stochastic subgradient (S) Barycentre (B) Average (A) (KASBA) clustering algorithm. KASBA is a $k$-means clustering algorithm that uses the Move-Split-Merge (MSM) elastic distance at all stages of clustering, applies a randomised stochastic subgradient gradient descent to find barycentre centroids, links each stage of clustering to accelerate convergence and exploits the metric property of MSM distance to avoid a large proportion of distance calculations. It is a versatile and scalable clusterer designed for real-world TSCL applications. It allows practitioners to balance run time and clustering performance. We demonstrate through extensive experimentation that KASBA produces significantly better clustering than the faster state of the art clusterers and is offers orders of magnitude improvement in run time over the most performant $k$-means alternatives.


An Empirical Study into Clustering of Unseen Datasets with Self-Supervised Encoders

arXiv.org Artificial Intelligence

Self-supervised learning (SSL) has attracted great interest in recent years across almost every machine learning sub-field, due to the promise of being able to harness large quantities of unlabelled data and obtaining generic feature embeddings useful for a variety of downstream tasks (Balestriero et al., 2023). This has, for example, led to the development of impressive large language models (Brown et al., 2020) and computer vision systems trained on 1 billion images (Goyal et al., 2021). However, while the embeddings from an SSL-trained encoder can perform well on downstream tasks after fine-tuning the network, there has been less investigation into the utility of the embeddings without fine-tuning. Prior work (Vaze et al., 2022; Zhou and Zhang, 2022) suggests SSL feature encoders generate embeddings suitable for clustering, but nonetheless adjust the feature encoders through fine-tuning. Yet, widespread interest in the application of large pretrained models on custom datasets, combined with prohibitive cost of compute, make this question important and increasingly urgent. We find that to date there has been no investigation into whether SSL-trained feature encoders can serve as a foundation for clustering, yielding informative groupings of embeddings on real-world datasets that were totally unseen to the encoder during its training. Vaze et al. (2023) showed that features from SSL encoders are typically biased toward shape features and not color, texture, or count when clustered using K-Means. However, this was conducted using a synthetic dataset, where very specific object attributes could be disentangled.


Masked LoGoNet: Fast and Accurate 3D Image Analysis for Medical Domain

arXiv.org Artificial Intelligence

Standard modern machine-learning-based imaging methods have faced challenges in medical applications due to the high cost of dataset construction and, thereby, the limited labeled training data available. Additionally, upon deployment, these methods are usually used to process a large volume of data on a daily basis, imposing a high maintenance cost on medical facilities. In this paper, we introduce a new neural network architecture, termed LoGoNet, with a tailored self-supervised learning (SSL) method to mitigate such challenges. LoGoNet integrates a novel feature extractor within a U-shaped architecture, leveraging Large Kernel Attention (LKA) and a dual encoding strategy to capture both long-range and short-range feature dependencies adeptly. This is in contrast to existing methods that rely on increasing network capacity to enhance feature extraction. This combination of novel techniques in our model is especially beneficial in medical image segmentation, given the difficulty of learning intricate and often irregular body organ shapes, such as the spleen. Complementary, we propose a novel SSL method tailored for 3D images to compensate for the lack of large labeled datasets. The method combines masking and contrastive learning techniques within a multi-task learning framework and is compatible with both Vision Transformer (ViT) and CNN-based models. We demonstrate the efficacy of our methods in numerous tasks across two standard datasets (i.e., BTCV and MSD). Benchmark comparisons with eight state-of-the-art models highlight LoGoNet's superior performance in both inference time and accuracy.


Highly Efficient Real-Time Streaming and Fully On-Device Speaker Diarization with Multi-Stage Clustering

arXiv.org Artificial Intelligence

While recent research advances in speaker diarization mostly focus on improving the quality of diarization results, there is also an increasing interest in improving the efficiency of diarization systems. In this paper, we demonstrate that a multi-stage clustering strategy that uses different clustering algorithms for input of different lengths can address multi-faceted challenges of on-device speaker diarization applications. Specifically, a fallback clusterer is used to handle short-form inputs; a main clusterer is used to handle medium-length inputs; and a pre-clusterer is used to compress long-form inputs before they are processed by the main clusterer. Both the main clusterer and the pre-clusterer can be configured with an upper bound of the computational complexity to adapt to devices with different resource constraints. This multi-stage clustering strategy is critical for streaming on-device speaker diarization systems, where the budgets of CPU, memory and battery are tight.


Model-based clustering using non-parametric Hidden Markov Models

arXiv.org Machine Learning

Thanks to their dependency structure, non-parametric Hidden Markov Models (HMMs) are able to handle model-based clustering without specifying group distributions. The aim of this work is to study the Bayes risk of clustering when using HMMs and to propose associated clustering procedures. We first give a result linking the Bayes risk of classification and the Bayes risk of clustering, which we use to identify the key quantity determining the difficulty of the clustering task. We also give a proof of this result in the i.i.d. framework, which might be of independent interest. Then we study the excess risk of the plugin classifier. All these results are shown to remain valid in the online setting where observations are clustered sequentially. Simulations illustrate our findings.


A Review and Evaluation of Elastic Distance Functions for Time Series Clustering

arXiv.org Artificial Intelligence

Time series clustering is the act of grouping time series data without recourse to a label. Algorithms that cluster time series can be classified into two groups: those that employ a time series specific distance measure; and those that derive features from time series. Both approaches usually rely on traditional clustering algorithms such as $k$-means. Our focus is on distance based time series that employ elastic distance measures, i.e. distances that perform some kind of realignment whilst measuring distance. We describe nine commonly used elastic distance measures and compare their performance with k-means and k-medoids clustering. Our findings are surprising. The most popular technique, dynamic time warping (DTW), performs worse than Euclidean distance with k-means, and even when tuned, is no better. Using k-medoids rather than k-means improved the clusterings for all nine distance measures. DTW is not significantly better than Euclidean distance with k-medoids. Generally, distance measures that employ editing in conjunction with warping perform better, and one distance measure, the move-split-merge (MSM) method, is the best performing measure of this study. We also compare to clustering with DTW using barycentre averaging (DBA). We find that DBA does improve DTW k-means, but that the standard DBA is still worse than using MSM. Our conclusion is to recommend MSM with k-medoids as the benchmark algorithm for clustering time series with elastic distance measures. We provide implementations in the aeon toolkit, results and guidance on reproducing results on the associated GitHub repository.


Earth Engine Tutorial #31: Machine Learning with Earth Engine - Unsupervised Classification

#artificialintelligence

This tutorial shows you how to perform unsupervised classification (e.g., KMeans clustering) in Earth Engine. The ee.Clusterer package handles unsupervised classification (or clustering) in Earth Engine. These algorithms are currently based on the algorithms with the same name in Weka. More details about each Clusterer are available in the reference docs in the Code Editor. Clusterers are used in the same manner as classifiers in Earth Engine.


Optimal Clustering under Uncertainty

arXiv.org Machine Learning

Classical clustering algorithms typically either lack an underlying probability framework to make them predictive or focus on parameter estimation rather than defining and minimizing a notion of error. Recent work addresses these issues by developing a probabilistic framework based on the theory of random labeled point processes and characterizing a Bayes clusterer that minimizes the number of misclustered points. The Bayes clusterer is analogous to the Bayes classifier. Whereas determining a Bayes classifier requires full knowledge of the feature-label distribution, deriving a Bayes clusterer requires full knowledge of the point process. When uncertain of the point process, one would like to find a robust clusterer that is optimal over the uncertainty, just as one may find optimal robust classifiers with uncertain feature-label distributions. Herein, we derive an optimal robust clusterer by first finding an effective random point process that incorporates all randomness within its own probabilistic structure and from which a Bayes clusterer can be derived that provides an optimal robust clusterer relative to the uncertainty. This is analogous to the use of effective class-conditional distributions in robust classification. After evaluating the performance of robust clusterers in synthetic mixtures of Gaussians models, we apply the framework to granular imaging, where we make use of the asymptotic granulometric moment theory for granular images to relate robust clustering theory to the application.