k-means model
Data-Driven Discovery of Feature Groups in Clinical Time Series
Sergeev, Fedor, Burger, Manuel, Leshetkina, Polina, Fortuin, Vincent, Rätsch, Gunnar, Kuznetsova, Rita
Clinical time series data are critical for patient monitoring and predictive modeling. These time series are typically multivariate and often comprise hundreds of heterogeneous features from different data sources. The grouping of features based on similarity and relevance to the prediction task has been shown to enhance the performance of deep learning architectures. However, defining these groups a priori using only semantic knowledge is challenging, even for domain experts. To address this, we propose a novel method that learns feature groups by clustering weights of feature-wise embedding layers. This approach seamlessly integrates into standard supervised training and discovers the groups that directly improve downstream performance on clinically relevant tasks. We demonstrate that our method outperforms static clustering approaches on synthetic data and achieves performance comparable to expert-defined groups on real-world medical data. Moreover, the learned feature groups are clinically interpretable, enabling data-driven discovery of task-relevant relationships between variables.
SEF-MK: Speaker-Embedding-Free Voice Anonymization through Multi-k-means Quantization
Tang, Beilong, Miao, Xiaoxiao, Wang, Xin, Li, Ming
Voice anonymization protects speaker privacy by concealing identity while preserving linguistic and paralinguistic content. Self-supervised learning (SSL) representations encode linguistic features but preserve speaker traits. We propose a novel speaker-embedding-free framework called SEF-MK. Instead of using a single k-means model trained on the entire dataset, SEF-MK anonymizes SSL representations for each utterance by randomly selecting one of multiple k-means models, each trained on a different subset of speakers. We explore this approach from both attacker and user perspectives. Extensive experiments show that, compared to a single k-means model, SEF-MK with multiple k-means models better preserves linguistic and emotional content from the user's viewpoint. However, from the attacker's perspective, utilizing multiple k-means models boosts the effectiveness of privacy attacks. These insights can aid users in designing voice anonymization systems to mitigate attacker threats.
Seamless Language Expansion: Enhancing Multilingual Mastery in Self-Supervised Models
Xu, Jing, Wu, Minglin, Wu, Xixin, Meng, Helen
Self-supervised (SSL) models have shown great performance in various downstream tasks. However, they are typically developed for limited languages, and may encounter new languages in real-world. Developing a SSL model for each new language is costly. Thus, it is vital to figure out how to efficiently adapt existed SSL models to a new language without impairing its original abilities. We propose adaptation methods which integrate LoRA to existed SSL models to extend new language. We also develop preservation strategies which include data combination and re-clustering to retain abilities on existed languages. Applied to mHuBERT, we investigate their effectiveness on speech re-synthesis task. Experiments show that our adaptation methods enable mHuBERT to be applied to a new language (Mandarin) with MOS value increased about 1.6 and the relative value of WER reduced up to 61.72%. Also, our preservation strategies ensure that the performance on both existed and new languages remains intact.
Pac-HuBERT: Self-Supervised Music Source Separation via Primitive Auditory Clustering and Hidden-Unit BERT
Chen, Ke, Wichern, Gordon, Germain, François G., Roux, Jonathan Le
In spite of the progress in music source separation research, the small amount of publicly-available clean source data remains a constant limiting factor for performance. Thus, recent advances in self-supervised learning present a largely-unexplored opportunity for improving separation models by leveraging unlabelled music data. In this paper, we propose a self-supervised learning framework for music source separation inspired by the HuBERT speech representation model. We first investigate the potential impact of the original HuBERT model by inserting an adapted version of it into the well-known Demucs V2 time-domain separation model architecture. We then propose a time-frequency-domain self-supervised model, Pac-HuBERT (for primitive auditory clustering HuBERT), that we later use in combination with a Res-U-Net decoder for source separation. Pac-HuBERT uses primitive auditory features of music as unsupervised clustering labels to initialize the self-supervised pretraining process using the Free Music Archive (FMA) dataset. The resulting framework achieves better source-to-distortion ratio (SDR) performance on the MusDB18 test set than the original Demucs V2 and Res-U-Net models. We further demonstrate that it can boost performance with small amounts of supervised data. Ultimately, our proposed framework is an effective solution to the challenge of limited clean source data for music source separation.
Randomly Projected Convex Clustering Model: Motivation, Realization, and Cluster Recovery Guarantees
Wang, Ziwen, Yuan, Yancheng, Ma, Jiaming, Zeng, Tieyong, Sun, Defeng
In this paper, we propose a randomly projected convex clustering model for clustering a collection of $n$ high dimensional data points in $\mathbb{R}^d$ with $K$ hidden clusters. Compared to the convex clustering model for clustering original data with dimension $d$, we prove that, under some mild conditions, the perfect recovery of the cluster membership assignments of the convex clustering model, if exists, can be preserved by the randomly projected convex clustering model with embedding dimension $m = O(\epsilon^{-2}\log(n))$, where $0 < \epsilon < 1$ is some given parameter. We further prove that the embedding dimension can be improved to be $O(\epsilon^{-2}\log(K))$, which is independent of the number of data points. Extensive numerical experiment results will be presented in this paper to demonstrate the robustness and superior performance of the randomly projected convex clustering model. The numerical results presented in this paper also demonstrate that the randomly projected convex clustering model can outperform the randomly projected K-means model in practice.
Asymptotics for The $k$-means
Clustering is one of the most important unsupervised learning techniques for understanding the underlying data structures. The goal is to partition a data set into many subsets, called clusters, such that the observations within the subsets are the most homogeneous and the observations between the subsets are the most heterogeneous. Clustering is usually carried out by specifying a similarity or dissimilarity measure between observations. Examples include the k-means [17, 19, 29, 37], the k-medians [3], the k-modes [5], and the generalized k-means [2, 31, 45], as well as many of their modifications [21, 24, 42]. Among those, the k-means has been considered as one of the most straightforward and popular methods since it was proposed sixty years ago [23, 36]. Although it is well known, the investigation of the theoretical properties is still far behind, leading to difficulties in developing more precise k-means methods in practice. The goal of the present research is to propose a new concept called clustering consistency for the asymptotics of the k-means with a resulting clustering method better than the existing k-means methods adopted by many software packages, including those adopted by R and Python.
A Comparative Study of Self-supervised Speech Representation Based Voice Conversion
Huang, Wen-Chin, Yang, Shu-Wen, Hayashi, Tomoki, Toda, Tomoki
We present a large-scale comparative study of self-supervised speech representation (S3R)-based voice conversion (VC). In the context of recognition-synthesis VC, S3Rs are attractive owing to their potential to replace expensive supervised representations such as phonetic posteriorgrams (PPGs), which are commonly adopted by state-of-the-art VC systems. Using S3PRL-VC, an open-source VC software we previously developed, we provide a series of in-depth objective and subjective analyses under three VC settings: intra-/cross-lingual any-to-one (A2O) and any-to-any (A2A) VC, using the voice conversion challenge 2020 (VCC2020) dataset. We investigated S3R-based VC in various aspects, including model type, multilinguality, and supervision. We also studied the effect of a post-discretization process with k-means clustering and showed how it improves in the A2A setting. Finally, the comparison with state-of-the-art VC systems demonstrates the competitiveness of S3R-based VC and also sheds light on the possible improving directions.
K-Means Clustering Project -- Banknote Authentication
Have you ever been in a situation where you were handing money to the clerks at a supermarket only to find that the money is fake while there was a long line of people behind you waiting to check out? I personally had experienced this situation one time and that embarrassment of being assumed to be an immoral cheapskate just stuck in my head for a long time. This motivated me to conduct this project, building a K-Means Clustering model to detect if a banknote is real or fake. This dataset is about distinguishing genuine and forged banknotes. Data were extracted from images that were taken from genuine and forged banknote-like specimens.
How to Build Audience Clusters With Website Data Using BigQuery ML
A common marketing analytics challenge is to understand consumer behavior and develop customer attributes or archetypes. As organizations get better at tackling this problem, they can activate marketing strategies to incorporate additional customer knowledge into their campaigns. Building customer profiles is now easier than ever with BigQuery ML, using a technique called clustering. In this post, you'll learn how to create segmentation and how to use these audiences for marketing activation. Clustering algorithms can group similar user behavior together to build segmentation used for marketing.
Deep Clustering for Financial Market Segmentation
Unsupervised learning, supervised learning and reinforcement learning are three main categories of machine learning methods. Unsupervised learning has many applications such as clustering, dimensionality reduction, etc. The machine learning algorithms K-means and Principal Component Analysis (PCA) are widely used for clustering and dimensionality reduction respectively. Similarly to PCA, the T-distributed Stochastic Neighbor Embedding (t-SNE) is another unsupervised machine learning algorithm for dimensionality reduction. With the advancement of unsupervised deep learning, the Autoencoder neural network is now frequently used for high dimensionality (e.g., a dataset with thousands or more features) reduction. Autoencoder can also be combined with supervised learning (e.g., Random Forest) to form Semi-supervised learning method (see deep patient as an example).