AITopics | acoustic unit discovery

SAHMM-VAE: A Source-Wise Adaptive Hidden Markov Prior Variational Autoencoder for Unsupervised Blind Source Separation

arXiv.org Machine LearningMar-30-2026

We propose SAHMM-VAE, a source-wise adaptive Hidden Markov prior variational autoencoder for unsupervised blind source separation. Instead of treating the latent prior as a single generic regularizer, the proposed framework assigns each latent dimension its own adaptive regime-switching prior, so that different latent dimensions are pulled toward different source-specific temporal organizations during training. Under this formulation, source separation is not implemented as an external post-processing step; it is embedded directly into variational learning itself. The encoder, decoder, posterior parameters, and source-wise prior parameters are optimized jointly, where the encoder progressively learns an inference map that behaves like an approximate inverse of the mixing transformation, while the decoder plays the role of the generative mixing model. Through this coupled optimization, the gradual alignment between posterior source trajectories and heterogeneous HMM priors becomes the mechanism through which different latent dimensions separate into different source components. To instantiate this idea, we develop three branches within one common framework: a Gaussian-emission HMM prior, a Markov-switching autoregressive HMM prior, and an HMM state-flow prior with state-wise autoregressive flow transformations. Experiments show that the proposed framework achieves unsupervised source recovery while also learning meaningful source-wise switching structures. More broadly, the method extends our structured-prior VAE line from smooth, mixture-based, and flow-based latent priors to adaptive switching priors, and provides a useful basis for future work on interpretable and potentially identifiable latent source modeling.

artificial intelligence, dimension, machine learning, (14 more...)

arXiv.org Machine Learning

2603.25776

Country:

Asia > Japan > Honshū > Tōhoku > Iwate Prefecture > Morioka (0.05)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.91)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.87)

Add feedback

Variable-rate hierarchical CPC leads to acoustic unit discovery in speech

Neural Information Processing SystemsDec-25-2025, 13:07:35 GMT

The success of deep learning comes from its ability to capture the hierarchical structure of data by learning high-level representations defined in terms of low-level ones. In this paper we explore self-supervised learning of hierarchical representations of speech by applying multiple levels of Contrastive Predictive Coding (CPC). We observe that simply stacking two CPC models does not yield significant improvements over single-level architectures. Inspired by the fact that speech is often described as a sequence of discrete units unevenly distributed in time, we propose a model in which the output of a low-level CPC module is non-uniformly downsampled to directly minimize the loss of a high-level CPC module. The latter is designed to also enforce a prior of separability and discreteness in its representations by enforcing dissimilarity of successive high-level representations through focused negative sampling, and by quantization of the prediction targets. Accounting for the structure of the speech signal improves upon single-level CPC features and enhances the disentanglement of the learned representations, as measured by downstream speech recognition tasks, while resulting in a meaningful segmentation of the signal that closely resembles phone boundaries.

acoustic unit discovery, representation, variable-rate hierarchical cpc, (5 more...)

Neural Information Processing Systems

Country: South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.07)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.77)
Information Technology > Artificial Intelligence > Speech (0.60)

Add feedback

Variable-rate hierarchical CPC leads to acoustic unit discovery in speech

Neural Information Processing SystemsAug-12-2025, 22:37:16 GMT

The success of deep learning comes from its ability to capture the hierarchical structure of data by learning high-level representations defined in terms of low-level ones. In this paper we explore self-supervised learning of hierarchical representations of speech by applying multiple levels of Contrastive Predictive Coding (CPC). We observe that simply stacking two CPC models does not yield significant improvements over single-level architectures. Inspired by the fact that speech is often described as a sequence of discrete units unevenly distributed in time, we propose a model in which the output of a low-level CPC module is non-uniformly downsampled to directly minimize the loss of a high-level CPC module. The latter is designed to also enforce a prior of separability and discreteness in its representations by enforcing dissimilarity of successive high-level representations through focused negative sampling, and by quantization of the prediction targets.

artificial intelligence, machine learning, representation, (5 more...)

Neural Information Processing Systems

Country: South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.09)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.81)

Add feedback

Variable-rate hierarchical CPC leads to acoustic unit discovery in speech

Neural Information Processing SystemsJan-19-2025, 03:27:16 GMT

The success of deep learning comes from its ability to capture the hierarchical structure of data by learning high-level representations defined in terms of low-level ones. In this paper we explore self-supervised learning of hierarchical representations of speech by applying multiple levels of Contrastive Predictive Coding (CPC). We observe that simply stacking two CPC models does not yield significant improvements over single-level architectures. Inspired by the fact that speech is often described as a sequence of discrete units unevenly distributed in time, we propose a model in which the output of a low-level CPC module is non-uniformly downsampled to directly minimize the loss of a high-level CPC module. The latter is designed to also enforce a prior of separability and discreteness in its representations by enforcing dissimilarity of successive high-level representations through focused negative sampling, and by quantization of the prediction targets.

acoustic unit discovery, representation, variable-rate hierarchical cpc, (3 more...)

Neural Information Processing Systems

Country: South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.09)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.81)

Add feedback

Revisiting speech segmentation and lexicon learning with better features

Kamper, Herman, van Niekerk, Benjamin

arXiv.org Artificial IntelligenceJan-31-2024

We revisit a self-supervised method that segments unlabelled speech into word-like segments. We start from the two-stage duration-penalised dynamic programming method that performs zero-resource segmentation without learning an explicit lexicon. In the first acoustic unit discovery stage, we replace contrastive predictive coding features with HuBERT. After word segmentation in the second stage, we get an acoustic word embedding for each segment by averaging HuBERT features. These embeddings are clustered using K-means to get a lexicon. The result is good full-coverage segmentation with a lexicon that achieves state-of-the-art performance on the ZeroSpeech benchmarks.

lexicon, segmentation, sequence, (11 more...)

arXiv.org Artificial Intelligence

2401.17902

Country: Africa > South Africa (0.05)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.53)
Information Technology > Artificial Intelligence > Speech > Acoustic Processing (0.41)

Add feedback

Word Segmentation on Discovered Phone Units with Dynamic Programming and Self-Supervised Scoring

Kamper, Herman

arXiv.org Artificial IntelligenceJan-9-2023

Recent work on unsupervised speech segmentation has used self-supervised models with phone and word segmentation modules that are trained jointly. This paper instead revisits an older approach to word segmentation: bottom-up phone-like unit discovery is performed first, and symbolic word segmentation is then performed on top of the discovered units (without influencing the lower level). To do this, I propose a new unit discovery model, a new symbolic word segmentation model, and then chain the two models to segment speech. Both models use dynamic programming to minimize segment costs from a self-supervised network with an additional duration penalty that encourages longer units. Concretely, for acoustic unit discovery, duration-penalized dynamic programming (DPDP) is used with a contrastive predictive coding model as the scoring network. For word segmentation, DPDP is applied with an autoencoding recurrent neural as the scoring network. The two models are chained in order to segment speech. This approach gives comparable word segmentation results to state-of-the-art joint self-supervised segmentation models on an English benchmark. On French, Mandarin, German and Wolof data, it outperforms previous systems on the ZeroSpeech benchmarks. Analysis shows that the chained DPDP system segments shorter filler words well, but longer words might require some external top-down signal.

machine learning, natural language, segmentation, (17 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/TASLP.2022.3229264

2202.11929

Country: Africa > South Africa (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

Add feedback

Variable-rate hierarchical CPC leads to acoustic unit discovery in speech

Cuervo, Santiago, Łańcucki, Adrian, Marxer, Ricard, Rychlikowski, Paweł, Chorowski, Jan

arXiv.org Artificial IntelligenceDec-4-2022

The success of deep learning comes from its ability to capture the hierarchical structure of data by learning high-level representations defined in terms of low-level ones. In this paper we explore self-supervised learning of hierarchical representations of speech by applying multiple levels of Contrastive Predictive Coding (CPC). We observe that simply stacking two CPC models does not yield significant improvements over single-level architectures. Inspired by the fact that speech is often described as a sequence of discrete units unevenly distributed in time, we propose a model in which the output of a low-level CPC module is non-uniformly downsampled to directly minimize the loss of a high-level CPC module. The latter is designed to also enforce a prior of separability and discreteness in its representations by enforcing dissimilarity of successive high-level representations through focused negative sampling, and by quantization of the prediction targets. Accounting for the structure of the speech signal improves upon single-level CPC features and enhances the disentanglement of the learned representations, as measured by downstream speech recognition tasks, while resulting in a meaningful segmentation of the signal that closely resembles phone boundaries.

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2206.02211

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Europe > Poland > Lower Silesia Province > Wroclaw (0.04)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

A Temporal Extension of Latent Dirichlet Allocation for Unsupervised Acoustic Unit Discovery

van der Merwe, Werner, Kamper, Herman, Preez, Johan du

arXiv.org Machine LearningJun-29-2022

Latent Dirichlet allocation (LDA) is widely used for unsupervised topic modelling on sets of documents. No temporal information is used in the model. However, there is often a relationship between the corresponding topics of consecutive tokens. In this paper, we present an extension to LDA that uses a Markov chain to model temporal information. We use this new model for acoustic unit discovery from speech. As input tokens, the model takes a discretised encoding of speech from a vector quantised (VQ) neural network with 512 codes. The goal is then to map these 512 VQ codes to 50 phone-like units (topics) in order to more closely resemble true phones. In contrast to the base LDA, which only considers how VQ codes co-occur within utterances (documents), the Markov chain LDA additionally captures how consecutive codes follow one another. This extension leads to an increase in cluster quality and phone segmentation results compared to the base LDA. Compared to a recent vector quantised neural network approach that also learns 50 units, the extended LDA model performs better in phone segmentation but worse in mutual information.

information, machine learning, natural language, (16 more...)

arXiv.org Machine Learning

2206.11706

Country:

Africa > South Africa (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.93)

Add feedback

Bayesian Subspace HMM for the Zerospeech 2020 Challenge

Yusuf, Bolaji, Ondel, Lucas

arXiv.org Machine LearningJul-27-2020

In this paper we describe our submission to the Zerospeech 2020 challenge, where the participants are required to discover latent representations from unannotated speech, and to use those representations to perform speech synthesis, with synthesis quality used as a proxy metric for the unit quality. In our system, we use the Bayesian Subspace Hidden Markov Model (SHMM) for unit discovery. The SHMM models each unit as an HMM whose parameters are constrained to lie in a low dimensional subspace of the total parameter space which is trained to model phonetic variability. Our system compares favorably with the baseline on the human-evaluated character error rate while maintaining significantly lower unit bitrate.

bitrate, interspeech, representation, (9 more...)

arXiv.org Machine Learning

2005.09282

Country:

Europe > Czechia > South Moravian Region > Brno (0.04)
Europe > Germany > Baden-Württemberg > Karlsruhe Region > Karlsruhe (0.04)
Asia > Middle East > Republic of Türkiye (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.50)

Industry: Government (0.47)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.92)

Add feedback

Bayesian Subspace Hidden Markov Model for Acoustic Unit Discovery

Ondel, Lucas, Vydana, Hari Krishna, Burget, Lukáš, Černocký, Jan

arXiv.org Machine LearningApr-8-2019

This work tackles the problem of learning a set of language specific acoustic units from unlabeled speech recordings given a set of labeled recordings from other languages. Our approach may be described by the following two steps procedure: first the model learns the notion of acoustic units from the labelled data and then the model uses its knowledge to find new acoustic units on the target language. We implement this process with the Bayesian Subspace Hidden Markov Model (SHMM), a model akin to the Subspace Gaussian Mixture Model (SGMM) where each low dimensional embedding represents an acoustic unit rather than just a HMM's state. The subspace is trained on 3 languages from the GlobalPhone corpus (German, Polish and Spanish) and the AUs are discovered on the TIMIT corpus. Results, measured in equivalent Phone Error Rate, show that this approach significantly outperforms previous HMM based acoustic units discovery systems and compares favorably with the Variational Auto Encoder-HMM.

artificial intelligence, machine learning, subspace, (12 more...)

arXiv.org Machine Learning

1904.03876

Country: