Speech Sequence Embeddings using Nearest Neighbors Contrastive Learning

Robin Algayres, Adel Nabli, Benoit Sagot, Emmanuel Dupoux

arXiv.org Artificial Intelligence 

We introduce a simple neural encoder architecture that can be trained using an unsupervised contrastive learning objective which gets its positive samples from data-augmented k-Nearest Neighbors search. We show that when built on top of recent self-supervised audio representations [1, 2, 3], this method can be applied iteratively and yield competitive SSE as evaluated on two tasks: query-by-example of random sequences of speech, and spoken term discovery. On both tasks our method pushes the state-of-the-art by a significant margin across 5 different languages. Finally, we establish a benchmark on a query-by-example …

Building on similar ideas in vision and speech, we select our positive examples through a mix of time-stretching data augmentation [26] and k-Nearest Neighbors search [27, 28]. Figure 1 gives an overview of our method. To evaluate our method, we test our model on 5 types of acoustic features: MFCCs, CPC [4, 3], HuBERT [1] and Wav2Vec 2.0 (Base and Large) [2]. We pick the best method from our LibriSpeech benchmark and show that when applied without any change to the task of spoken term discovery as defined in the zero resource challenges [29], we beat the state of the art on the NED/COV metric by a large margin in 5 new datasets.
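The core idea above — contrastive learning whose positive pairs come from a k-Nearest Neighbors search over the embedding space rather than from labels — can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: it uses a toy random batch, cosine-similarity kNN, and a standard InfoNCE-style loss, and the function names (`knn_positives`, `info_nce_loss`) are my own.

```python
import numpy as np

def knn_positives(embeddings, k=1):
    """For each row, return indices of its k nearest neighbors
    under cosine similarity, excluding the row itself."""
    norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = norm @ norm.T
    np.fill_diagonal(sim, -np.inf)  # never pick yourself as positive
    return np.argsort(-sim, axis=1)[:, :k]

def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE: each anchor is pulled toward its own positive and
    pushed away from the other positives in the batch (negatives)."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature
    # row-wise log-softmax; the matching positive sits on the diagonal
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy batch standing in for 6 speech-segment embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 8))
pos_idx = knn_positives(emb, k=1)[:, 0]   # nearest neighbor as positive
loss = info_nce_loss(emb, emb[pos_idx])
```

In the paper's setting the positives would additionally pass through time-stretching data augmentation before entering the loss, and the kNN search runs over a large pool of segment embeddings rather than a single batch; both are orthogonal to the loss mechanics shown here.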
