
Collaborating Authors

Rouditchenko, Andrew


AV-CPL: Continuous Pseudo-Labeling for Audio-Visual Speech Recognition

arXiv.org Machine Learning

Audio-visual speech contains synchronized audio and visual information that provides cross-modal supervision to learn representations for both automatic speech recognition (ASR) and visual speech recognition (VSR). We introduce continuous pseudo-labeling for audio-visual speech recognition (AV-CPL), a semi-supervised method to train an audio-visual speech recognition (AVSR) model on a combination of labeled and unlabeled videos with continuously regenerated pseudo-labels. Our models are trained for speech recognition from audio-visual inputs and can perform speech recognition using both audio and visual modalities, or only one modality. Our method uses the same audio-visual model for both supervised training and pseudo-label generation, mitigating the need for external speech recognition models to generate pseudo-labels. Finally, using visual-only speech data, our method is able to leverage unlabeled visual speech to improve VSR.

Machine learning has enabled rapid advancement in fields such as speech processing. However, speech processing requires large amounts of labeled data to work well (Radford et al., 2023; Zheng et al., 2022), which is hard to acquire for the thousands of languages spoken worldwide. Semi-supervised learning aims to mitigate this challenge by using unlabeled data to learn better representations and improve performance on labeled data. Real-world unlabeled data is often multi-modal, for example, videos containing synchronized audio and visual information. In this work, we investigate whether we can use such multi-modal data in a semi-supervised pipeline to improve performance on labeled data. Multi-modal data has an additional benefit: modalities can be complementary to each other and provide cross-modal supervision, which influences our algorithm design.

In this work, we study audio-visual speech as multi-modal data with synchronized audio and visual input sequences. Using only the audio or the video data, we can perform two kinds of speech recognition: automatic speech recognition (ASR) from the audio channel, or visual speech recognition (VSR) from the video channel (lip-reading). However, these modalities require substantially different amounts of labeled data for training practical models. For example, with 30 hours of labeled data, we can train an ASR model which reaches around 11% word error rate (WER), while training modern end-to-end VSR models on the same amount of data is challenging: the lowest WER we achieve in our experiments is 96%.
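The continuous pseudo-labeling loop described above can be sketched in a few lines of PyTorch. This is a minimal illustration only: `AVSRModel`, the two dataloaders, `ctc_loss_fn`, and the `decode` method are hypothetical placeholders, and the actual AV-CPL recipe (augmentation, modality dropout, decoding strategy) is more involved.

```python
# Minimal sketch of continuous pseudo-labeling (CPL) for AVSR.
# AVSRModel, the dataloaders, ctc_loss_fn, and model.decode are hypothetical
# placeholders standing in for the paper's actual components.
import torch

model = AVSRModel()  # joint audio-visual encoder with a CTC head (assumed)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for (audio, video, text), (u_audio, u_video) in zip(labeled_loader, unlabeled_loader):
    # 1) Supervised loss on labeled audio-visual clips.
    loss = ctc_loss_fn(model(audio, video), text)

    # 2) The same model regenerates pseudo-labels for unlabeled clips as it
    #    trains -- no external ASR model is needed.
    with torch.no_grad():
        model.eval()
        pseudo_text = model.decode(u_audio, u_video)  # greedy/beam decode (assumed API)
        model.train()

    # 3) Pseudo-label loss on a fresh forward pass through the training model.
    loss = loss + ctc_loss_fn(model(u_audio, u_video), pseudo_text)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because the pseudo-labels come from the training model itself and are regenerated as training proceeds, they improve as the model improves, which is what makes the pseudo-labeling "continuous".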


Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech Pre-Training for Adaptation to Unseen Languages

arXiv.org Artificial Intelligence

Recent models such as XLS-R and Whisper have made multilingual speech technologies more accessible by pre-training on audio from around 100 spoken languages each. However, there are thousands of spoken languages worldwide, and adapting to new languages is an important problem. In this work, we aim to understand which model adapts better to languages unseen during pre-training. We fine-tune both models on 13 unseen languages and 18 seen languages. Our results show that the number of hours seen per language and language family during pre-training is predictive of how the models compare, despite the significant differences in the pre-training methods.
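As a rough illustration of the setup (not the paper's exact fine-tuning recipe), both model families can be loaded for adaptation with HuggingFace Transformers. The checkpoint IDs below are real hub names; the vocabulary size for an unseen language is an assumed placeholder.

```python
# Illustrative setup only -- not the paper's exact fine-tuning recipe.
from transformers import (
    WhisperForConditionalGeneration,
    WhisperProcessor,
    Wav2Vec2ForCTC,
)

# Weakly-supervised encoder-decoder: the whole seq2seq model is fine-tuned.
whisper = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
whisper_processor = WhisperProcessor.from_pretrained("openai/whisper-small")

# Self-supervised encoder: a fresh CTC head is initialized on top, sized to the
# new language's character vocabulary (vocab_size=64 is an assumed placeholder).
xlsr = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",
    vocab_size=64,
)
```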


C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval

arXiv.org Artificial Intelligence

Multilingual text-video retrieval methods have improved significantly in recent years, but the performance for other languages lags behind English. We propose a Cross-Lingual Cross-Modal Knowledge Distillation method to improve multilingual text-video retrieval. Inspired by the fact that English text-video retrieval outperforms other languages, we train a student model using input text in different languages to match the cross-modal predictions from teacher models using input text in English. We propose a cross entropy based objective which forces the distribution over the student's text-video similarity scores to be similar to those of the teacher models. We introduce a new multilingual video dataset, Multi-YouCook2, by translating the English captions in the YouCook2 video dataset to 8 other languages. Our method improves multilingual text-video retrieval performance on Multi-YouCook2 and several other datasets such as Multi-MSRVTT and VATEX. We also conduct an analysis of the effectiveness of different multilingual text models as teachers. The code, models, and dataset are available at https://github.com/roudimit/c2kd.
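The distillation objective can be made concrete with a short PyTorch sketch: for a shared batch of videos, the student's (non-English) text-video similarity distribution is pushed toward the teacher's (English) distribution via cross entropy. The encoders producing the embeddings and the temperature value are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of a cross-entropy distillation objective over text-video similarity
# distributions, as described in the abstract. The temperature tau is an
# assumed value; embeddings are produced by encoders not shown here.
import torch
import torch.nn.functional as F

def c2kd_loss(student_text_emb, teacher_text_emb, video_emb, tau=0.05):
    """Cross entropy between teacher and student similarity distributions.

    student_text_emb: (B, D) embeddings of non-English captions
    teacher_text_emb: (B, D) embeddings of the parallel English captions
    video_emb:        (B, D) embeddings of the shared batch of videos
    """
    # Cosine-similarity logits over all videos in the batch, scaled by tau.
    s_logits = F.normalize(student_text_emb, dim=-1) @ F.normalize(video_emb, dim=-1).T / tau
    t_logits = F.normalize(teacher_text_emb, dim=-1) @ F.normalize(video_emb, dim=-1).T / tau

    # The teacher distribution is a soft target; no gradient flows into it.
    t_probs = F.softmax(t_logits.detach(), dim=-1)
    return -(t_probs * F.log_softmax(s_logits, dim=-1)).sum(dim=-1).mean()
```

Using the teacher's full distribution over batch videos as a soft target, rather than only the hard matching pair, lets the student inherit the teacher's relative ranking of all candidates.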


Label-efficient audio classification through multitask learning and self-supervision

arXiv.org Machine Learning

Published as a conference paper at ICLR 2019 (Tyler Lee, Ting Gong, Suchismita Padhy & Anthony Ndirango, Intel AI Lab, Santa Clara, CA).

While deep learning has been incredibly successful in modeling tasks with large, carefully curated labeled datasets, its application to problems with limited labeled data remains a challenge. The aim of the present work is to improve the label efficiency of large neural networks operating on audio data through a combination of multitask learning and self-supervised learning on unlabeled data. We trained an end-to-end audio feature extractor based on WaveNet that feeds into simple, yet versatile task-specific neural networks. We describe several easily implemented self-supervised learning tasks that can operate on any large, unlabeled audio corpus. We demonstrate that, in scenarios with limited labeled training data, one can significantly improve the performance of three different supervised classification tasks individually by up to 6% through simultaneous training with these additional self-supervised tasks. We also show that incorporating data augmentation into our multitask setting leads to even further gains in performance.
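The shared-trunk multitask setup lends itself to a compact sketch: one audio feature extractor feeds small task-specific heads, and a self-supervised head trained without labels shares the same trunk. The layer sizes and the toy self-supervised task below are simplified stand-ins, not the paper's actual WaveNet encoder or task set.

```python
# Simplified sketch of multitask learning with a shared audio trunk and both
# supervised and self-supervised heads. Architecture details are stand-ins.
import torch
import torch.nn as nn

class SharedTrunk(nn.Module):
    """Stand-in for the WaveNet-style feature extractor (dilated 1-D convs)."""
    def __init__(self, channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=3, dilation=1, padding=1), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, dilation=2, padding=2), nn.ReLU(),
        )
    def forward(self, wav):                 # wav: (B, 1, T)
        return self.net(wav).mean(dim=-1)   # (B, channels) pooled features

trunk = SharedTrunk()
heads = nn.ModuleDict({
    "speaker_id": nn.Linear(64, 100),  # supervised head (100 classes, assumed)
    "next_frame": nn.Linear(64, 64),   # self-supervised head (toy task)
})

def multitask_loss(wav, labels):
    # Supervised task on the full clip.
    feats = trunk(wav)
    loss = nn.functional.cross_entropy(heads["speaker_id"](feats), labels)
    # Self-supervised task needs no labels: predict pooled features of the
    # second half of the clip from the first half (toy "future prediction").
    half = wav.shape[-1] // 2
    past, future = trunk(wav[..., :half]), trunk(wav[..., half:]).detach()
    loss = loss + nn.functional.mse_loss(heads["next_frame"](past), future)
    return loss
```

In the low-label regime, the self-supervised terms supply extra gradient signal to the shared trunk from unlabeled audio, which is the mechanism behind the reported gains.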