Thomas, Samuel
A Non-autoregressive Model for Joint STT and TTS
Sunder, Vishal, Kingsbury, Brian, Saon, George, Thomas, Samuel, Aronowitz, Hagai, Shechtman, Slava, Fosler-Lussier, Eric, Lastras, Luis
In this paper, we take a step towards jointly modeling automatic speech recognition (STT) and speech synthesis (TTS) in a fully non-autoregressive way. We develop a novel multimodal framework capable of handling the speech and text modalities as input either individually or together. Owing to its multimodal nature, the proposed model can also be trained with unpaired speech or text data. We further propose an iterative refinement strategy in which the partial hypothesis at the output is fed back to the input of the model, iteratively improving both the STT and TTS predictions. We show that our joint model can effectively perform both STT and TTS tasks, outperforming the STT-specific baseline in all tasks and performing competitively with the TTS-specific baseline across a wide range of evaluation metrics.
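As a rough illustration of the iterative refinement loop the abstract describes, here is a minimal PyTorch sketch; JointModel, its layer sizes, and the number of refinement steps are hypothetical stand-ins, not the paper's architecture.

```python
# Minimal sketch of the iterative-refinement idea (not the authors' code).
import torch
import torch.nn as nn

class JointModel(nn.Module):
    """Toy non-autoregressive stand-in: maps (speech, text) -> (text logits, mel)."""
    def __init__(self, d=64, vocab=32, n_mels=80):
        super().__init__()
        self.speech_enc = nn.Linear(n_mels, d)
        self.text_enc = nn.Embedding(vocab, d)
        self.text_head = nn.Linear(d, vocab)   # STT head
        self.mel_head = nn.Linear(d, n_mels)   # TTS head

    def forward(self, speech, text):
        h = self.speech_enc(speech) + self.text_enc(text)  # fuse modalities
        return self.text_head(h), self.mel_head(h)

model = JointModel()
speech = torch.randn(1, 100, 80)               # input speech features
text = torch.zeros(1, 100, dtype=torch.long)   # start from a blank text hypothesis

# Feed each partial hypothesis back as input, refining both outputs.
for _ in range(3):
    logits, mel = model(speech, text)
    text = logits.argmax(dim=-1)               # refined STT hypothesis
```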
Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech Pre-Training for Adaptation to Unseen Languages
Rouditchenko, Andrew, Khurana, Sameer, Thomas, Samuel, Feris, Rogerio, Karlinsky, Leonid, Kuehne, Hilde, Harwath, David, Kingsbury, Brian, Glass, James
Recent models such as XLS-R and Whisper have made multilingual speech technologies more accessible by pre-training on audio from around 100 spoken languages each. However, there are thousands of spoken languages worldwide, and adapting to new languages is an important problem. In this work, we aim to understand which model adapts better to languages unseen during pre-training. We fine-tune both models on 13 unseen languages and 18 seen languages. Our results show that the number of hours seen per language and language family during pre-training is predictive of how the models compare, despite the significant differences in the pre-training methods.
C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval
Rouditchenko, Andrew, Chuang, Yung-Sung, Shvetsova, Nina, Thomas, Samuel, Feris, Rogerio, Kingsbury, Brian, Karlinsky, Leonid, Harwath, David, Kuehne, Hilde, Glass, James
Multilingual text-video retrieval methods have improved significantly in recent years, but the performance for other languages lags behind English. We propose a Cross-Lingual Cross-Modal Knowledge Distillation method to improve multilingual text-video retrieval. Inspired by the fact that English text-video retrieval outperforms other languages, we train a student model using input text in different languages to match the cross-modal predictions from teacher models using input text in English. We propose a cross-entropy based objective which forces the distribution over the student's text-video similarity scores to be similar to those of the teacher models. We introduce a new multilingual video dataset, Multi-YouCook2, by translating the English captions in the YouCook2 video dataset to 8 other languages. Our method improves multilingual text-video retrieval performance on Multi-YouCook2 and several other datasets such as Multi-MSRVTT and VATEX. We also conduct an analysis of the effectiveness of different multilingual text models as teachers. The code, models, and dataset are available at https://github.com/roudimit/c2kd.
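A minimal sketch of the cross-entropy distillation objective described above, assuming softmax-normalized text-video similarity scores and an illustrative temperature; this is not the released code at the repository linked above.

```python
# Hedged sketch of the distillation objective; shapes and tau are assumptions.
import torch
import torch.nn.functional as F

def c2kd_loss(student_sim, teacher_sim, tau=1.0):
    """Cross-entropy between teacher and student text-video similarity
    distributions: the student (non-English text) is pushed toward the
    teacher's (English text) distribution over the videos in the batch."""
    teacher_p = F.softmax(teacher_sim / tau, dim=-1)        # target distribution
    student_logp = F.log_softmax(student_sim / tau, dim=-1)
    return -(teacher_p * student_logp).sum(dim=-1).mean()

# Example: a batch of 4 captions scored against 4 videos.
student_sim = torch.randn(4, 4)   # student scores for, e.g., German captions
teacher_sim = torch.randn(4, 4)   # teacher scores for the English captions
loss = c2kd_loss(student_sim, teacher_sim)
```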
SimplerVoice: A Key Message & Visual Description Generator System for Illiteracy
Nguyen, Minh N. B., Thomas, Samuel, Gattiker, Anne E., Kashyap, Sujatha, Varshney, Kush R.
We introduce SimplerVoice: a key message and visual description generator system to help low-literate adults navigate the information-dense world with confidence, on their own. SimplerVoice can automatically generate sensible sentences describing an unknown object, extract the semantics of the object's usage in the form of a query string, and then represent the string as multiple types of visual guidance (pictures, pictographs, etc.). We demonstrate the SimplerVoice system in a case study of generating grocery products' manuals through a mobile application. To evaluate, we conducted a user study comparing SimplerVoice's generated descriptions with the information users interpreted from other methods: the original product package and search engines' top results. SimplerVoice achieved the highest performance score: 4.82 on a 5-point mean opinion score scale. Our results show that SimplerVoice is able to provide low-literate end-users with simple yet informative components that help them understand how to use grocery products, and that the system may provide benefits in other real-world use cases.
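As a rough illustration only, the three-stage pipeline the abstract describes could be sketched as follows; every function below is a hypothetical stub, not the SimplerVoice implementation.

```python
# Illustrative sketch of the generate -> extract -> visualize pipeline.
# All functions are hypothetical stand-ins.

def generate_description(object_name: str) -> str:
    """Stage 1: generate a sensible sentence describing the unknown object."""
    return f"{object_name} is used for cooking."

def extract_query(description: str) -> str:
    """Stage 2: distill the usage semantics into a search-query string."""
    return description.lower().rstrip(".")

def retrieve_visuals(query: str) -> list[str]:
    """Stage 3: map the query to visual guidance (pictures, pictographs)."""
    return [f"picture:{query}", f"pictograph:{query}"]

visuals = retrieve_visuals(extract_query(generate_description("Olive oil")))
```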
Invariant Representations for Noisy Speech Recognition
Serdyuk, Dmitriy, Audhkhasi, Kartik, Brakel, Philémon, Ramabhadran, Bhuvana, Thomas, Samuel, Bengio, Yoshua
Modern automatic speech recognition (ASR) systems need to be robust under acoustic variability arising from environmental, speaker, channel, and recording conditions. Ensuring such robustness is a challenge for modern neural network-based ASR systems, especially when not all types of variability are seen during training. We attempt to address this problem by encouraging the neural network acoustic model to learn invariant feature representations, drawing on recent research on image generation with Generative Adversarial Networks and on domain adaptation methods that extend adversarial gradient-based training. In particular, Ganin et al. propose adversarial training for image domain adaptation, in which an intermediate representation from the main target classification network is trained to degrade the performance of a separate domain-classifier network. Our work focuses on investigating neural architectures which produce representations invariant to noise conditions for ASR. We evaluate the proposed architecture on the Aurora-4 task, a popular benchmark for noise-robust ASR, and show that our method generalizes better than standard multi-condition training, especially when only a few noise categories are seen during training.
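The adversarial training referenced above is commonly implemented with a gradient-reversal layer; below is a minimal PyTorch sketch of that idea (an assumption about the typical implementation, not the paper's code).

```python
# Gradient-reversal layer in the style of Ganin et al. (hedged sketch).
import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; multiplies gradients by -lambda on the
    backward pass, so the encoder learns to *hurt* the noise classifier."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Usage idea: features -> senone classifier (standard loss), and
# grad_reverse(features) -> noise-condition classifier (adversarial branch).
```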
Compiling Constraint Networks into Multivalued Decomposable Decision Graphs
Koriche, Frédéric (CRIL-CNRS and Université d'Artois) | Lagniez, Jean-Marie (CRIL-CNRS and Université d'Artois) | Marquis, Pierre (CRIL-CNRS and Université d'Artois) | Thomas, Samuel (CRIL-CNRS and Université d'Artois)
We present and evaluate a top-down algorithm, cn2mddg, for compiling finite-domain constraint networks (CNs) into the language MDDG of multivalued decomposable decision graphs. Though it includes Decision-DNNF as a proper subset, MDDG offers the same key tractable queries and transformations as Decision-DNNF, which makes it useful for many applications. The input of cn2mddg is a CN represented in the XCSP 2.1 format [Roussel and Lecoutre, 2009]; the output is a representation of the solutions of the CN in the language MDDG. MDDG is precisely the extension to non-Boolean domains of the language DDG [Fargier and Marquis, 2006], also known as Decision-DNNF [Oztok and Darwiche, 2014]: it is based on decomposable ∧-nodes and (multivalued) decision nodes. Similarly to Decision-DNNF, the MDDG language offers a number of tractable queries, including (possibly weighted) solution finding and counting, solution enumeration (solutions can be enumerated with polynomial delay), and optimization w.r.t. a linear objective function, as well as tractable transformations. Intensive experiments showed that our compiler cn2mddg succeeds in compiling CNs which are out of the reach of standard approaches based on a translation of the input network to CNF, followed by a compilation to Decision-DNNF. Furthermore, the sizes of the resulting compiled representations turn out to be much smaller (sometimes by several orders of magnitude).
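To make the tractable-counting claim concrete, here is a toy sketch of solution counting on a decision graph with decomposable ∧-nodes and multivalued decision nodes; the node encoding is an illustrative assumption, and this is not the cn2mddg compiler.

```python
# Why MDDG counting is tractable (hedged sketch): counts multiply at
# decomposable AND-nodes (disjoint variable sets) and add up over the
# per-value children of a decision node. Assumes a smooth graph where
# every solution path assigns all variables.

def count(node):
    """node: True (empty solution), ('and', children), or
    ('dec', var, {value: child})."""
    if node is True:
        return 1
    if node[0] == 'and':                    # decomposable conjunction
        result = 1
        for child in node[1]:
            result *= count(child)          # disjoint vars => product
        return result
    if node[0] == 'dec':                    # multivalued decision node
        return sum(count(child) for child in node[2].values())

# x ranges over {0,1,2} and, independently, y over {0,1}: 3 * 2 = 6 solutions.
g = ('and', [('dec', 'x', {0: True, 1: True, 2: True}),
             ('dec', 'y', {0: True, 1: True})])
assert count(g) == 6
```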
Knowledge Compilation for Model Counting: Affine Decision Trees
Koriche, Frédéric (CRIL-CNRS and Université d'Artois) | Lagniez, Jean-Marie (FMV, Johannes Kepler University) | Marquis, Pierre (CRIL-CNRS and Université d'Artois) | Thomas, Samuel (CRIL-CNRS and Université d'Artois)
Counting the models of a propositional formula is a key issue for a number of AI problems, but few propositional languages offer the possibility to count models efficiently. In order to fill the gap, we introduce the language EADT of (extended) affine decision trees. An extended affine decision tree is simply a tree with affine decision nodes and some specific decomposable conjunction or disjunction nodes. Unlike standard decision trees, the decision nodes of an EADT formula are labeled not by variables but by affine clauses. We study EADT and several subsets of it along the lines of the knowledge compilation map. We also describe a CNF-to-EADT compiler and present some experimental results. Those results show that the EADT compilation-based approach is competitive with (and in some cases outperforms) the model counter Cachet and the d-DNNF compilation-based approach to model counting.
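To make the notion of affine clauses concrete, here is a brute-force model counter over XOR constraints; it enumerates assignments rather than compiling to EADT, which is the paper's actual contribution, and the encoding is an illustrative assumption.

```python
# Affine clauses are XOR constraints: an assignment satisfies (idxs, b)
# iff the XOR of the listed variables equals the parity bit b.
from itertools import product

def count_models(n_vars, affine_clauses):
    """Count assignments over n_vars Boolean variables satisfying every
    affine clause, by exhaustive enumeration (exponential, for illustration)."""
    count = 0
    for assign in product([0, 1], repeat=n_vars):
        if all(sum(assign[i] for i in idxs) % 2 == b
               for idxs, b in affine_clauses):
            count += 1
    return count

# x0 XOR x1 = 1 and x2 = 0 over 3 variables: the models are 010 and 100.
assert count_models(3, [([0, 1], 1), ([2], 0)]) == 2
```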