Hrinchuk, Oleksii
Anticipating Future with Large Language Model for Simultaneous Machine Translation
Ouyang, Siqi, Hrinchuk, Oleksii, Chen, Zhehuai, Lavrukhin, Vitaly, Balam, Jagadeesh, Li, Lei, Ginsburg, Boris
Simultaneous machine translation (SMT) takes streaming input utterances and incrementally produces target text. Existing SMT methods only use the partial utterance that has already arrived at the input and the generated hypothesis. Motivated by human interpreters' technique of forecasting future words before hearing them, we propose Translation by Anticipating Future (TAF), a method to improve translation quality while retaining low latency. Its core idea is to use a large language model (LLM) to predict future source words and opportunistically translate without introducing too much risk. We evaluate TAF and multiple SMT baselines on four language directions. Experiments show that TAF achieves the best translation quality-latency trade-off and outperforms the baselines by up to 5 BLEU points at the same latency (three words).
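The abstract does not say how TAF keeps the risk of acting on predicted words low. As a rough illustration only, one assumed policy is to translate several LLM-predicted continuations of the source prefix and commit just the target words on which enough of those candidate translations agree. The function name `agreed_extension`, the `threshold` parameter, and the voting rule are hypothetical and not taken from the paper; the LLM and translation calls themselves are left out.

```python
from collections import Counter


def agreed_extension(candidates, committed, threshold=0.6):
    """Extend `committed` with target words that enough candidates agree on.

    candidates: list of candidate translations (each a list of words), one per
                predicted source continuation (assumed to come from an LLM + MT model).
    committed:  target words already emitted; they are never retracted.
    threshold:  minimum fraction of all candidates that must share the next word.
    """
    out = list(committed)
    # Only candidates consistent with what was already emitted may vote.
    live = [c for c in candidates if c[:len(out)] == out]
    while live:
        votes = Counter(c[len(out)] for c in live if len(c) > len(out))
        if not votes:
            break
        word, count = votes.most_common(1)[0]
        if count / len(candidates) < threshold:
            break  # too little agreement: wait for more source input
        out.append(word)
        live = [c for c in live if len(c) >= len(out) and c[len(out) - 1] == word]
    return out
```

A higher `threshold` commits fewer words per step (lower risk, higher latency); a lower one commits more aggressively.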
EMMeTT: Efficient Multimodal Machine Translation Training
Żelasko, Piotr, Chen, Zhehuai, Wang, Mengru, Galvez, Daniel, Hrinchuk, Oleksii, Ding, Shuoyang, Hu, Ke, Balam, Jagadeesh, Lavrukhin, Vitaly, Ginsburg, Boris
A rising interest in the modality extension of foundation language models warrants discussion on the most effective, and efficient, multimodal training approach. This work focuses on neural machine translation (NMT) and proposes a joint multimodal training regime for Speech-LLMs that includes automatic speech translation (AST). We investigate two different foundation model architectures, decoder-only GPT and encoder-decoder T5, extended with Canary-1B's speech encoder. To handle joint multimodal training, we propose a novel training framework called EMMeTT. EMMeTT improves training efficiency with the following: balanced sampling across languages, datasets, and modalities; efficient sequential data iteration; and a novel 2D bucketing scheme for multimodal data, complemented by a batch size optimizer (OOMptimizer). We show that multimodal training consistently helps with both architectures. Moreover, SALM-T5 trained with EMMeTT retains the original NMT capability while outperforming AST baselines on four-language subsets of FLORES and FLEURS. The resultant Multimodal Translation Model produces strong text and speech translation results at the same time.
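The 2D bucketing idea can be pictured as grouping examples by both input length and output length, so that a padded batch wastes little space along either axis. The following is a minimal sketch under that assumption; the function names, the bin edges, and the choice to drop over-long examples are illustrative and do not reproduce the EMMeTT implementation or OOMptimizer.

```python
from bisect import bisect_left
from collections import defaultdict


def bucket_2d(examples, input_bins, output_bins):
    """Group (input_len, output_len) pairs by the smallest bins that fit them.

    input_bins / output_bins are sorted upper bounds (inclusive) per axis.
    """
    buckets = defaultdict(list)
    for in_len, out_len in examples:
        i = bisect_left(input_bins, in_len)
        j = bisect_left(output_bins, out_len)
        if i == len(input_bins) or j == len(output_bins):
            continue  # longer than the largest bin: dropped in this sketch
        buckets[(input_bins[i], output_bins[j])].append((in_len, out_len))
    return dict(buckets)


def padding_fraction(batch):
    """Fraction of input and output positions in a padded batch that are padding."""
    max_in = max(e[0] for e in batch)
    max_out = max(e[1] for e in batch)
    used = sum(e[0] + e[1] for e in batch)
    total = len(batch) * (max_in + max_out)
    return 1 - used / total
```

Because every bucket has a known maximum shape, a separate batch size can be chosen per bucket, which is the kind of decision a batch size optimizer would automate.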
Less is More: Accurate Speech Recognition & Translation without Web-Scale Data
Puvvada, Krishna C., Żelasko, Piotr, Huang, He, Hrinchuk, Oleksii, Koluguri, Nithin Rao, Dhawan, Kunal, Majumdar, Somshubra, Rastorgueva, Elena, Chen, Zhehuai, Lavrukhin, Vitaly, Balam, Jagadeesh, Ginsburg, Boris
Recent advances in speech recognition and translation rely on hundreds of thousands of hours of Internet speech data. We argue that state-of-the-art accuracy can be reached without relying on web-scale data. Canary, a multilingual ASR and speech translation model, outperforms current state-of-the-art models (Whisper, OWSM, and Seamless-M4T) on English, French, Spanish, and German, while being trained on an order of magnitude less data than these models. Three key factors enable such data-efficient ...
It was observed in [6] that such long utterances harm the model convergence. We also note that this approach may lead to significant padding in mini-batches, resulting in wasted computation on non-informative frames. We present an alternative approach to sampling and batching that allows us to iterate through data twice as fast, while balancing different languages and data sources better. We further accelerate training and inference by adopting a FastConformer [7] architecture and initializing the encoder from an ASR-only pretrained checkpoint.
SALM: Speech-augmented Language Model with In-context Learning for Speech Recognition and Translation
Chen, Zhehuai, Huang, He, Andrusenko, Andrei, Hrinchuk, Oleksii, Puvvada, Krishna C., Li, Jason, Ghosh, Subhankar, Balam, Jagadeesh, Ginsburg, Boris
We present a novel Speech Augmented Language Model (SALM) with multitask and in-context learning capabilities. SALM comprises a frozen text LLM, an audio encoder, a modality adapter module, and LoRA layers to accommodate speech input and associated task instructions. The unified SALM not only achieves performance on par with task-specific Conformer baselines for Automatic Speech Recognition (ASR) and Speech Translation (AST), but also exhibits zero-shot in-context learning capabilities, demonstrated through a keyword-boosting task for ASR and AST. Moreover, speech supervised in-context training is proposed to bridge the gap between LLM training and downstream speech tasks, which further boosts the in-context learning ability of speech-to-text models. The proposed model is open-sourced via the NeMo toolkit.
Leveraging Synthetic Targets for Machine Translation
Mittal, Sarthak, Hrinchuk, Oleksii, Kuchaiev, Oleksii
In this work, we provide a recipe for training machine translation models in a limited resource setting by leveraging synthetic target data generated using a large pre-trained model. We show that consistently across different benchmarks in bilingual, multilingual, and speech translation setups, training models on synthetic targets outperforms training on the actual ground-truth data. This performance gap grows as the available resources, in the form of dataset size and the number of model parameters, become more limited. We also provide a preliminary analysis of whether this boost in performance is linked to ease of optimization or the more deterministic nature of the predictions, and whether this paradigm leads to better out-of-distribution performance across different testing domains.
Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks
Ginsburg, Boris, Castonguay, Patrice, Hrinchuk, Oleksii, Kuchaiev, Oleksii, Lavrukhin, Vitaly, Leary, Ryan, Li, Jason, Nguyen, Huyen, Cohen, Jonathan M.
We propose NovoGrad, a first-order stochastic gradient method with layer-wise gradient normalization via second moment estimators and with decoupled weight decay for better regularization. The method requires half as much memory as Adam/AdamW. We evaluated NovoGrad on a diverse set of problems, including image classification, speech recognition, neural machine translation, and language modeling. On these problems, NovoGrad performed equal to or better than SGD and Adam/AdamW. Empirically we show that NovoGrad (1) is very robust during the initial training phase and does not require learning rate warm-up, (2) works well with the same learning rate policy for different problems, and (3) generally performs better than other optimizers for very large batch sizes.
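The update described above can be sketched in a few lines of plain Python. This is a minimal illustration of layer-wise normalization with decoupled weight decay, not the reference implementation: the function name `novograd_step`, the list-of-lists layout (one inner list per layer), and the default hyperparameter values are assumptions made for the example.

```python
import math


def novograd_step(weights, grads, state, lr=0.01, beta1=0.95, beta2=0.98,
                  weight_decay=0.0, eps=1e-8):
    """Apply one NovoGrad-style update in place; each list entry is one layer."""
    for i, (w, g) in enumerate(zip(weights, grads)):
        g_norm_sq = sum(x * x for x in g)  # squared gradient norm of the whole layer
        if i not in state:
            # First step: initialize the second moment with the current norm.
            v = g_norm_sq
            m = [0.0] * len(w)
        else:
            v_prev, m = state[i]
            v = beta2 * v_prev + (1 - beta2) * g_norm_sq
        denom = math.sqrt(v) + eps
        # Normalize the gradient per layer, then add decoupled weight decay.
        m = [beta1 * mj + (gj / denom + weight_decay * wj)
             for mj, gj, wj in zip(m, g, w)]
        state[i] = (v, m)
        for j in range(len(w)):
            w[j] -= lr * m[j]
```

The second moment `v` is a single scalar per layer rather than one value per weight, which is where the memory saving relative to Adam/AdamW comes from: only the first moment `m` has the same size as the weights.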
Catalyst.RL: A Distributed Framework for Reproducible RL Research
Kolesnikov, Sergey, Hrinchuk, Oleksii
Despite the recent progress in the field of deep reinforcement learning (RL), and arguably because of it, a large body of work remains to be done in reproducing and carefully comparing different RL algorithms. We present catalyst.RL, an open source framework for RL research with a focus on reproducibility and flexibility. Main features of our library include large-scale asynchronous distributed training, easy-to-use configuration files with the complete list of hyperparameters for the particular experiments, and efficient implementations of various RL algorithms and auxiliary tricks, such as frame stacking, n-step returns, and value distributions. To demonstrate the usefulness of our framework, we evaluate it on a range of continuous control benchmarks, as well as on the task of developing a controller to enable a physiologically-based human model with a prosthetic leg to walk and run. The latter task was introduced at the NeurIPS 2018 AI for Prosthetics Challenge, where our team took 3rd place, capitalizing on the ability of catalyst.RL to train high-quality and sample-efficient RL agents.
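Of the auxiliary tricks listed, n-step returns are easy to show in isolation. The sketch below computes them for a single finished episode; it is a generic textbook formulation, not code from catalyst.RL, and the function name and argument layout are assumptions made for the example.

```python
def n_step_returns(rewards, values, gamma=0.99, n=3):
    """n-step bootstrapped return for every step of one finished episode.

    rewards[t] is the reward received at step t.
    values[t]  is a critic's value estimate for the state at step t.
    The episode is assumed to terminate after the last reward, so no
    bootstrap value is added once the n-step window reaches the end.
    """
    T = len(rewards)
    returns = []
    for t in range(T):
        end = min(t + n, T)
        # Discounted sum of up to n real rewards.
        g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
        # Bootstrap from the value estimate if the episode continues.
        if end < T:
            g += gamma ** (end - t) * values[end]
        returns.append(g)
    return returns
```

Larger `n` relies more on observed rewards and less on the critic's estimate, trading bias for variance.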
Artificial Intelligence for Prosthetics - challenge solutions
Kidziński, Łukasz, Ong, Carmichael, Mohanty, Sharada Prasanna, Hicks, Jennifer, Carroll, Sean F., Zhou, Bo, Zeng, Hongsheng, Wang, Fan, Lian, Rongzhong, Tian, Hao, Jaśkowski, Wojciech, Andersen, Garrett, Lykkebø, Odd Rune, Toklu, Nihat Engin, Shyam, Pranav, Srivastava, Rupesh Kumar, Kolesnikov, Sergey, Hrinchuk, Oleksii, Pechenko, Anton, Ljungström, Mattias, Wang, Zhen, Hu, Xu, Hu, Zehong, Qiu, Minghui, Huang, Jun, Shpilman, Aleksei, Sosin, Ivan, Svidchenko, Oleg, Malysheva, Aleksandra, Kudenko, Daniel, Rane, Lance, Bhatt, Aditya, Wang, Zhengfei, Qi, Penghui, Yu, Zeyang, Peng, Peng, Yuan, Quan, Li, Wenxin, Tian, Yunsheng, Yang, Ruihan, Ma, Pingchuan, Khadka, Shauharda, Majumdar, Somdeb, Dwiel, Zach, Liu, Yinyin, Tumer, Evren, Watson, Jeremy, Salathé, Marcel, Levine, Sergey, Delp, Scott
In the NeurIPS 2018 Artificial Intelligence for Prosthetics challenge, participants were tasked with building a controller for a musculoskeletal model with a goal of matching a given time-varying velocity vector. Top participants were invited to describe their algorithms. In this work, we describe the challenge and present thirteen solutions that used deep reinforcement learning approaches. Many solutions use similar relaxations and heuristics, such as reward shaping, frame skipping, discretization of the action space, symmetry, and policy blending. However, each team implemented different modifications of the known algorithms by, for example, dividing the task into subtasks, learning low-level control, or by incorporating expert knowledge and using imitation learning.
Generalized Tensor Models for Recurrent Neural Networks
Khrulkov, Valentin, Hrinchuk, Oleksii, Oseledets, Ivan
Recurrent Neural Networks (RNNs) are very successful at solving challenging problems with sequential data. However, this observed efficiency is not yet entirely explained by theory. It is known that a certain class of multiplicative RNNs enjoys the property of depth efficiency: a shallow network of exponentially large width is necessary to realize the same score function as computed by such an RNN. Such networks, however, are not very often applied to real-life tasks. In this work, we attempt to reduce the gap between theory and practice by extending the theoretical analysis to RNNs which employ various nonlinearities, such as Rectified Linear Unit (ReLU), and show that they also benefit from properties of universality and depth efficiency. Our theoretical results are verified by a series of extensive computational experiments.