Goto

Collaborating Authors

 Majumdar, Somshubra


Scoring Verifiers: Evaluating Synthetic Verification in Code and Reasoning

arXiv.org Artificial Intelligence

Code verification has recently found great success as a critical component in training large scale reasoning models for coding. Synthetic techniques such as self-generated test cases and reward models provide a way to enhance code capabilities beyond predefined tests. Building on these advancements, we propose new benchmarks designed to systematically evaluate the impact of synthetic verification methods on assessing solution correctness. We introduce HE-R, HE-R+, MBPP-R, and MBPP-R+, which transform existing coding benchmarks into scoring and ranking datasets to evaluate the effectiveness of synthetic verifiers. Using these benchmarks, we analyze synthetic verification methods in standard, reasoning-based, and reward-based LLMs. Our results show that recent reasoning models significantly improve test case generation and that scaling test cases enhances verification accuracy.


Less is More: Accurate Speech Recognition & Translation without Web-Scale Data

arXiv.org Artificial Intelligence

It was observed in [6] that such long utterances harm the model convergence. We also note that this Recent advances in speech recognition and translation rely on approach may lead to significant padding in mini-batches, resulting hundreds of thousands of hours of Internet speech data. We argue in wasted computation on non-informative frames. We that state-of-the art accuracy can be reached without relying on present an alternative approach to sampling and batching that web-scale data. Canary - multilingual ASR and speech translation allows us to iterate through data twice as fast, while balancing model, outperforms current state-of-the-art models - Whisper, different languages and data sources better. We further accelerate OWSM, and Seamless-M4T on English, French, Spanish, and the training and inference by adopting a FastConformer [7] architecture German languages, while being trained on an order of magnitude and initializing the encoder from a ASR only pretrained less data than these models. Three key factors enables such dataefficient checkpoint.


Instruction Data Generation and Unsupervised Adaptation for Speech Language Models

arXiv.org Artificial Intelligence

In this paper, we propose three methods for generating synthetic samples to train and evaluate multimodal large language models capable of processing both text and speech inputs. Addressing the scarcity of samples containing both modalities, synthetic data generation emerges as a crucial strategy to enhance the performance of such systems and facilitate the modeling of cross-modal relationships between the speech and text domains. Our process employs large language models to generate textual components and text-to-speech systems to generate speech components. The proposed methods offer a practical and effective means to expand the training dataset for these models. Experimental results show progress in achieving an integrated understanding of text and speech. We also highlight the potential of using unlabeled speech data to generate synthetic samples comparable in quality to those with available transcriptions, enabling the expansion of these models to more languages.


Nemotron-4 340B Technical Report

arXiv.org Artificial Intelligence

We release the Nemotron-4 340B model family, including Nemotron-4-340B-Base, Nemotron-4-340B-Instruct, and Nemotron-4-340B-Reward. Our models are open access under the NVIDIA Open Model License Agreement, a permissive model license that allows distribution, modification, and use of the models and its outputs. These models perform competitively to open access models on a wide range of evaluation benchmarks, and were sized to fit on a single DGX H100 with 8 GPUs when deployed in FP8 precision. We believe that the community can benefit from these models in various research studies and commercial applications, especially for generating synthetic data to train smaller language models. Notably, over 98% of data used in our model alignment process is synthetically generated, showcasing the effectiveness of these models in generating synthetic data. To further support open research and facilitate model development, we are also open-sourcing the synthetic data generation pipeline used in our model alignment process.


Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition

arXiv.org Artificial Intelligence

In this paper, we propose an efficient and accurate streaming speech recognition model based on the FastConformer architecture. We adapted the FastConformer architecture for streaming applications through: (1) constraining both the look-ahead and past contexts in the encoder, and (2) introducing an activation caching mechanism to enable the non-autoregressive encoder to operate autoregressively during inference. The proposed model is thoughtfully designed in a way to eliminate the accuracy disparity between the train and inference time which is common for many streaming models. Furthermore, our proposed encoder works with various decoder configurations including Connectionist Temporal Classification (CTC) and RNN-Transducer (RNNT) decoders. Additionally, we introduced a hybrid CTC/RNNT architecture which utilizes a shared encoder with both a CTC and RNNT decoder to boost the accuracy and save computation. We evaluate the proposed model on LibriSpeech dataset and a multi-domain large scale dataset and demonstrate that it can achieve better accuracy with lower latency and inference time compared to a conventional buffered streaming model baseline. We also showed that training a model with multiple latencies can achieve better accuracy than single latency models while it enables us to support multiple latencies with a single model. Our experiments also showed the hybrid architecture would not only speedup the convergence of the CTC decoder but also improves the accuracy of streaming models compared to single decoder models.


Efficient Sequence Transduction by Jointly Predicting Tokens and Durations

arXiv.org Artificial Intelligence

This paper introduces a novel Token-and-Duration Transducer (TDT) architecture for sequence-to-sequence tasks. TDT extends conventional RNN-Transducer architectures by jointly predicting both a token and its duration, i.e. the number of input frames covered by the emitted token. This is achieved by using a joint network with two outputs which are independently normalized to generate distributions over tokens and durations. During inference, TDT models can skip input frames guided by the predicted duration output, which makes them significantly faster than conventional Transducers which process the encoder output frame by frame. TDT models achieve both better accuracy and significantly faster inference than conventional Transducers on different sequence transduction tasks. TDT models for Speech Recognition achieve better accuracy and up to 2.82X faster inference than conventional Transducers. TDT models for Speech Translation achieve an absolute gain of over 1 BLEU on the MUST-C test compared with conventional Transducers, and its inference is 2.27X faster. In Speech Intent Classification and Slot Filling tasks, TDT models improve the intent accuracy by up to over 1% (absolute) over conventional Transducers, while running up to 1.28X faster. Our implementation of the TDT model will be open-sourced with the NeMo (https://github.com/NVIDIA/NeMo) toolkit.


Insights into LSTM Fully Convolutional Networks for Time Series Classification

arXiv.org Machine Learning

Long Short Term Memory Fully Convolutional Neural Networks (LSTM-FCN) and Attention LSTM-FCN (ALSTM-FCN) have shown to achieve state-of-the-art performance on the task of classifying time series signals on the old University of California-Riverside (UCR) time series repository. However, there has been no study on why LSTM-FCN and ALSTM-FCN perform well. In this paper, we perform a series of ablation tests (3627 experiments) on LSTM-FCN and ALSTM-FCN to provide a better understanding of the model and each of its sub-module. Results from the ablation tests on ALSTM-FCN and LSTM-FCN show that the these blocks perform better when applied in a conjoined manner. Two z-normalizing techniques, z-normalizing each sample independently and z-normalizing the whole dataset, are compared using a Wilcoxson signed-rank test to show a statistical difference in performance. In addition, we provide an understanding of the impact dimension shuffle has on LSTM-FCN by comparing its performance with LSTM-FCN when no dimension shuffle is applied. Finally, we demonstrate the performance of the LSTM-FCN when the LSTM block is replaced by a GRU, basic RNN, and Dense Block.


Adversarial Attacks on Time Series

arXiv.org Machine Learning

Time series classification models have been garnering significant importance in the research community. However, not much research has been done on generating adversarial samples for these models. These adversarial samples can become a security concern. In this paper, we propose utilizing an adversarial transformation network (ATN) on a distilled model to attack various time series classification models. The proposed attack on the classification model utilizes a distilled model as a surrogate that mimics the behavior of the attacked classical time series classification models. Our proposed methodology is applied onto 1-Nearest Neighbor Dynamic Time Warping (1-NN ) DTW, a Fully Connected Network and a Fully Convolutional Network (FCN), all of which are trained on 42 University of California Riverside (UCR) datasets. In this paper, we show both models were susceptible to attacks on all 42 datasets. To the best of our knowledge, such an attack on time series classification models has never been done before. Finally, we recommend future researchers that develop time series classification models to incorporating adversarial data samples into their training data sets to improve resilience on adversarial samples and to consider model robustness as an evaluative metric.


Pathological Voice Classification Using Mel-Cepstrum Vectors and Support Vector Machine

arXiv.org Machine Learning

Abstract--Vocal disorders have affected several patients all over the world. Due to the inherent difficulty of diagnosing vocal disorders without sophisticated equipment and trained personnel, a number of patients remain undiagnosed. To alleviate the monetary cost of diagnosis, there has been a recent growth in the use of data analysis to accurately detect and diagnose individuals for a fraction of the cost. We propose a cheap, efficient and accurate model to diagnose whether a patient suffers from one of three vocal disorders on the FEMH 2018 challenge. I. INTRODUCTION The human standard of life can be severely affected by their individual pathological voice condition.


Multivariate LSTM-FCNs for Time Series Classification

arXiv.org Machine Learning

Over the past decade, multivariate time series classification has been receiving a lot of attention. We propose augmenting the existing univariate time series classification models, LSTM-FCN and ALSTM-FCN with a squeeze and excitation block to further improve performance. Our proposed models outperform most of the state of the art models while requiring minimum preprocessing. The proposed models work efficiently on various complex multivariate time series classification tasks such as activity recognition or action recognition. Furthermore, the proposed models are highly efficient at test time and small enough to deploy on memory constrained systems.