Maheshwari, Ayush
ARISE: Iterative Rule Induction and Synthetic Data Generation for Text Classification
M., Yashwanth, Singh, Vaibhav, Maheshwari, Ayush, Krishna, Amrith, Ramakrishnan, Ganesh
We propose ARISE, a framework that iteratively induces rules and generates synthetic data for text classification. We combine synthetic data generation and automatic rule induction, via bootstrapping, to iteratively filter the generated rules and data. We induce rules via inductive generalisation of syntactic n-grams, enabling us to capture a complementary source of supervision. These rules alone lead to performance gains in both in-context learning (ICL) and fine-tuning (FT) settings. Similarly, using the augmented data from ARISE alone improves model performance, outperforming configurations that rely on complex methods such as contrastive learning. Further, extensive experiments on various datasets, covering three full-shot, eight few-shot, and seven multilingual-variant settings, demonstrate that the rules and data we generate lead to performance improvements across these diverse domains and languages.
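As a rough illustration of the bootstrapping loop described above, the following Python sketch induces high-precision n-gram rules from a small labelled seed set, pseudo-labels unlabelled text with them, and feeds the confident predictions back into the next round. It uses plain surface n-grams and a majority vote for brevity; ARISE itself generalises syntactic n-grams and also filters synthetic data, so the function names and thresholds here are illustrative assumptions, not the paper's algorithm.

```python
from collections import Counter, defaultdict

def induce_ngram_rules(examples, n=2, min_precision=0.8, min_support=3):
    """Induce (n-gram -> label) rules from labelled (text, label) pairs.

    Illustrative simplification: ARISE generalises *syntactic* n-grams;
    here we use plain surface n-grams for brevity.
    """
    hits, support = defaultdict(Counter), Counter()
    for text, label in examples:
        toks = text.lower().split()
        for i in range(len(toks) - n + 1):
            gram = tuple(toks[i:i + n])
            hits[gram][label] += 1
            support[gram] += 1
    rules = {}
    for gram, counts in hits.items():
        label, freq = counts.most_common(1)[0]
        if support[gram] >= min_support and freq / support[gram] >= min_precision:
            rules[gram] = label
    return rules

def apply_rules(rules, text, n=2):
    """Return the majority label among firing rules, or None if none fire."""
    toks = text.lower().split()
    votes = Counter(rules[tuple(toks[i:i + n])]
                    for i in range(len(toks) - n + 1)
                    if tuple(toks[i:i + n]) in rules)
    return votes.most_common(1)[0][0] if votes else None

def bootstrap(seed, unlabeled, rounds=3):
    """Alternate rule induction and pseudo-labelling of unlabelled/synthetic text."""
    labelled = list(seed)
    for _ in range(rounds):
        rules = induce_ngram_rules(labelled)
        newly = [(t, apply_rules(rules, t)) for t in unlabeled]
        labelled += [(t, y) for t, y in newly if y is not None]
        unlabeled = [t for t, y in newly if y is None]
    return rules, labelled
```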
Enhancing Low-Resource NMT with a Multilingual Encoder and Knowledge Distillation: A Case Study
Roy, Aniruddha, Ray, Pretam, Maheshwari, Ayush, Sarkar, Sudeshna, Goyal, Pawan
Neural Machine Translation (NMT) remains a formidable challenge, especially for low-resource languages. Pre-trained sequence-to-sequence (seq2seq) multilingual models, such as mBART-50, have demonstrated impressive performance on various low-resource NMT tasks. However, their pre-training has been confined to 50 languages, leaving out numerous low-resource languages, particularly those spoken in the Indian subcontinent. Expanding mBART-50's language support requires complex pre-training and risks performance decline due to catastrophic forgetting. In light of these challenges, this paper explores a framework that leverages the benefits of a pre-trained language model along with knowledge distillation in a seq2seq architecture to facilitate translation for low-resource languages, including those not covered by mBART-50. The proposed framework employs a multilingual encoder-based seq2seq model as the foundational architecture and subsequently applies complementary knowledge distillation techniques to mitigate the impact of imbalanced training. We evaluate the framework on three low-resource Indic languages in four Indic-to-Indic directions, obtaining significant BLEU-4 and chrF improvements over baselines. Further, we conduct a human evaluation to confirm the effectiveness of our approach. Our code is publicly available at https://github.com/raypretam/Two-step-low-res-NMT.
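To make the distillation component concrete, here is a minimal PyTorch sketch of a token-level knowledge-distillation loss for a seq2seq student: cross-entropy on gold targets mixed with a temperature-scaled KL term toward a teacher's distribution. The function name, weighting scheme, and hyperparameters are illustrative assumptions; the paper's complementary distillation techniques are more involved.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, target_ids,
                      pad_id=0, alpha=0.5, temperature=2.0):
    """Token-level knowledge distillation for seq2seq NMT (generic sketch)."""
    vocab = student_logits.size(-1)
    # Hard-label cross-entropy against the gold target tokens.
    ce = F.cross_entropy(student_logits.view(-1, vocab),
                         target_ids.view(-1), ignore_index=pad_id)
    # Soft targets from the teacher, softened by the temperature.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="none",
    ).sum(-1)
    # Average the KL term over non-padding positions only.
    mask = (target_ids != pad_id).float()
    kl = (kl * mask).sum() / mask.sum().clamp(min=1.0)
    return (1 - alpha) * ce + alpha * (temperature ** 2) * kl
```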
FAIR: Filtering of Automatically Induced Rules
Bajpai, Divya Jyoti, Maheshwari, Ayush, Hanawal, Manjesh Kumar, Ramakrishnan, Ganesh
The availability of large annotated data can be a critical bottleneck in training machine learning algorithms successfully, especially when applied to diverse domains. Weak supervision offers a promising alternative by accelerating the creation of labeled training data using domain-specific rules. However, it requires users to write a diverse set of high-quality rules to assign labels to the unlabeled data. Automatic Rule Induction (ARI) approaches circumvent this problem by automatically creating rules from features of a small labeled set and then filtering a final set of rules from them. The crucial step in ARI is to filter a high-quality, useful subset of rules from the large set of automatically created rules. In this paper, we propose FAIR (Filtering of Automatically Induced Rules), an algorithm that filters rules from a large number of automatically induced rules using submodular objective functions that account for the collective precision, coverage, and conflicts of the rule set. We experiment with three ARI approaches and five text classification datasets to validate the superior performance of our algorithm with respect to several semi-supervised label aggregation approaches. Further, we show that FAIR achieves statistically significant improvements over existing rule-filtering approaches.
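The following Python sketch shows greedy selection under a submodular-style objective that trades off marginal coverage, rule precision, and conflicts with already-selected rules, which is the general shape of the filtering step described above. The particular objective, weights, and data structures are assumptions for illustration, not FAIR's exact formulation.

```python
def greedy_rule_selection(rules, budget, coverage, precision, conflicts,
                          lam_prec=1.0, lam_conf=1.0):
    """Greedy maximisation of a submodular-style rule-quality objective.

    coverage[r]: set of unlabeled instances rule r fires on.
    precision[r]: accuracy of rule r on a small labelled set.
    conflicts[r]: dict mapping rule s -> number of instances where r and s disagree.
    This is a generic sketch of submodular rule filtering, not FAIR's objective.
    """
    selected, covered = [], set()
    for _ in range(budget):
        best, best_gain = None, float("-inf")
        for r in rules:
            if r in selected:
                continue
            gain = (len(coverage[r] - covered)               # marginal coverage
                    + lam_prec * precision[r]                # reward precise rules
                    - lam_conf * sum(conflicts[r].get(s, 0)  # penalise disagreement
                                     for s in selected))
            if gain > best_gain:
                best, best_gain = r, gain
        if best is None or best_gain <= 0:
            break
        selected.append(best)
        covered |= coverage[best]
    return selected
```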
LexGen: Domain-aware Multilingual Lexicon Generation
NJ, Karthika, Maheshwari, Ayush, Singh, Atul Kumar, Jyothi, Preethi, Ramakrishnan, Ganesh, Bhatt, Krishnakant
Lexicon or dictionary generation across domains has significant societal value, as it can enhance information accessibility for a diverse user base while preserving language identity. Prior work in this area primarily focuses on bilingual lexical induction, which handles word alignment using mapping-based or corpora-based approaches. Research on lexicon generation itself remains limited, and even more so for domain-specific lexicons. The task is particularly important in atypical domains such as medicine, engineering, and other technical fields, owing to the highly infrequent usage of such terms and the negligibly low availability of data for technical terms in many low-resource languages. To address this gap, we propose a new model that generates dictionary words for six Indian languages in a multi-domain setting. Our model consists of domain-specific and domain-generic layers that encode information, and these layers are invoked via a learnable routing technique. Further, we propose an approach to explicitly leverage the relatedness between these Indian languages for coherent translation. We also release a new benchmark dataset across six Indian languages spanning eight diverse domains, which can propel further research in domain-specific lexicon induction. We conduct both zero-shot and few-shot experiments across multiple domains to show the efficacy of our proposed model in generalizing to unseen domains and unseen languages.
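A minimal PyTorch sketch of the routing idea follows, assuming a shared domain-generic linear layer, one linear layer per domain, and a softmax router that mixes their outputs per input. The dimensions, residual connection, and all names are illustrative choices rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class DomainRoutedLayer(nn.Module):
    """Mix a domain-generic layer with domain-specific layers via a learned router."""

    def __init__(self, dim, num_domains):
        super().__init__()
        self.generic = nn.Linear(dim, dim)                      # shared across domains
        self.specific = nn.ModuleList([nn.Linear(dim, dim)
                                       for _ in range(num_domains)])
        self.router = nn.Linear(dim, num_domains + 1)           # weight for generic + each domain

    def forward(self, hidden):                                  # hidden: (batch, dim)
        weights = torch.softmax(self.router(hidden), dim=-1)    # (batch, num_domains + 1)
        experts = [self.generic(hidden)] + [layer(hidden) for layer in self.specific]
        stacked = torch.stack(experts, dim=-1)                  # (batch, dim, num_domains + 1)
        mixed = (stacked * weights.unsqueeze(1)).sum(-1)        # convex mix of expert outputs
        return hidden + torch.relu(mixed)                       # residual connection
```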
Sāmayik: A Benchmark and Dataset for English-Sanskrit Translation
Maheshwari, Ayush, Gupta, Ashim, Krishna, Amrith, Ramakrishnan, Ganesh, Kumar, G. Anil, Singla, Jitin
Sanskrit is a low-resource language with a rich heritage. Digitized Sanskrit corpora that reflect the contemporary usage of Sanskrit, especially in prose, are heavily under-represented at present, and no such English-Sanskrit parallel dataset is publicly available. We release Sāmayik, a dataset of more than 42,000 parallel English-Sanskrit sentences drawn from four different corpora, to bridge this gap. Moreover, we release benchmarks adapted from existing multilingual pretrained models for Sanskrit-English translation. Our training splits include our contemporary dataset as well as the Sanskrit-English parallel sentences from the training split of Itihāsa, a previously released machine translation dataset of classical-era Sanskrit.
DICTDIS: Dictionary Constrained Disambiguation for Improved NMT
Maheshwari, Ayush, Sharma, Piyush, Jyothi, Preethi, Ramakrishnan, Ganesh
Domain-specific neural machine translation (NMT) systems (e.g., in educational applications) are socially significant, with the potential to make information accessible to a diverse set of users in multilingual societies. It is desirable that such NMT systems be lexically constrained and draw from domain-specific dictionaries. Dictionaries can present multiple candidate translations for a source word or phrase because of the polysemous nature of words, and the onus is then on the NMT model to choose the contextually most appropriate candidate. Prior work has largely ignored this problem and focused on the single-candidate constraint setting, wherein the target word or phrase is replaced by a single constraint. In this work, we present DICTDIS, a lexically constrained NMT system that disambiguates between multiple candidate translations derived from dictionaries. We achieve this by augmenting training data with multiple dictionary candidates, actively encouraging disambiguation during training by implicitly aligning multiple candidate constraints. We demonstrate the utility of DICTDIS via extensive experiments on English-Hindi and English-German sentences in a variety of domains, including regulatory, finance, and engineering, and also present comparisons on standard benchmark test datasets. Compared with existing approaches for lexically constrained and unconstrained NMT, we demonstrate superior performance on constraint-copy and disambiguation-related measures across all domains, while also improving fluency by up to 2-3 BLEU points on some domains.
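To illustrate the candidate-augmentation idea, the sketch below inlines every dictionary candidate for a source word into the input using hypothetical <c> ... </c> markers, leaving the model to disambiguate from context. The markup, function name, and example dictionary are assumptions; the paper's actual tagging and alignment procedure may differ.

```python
def add_candidate_constraints(src_tokens, dictionary, sep="|",
                              open_tag="<c>", close_tag="</c>"):
    """Inline all dictionary candidates for each source word (illustrative markup)."""
    out = []
    for tok in src_tokens:
        candidates = dictionary.get(tok.lower())
        if candidates:
            # <c> word | cand1 | cand2 </c> keeps every candidate in the input.
            out += [open_tag, tok, *sum(([sep, c] for c in candidates), []), close_tag]
        else:
            out.append(tok)
    return out

# The polysemous word "current" keeps both domain candidates; the NMT model
# must pick the contextually appropriate one at decoding time.
dictionary = {"current": ["धारा", "वर्तमान"]}
print(" ".join(add_candidate_constraints("the current flows".split(), dictionary)))
# the <c> current | धारा | वर्तमान </c> flows
```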
Adaptive Mixing of Auxiliary Losses in Supervised Learning
Sivasubramanian, Durga, Maheshwari, Ayush, Shenoy, Pradeep, AP, Prathosh, Ramakrishnan, Ganesh
In several supervised learning scenarios, auxiliary losses are used to introduce additional information or constraints into the supervised learning objective. For instance, knowledge distillation aims to mimic the outputs of a powerful teacher model; similarly, in rule-based approaches, weak labeling information is provided by labeling functions, which may be noisy rule-based approximations of the true labels. We tackle the problem of learning to combine these losses in a principled manner. Our proposal, AMAL, uses a bi-level optimization criterion on validation data to learn optimal mixing weights, at an instance level, over the training data. We describe a meta-learning approach to solving this bi-level objective and show how it can be applied to different scenarios in supervised learning. Experiments in a number of knowledge distillation and rule-denoising settings show that AMAL provides noticeable gains over competitive baselines. We empirically analyze our method and share insights into the mechanisms through which it provides these gains.
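A toy PyTorch sketch of the bi-level criterion, assuming a one-parameter linear model: the inner step updates the model on a per-instance mix of supervised and auxiliary losses, and the outer step updates the per-instance mixing weights by back-propagating the validation loss through that inner update. The learning rates, sigmoid parameterisation, and regression losses are illustrative simplifications of AMAL's meta-learning procedure, not the authors' implementation.

```python
import torch

def mixed_loss(w, sup, aux):
    """Per-instance convex mix of the supervised and auxiliary losses."""
    lam = torch.sigmoid(w)                       # one mixing weight per training instance
    return ((1 - lam) * sup + lam * aux).mean()

def amal_style_step(theta, w, train_batch, val_batch, lr_inner=0.1, lr_outer=0.1):
    """One bi-level update for a linear model y = x @ theta (toy illustration)."""
    x, y, y_aux = train_batch                    # y_aux: noisy/teacher targets
    sup = (x @ theta - y) ** 2                   # per-instance supervised loss
    aux = (x @ theta - y_aux) ** 2               # per-instance auxiliary loss
    inner = mixed_loss(w, sup, aux)
    grad_theta, = torch.autograd.grad(inner, theta, create_graph=True)
    theta_new = theta - lr_inner * grad_theta    # differentiable inner update

    xv, yv = val_batch
    val_loss = ((xv @ theta_new - yv) ** 2).mean()
    grad_w, = torch.autograd.grad(val_loss, w)   # gradient flows through the inner step
    with torch.no_grad():
        w -= lr_outer * grad_w
    return theta_new.detach().requires_grad_(), w, val_loss.item()

# Toy usage: 8 training and 4 validation points, auxiliary targets are noisy.
torch.manual_seed(0)
theta = torch.randn(1, requires_grad=True)
w = torch.zeros(8, requires_grad=True)
x, xv = torch.randn(8, 1), torch.randn(4, 1)
train = (x, (2 * x).squeeze(1), (2 * x).squeeze(1) + torch.randn(8))
val = (xv, (2 * xv).squeeze(1))
for _ in range(50):
    theta, w, vloss = amal_style_step(theta, w, train, val)
```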
UDAAN: Machine Learning based Post-Editing tool for Document Translation
Maheshwari, Ayush, Ravindran, Ajay, Subramanian, Venkatapathy, Ramakrishnan, Ganesh
We introduce UDAAN, an open-source post-editing tool that reduces manual editing effort to quickly produce publishable-standard documents in several Indic languages. UDAAN has an end-to-end Machine Translation (MT) plus post-editing pipeline wherein users can upload a document to obtain raw MT output and then edit the raw translations using our tool. UDAAN offers several advantages: a) domain-aware, vocabulary-based, lexically constrained MT; b) source-target and target-target lexicon suggestions, with replacements based on lexicon alignment between the source and target texts; c) translation suggestions based on logs created during user interaction; d) source-target sentence-alignment visualisation that reduces the cognitive load on users during editing; and e) translated outputs in multiple formats: docs, LaTeX, and PDF. We also provide around 100 in-domain dictionaries for lexicon-aware machine translation. Although we limit our experiments to English-to-Hindi translation, our tool is independent of the source and target languages. Experimental results based on tool usage and user feedback show that our tool speeds up translation by approximately a factor of three compared to the baseline of translating documents from scratch. The tool is available for both Windows and Linux platforms and is open-source under the MIT license; the source code can be accessed from our website at https://www.udaanproject.org. Demonstration and tutorial videos for various features of our tool are available at https://www.youtube.com/channel/UClfK7iC8J7b22bj3GwAUaCw, and our MT pipeline can be accessed at https://udaaniitb.aicte-india.org/udaan/translate/.
Data Programming using Semi-Supervision and Subset Selection
Maheshwari, Ayush, Chatterjee, Oishik, Killamsetty, KrishnaTeja, Iyer, Rishabh, Ramakrishnan, Ganesh
The paradigm of data programming (Bach et al., 2019) has shown a lot of promise in using weak supervision, in the form of rules and labelling functions, to learn in scenarios where labelled data is not available. Another promising approach is semi-supervised learning, where small amounts of labelled data are augmented with a large unlabelled dataset. In this work, we argue that by not using any labelled data, data programming based approaches can yield sub-optimal performance, particularly when the labelling functions are noisy. The first contribution of this work is a joint learning framework that combines unsupervised consensus from labelling functions with semi-supervised learning and jointly learns a model to efficiently use the rules/labelling functions along with semi-supervised loss functions on the feature space. Next, we study a subset selection approach to select the set of examples that can serve as the labelled set. We evaluate our techniques on synthetic data as well as four publicly available datasets and show improvements over state-of-the-art techniques. The source code is available at https://github.com/ayushbits/Semi-Supervised-LFs-Subset-Selection.
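As a simplified illustration of the joint objective, the PyTorch sketch below combines a supervised cross-entropy term on the (possibly subset-selected) labelled examples with a consensus term on unlabelled examples pseudo-labelled by majority vote over the labelling functions. The aggregation, weighting, and all names are assumptions; the paper learns the consensus jointly with the model rather than fixing it to a majority vote.

```python
import torch
import torch.nn.functional as F

def joint_loss(model, labelled_x, labelled_y, unlabelled_x, lf_votes,
               num_classes, gamma=0.5):
    """Joint objective over a small labelled set and LF-labelled unlabelled data.

    lf_votes: (num_unlabelled, num_LFs) tensor of labels, with -1 where a
    labelling function abstains (abstentions never match a class index below).
    """
    # Supervised loss on the (possibly subset-selected) labelled examples.
    sup = F.cross_entropy(model(labelled_x), labelled_y)

    # Unsupervised consensus: majority vote over non-abstaining labelling functions.
    vote_counts = torch.zeros(lf_votes.size(0), num_classes)
    for c in range(num_classes):
        vote_counts[:, c] = (lf_votes == c).sum(dim=1).float()
    fired = vote_counts.sum(dim=1) > 0           # at least one LF fired
    pseudo = vote_counts.argmax(dim=1)

    unsup = F.cross_entropy(model(unlabelled_x[fired]), pseudo[fired]) if fired.any() \
        else torch.tensor(0.0)
    return sup + gamma * unsup
```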