bhattacharyya
Reconsidering SMT Over NMT for Closely Related Languages: A Case Study of Persian-Hindi Pair
Yousofi, Waisullah, Bhattacharyya, Pushpak
This paper demonstrates that Phrase-Based Statistical Machine Translation (PBSMT) can outperform Transformer-based Neural Machine Translation (NMT) in moderate-resource scenarios, specifically for structurally similar languages, like the Persian-Hindi pair. Despite the Transformer architecture's typical preference for large parallel corpora, our results show that PBSMT achieves a BLEU score of 66.32, significantly exceeding the Transformer-NMT score of 53.7 on the same dataset. Additionally, we explore variations of the SMT architecture, including training on Romanized text and modifying the word order of Persian sentences to match the left-to-right (LTR) structure of Hindi. Our findings highlight the importance of choosing the right architecture based on language pair characteristics and advocate for SMT as a high-performing alternative, even in contexts commonly dominated by NMT.
- Oceania > Australia > Victoria > Melbourne (0.04)
- North America > United States > Colorado > Denver County > Denver (0.04)
- Europe > Lithuania (0.04)
Helios: An extremely low power event-based gesture recognition for always-on smart eyewear
Bhattacharyya, Prarthana, Mitton, Joshua, Page, Ryan, Morgan, Owen, Menzies, Ben, Homewood, Gabriel, Jacobs, Kemi, Baesso, Paolo, Trickett, Dave, Mair, Chris, Muhonen, Taru, Clark, Rory, Berridge, Louis, Vigars, Richard, Wallace, Iain
This paper introduces Helios, the first extremely low-power, real-time, event-based hand gesture recognition system designed for all-day use on smart eyewear. As augmented reality (AR) evolves, current smart glasses like the Meta Ray-Bans prioritize visual and wearable comfort at the expense of functionality. Existing human-machine interfaces (HMIs) in these devices, such as capacitive touch and voice controls, present limitations in ergonomics, privacy and power consumption. Helios addresses these challenges by leveraging natural hand interactions for a more intuitive and comfortable user experience. Our system utilizes an extremely low-power and compact 3mmx4mm/20mW event camera to perform natural hand-based gesture recognition for always-on smart eyewear. The camera's output is processed by a convolutional neural network (CNN) running on an NXP Nano UltraLite compute platform, consuming less than 350mW. Helios can recognize seven classes of gestures, including subtle microgestures like swipes and pinches, with 91% accuracy. We also demonstrate real-time performance across 20 users at a remarkably low latency of 60ms. Our user testing results align with the positive feedback we received during our recent successful demo at AWE-USA-2024.
- Europe > Switzerland > Zürich > Zürich (0.14)
- North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
Autoregressive Score Generation for Multi-trait Essay Scoring
Do, Heejin, Kim, Yunsu, Lee, Gary Geunbae
Recently, encoder-only pre-trained models such as BERT have been successfully applied in automated essay scoring (AES) to predict a single overall score. However, studies have yet to explore these models in multi-trait AES, possibly due to the inefficiency of replicating BERT-based models for each trait. Breaking away from the existing sole use of the encoder, we propose an autoregressive prediction of multi-trait scores (ArTS), incorporating a decoding process by leveraging the pre-trained T5. Unlike prior regression or classification methods, we redefine AES as a score-generation task, allowing a single model to predict multiple scores. During decoding, the subsequent trait prediction can benefit by conditioning on the preceding trait scores. Experimental results proved the efficacy of ArTS, showing over 5% average improvements across both prompts and traits.
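To make the score-generation framing concrete, the following is a minimal sketch of how several trait scores could be serialized into a single target string for a text-to-text model and parsed back after decoding. The trait names, their order, and the "trait score" text format are assumptions for illustration only; the abstract does not specify the format ArTS uses, and no model is loaded here.

```python
import re

# Hypothetical trait order; the abstract does not state the actual order or format.
TRAITS = ["content", "organization", "language", "overall"]

def scores_to_target(scores):
    """Serialize trait scores into one target string for a seq2seq model."""
    return ", ".join(f"{t} {scores[t]}" for t in TRAITS if t in scores)

def target_to_scores(text):
    """Parse a generated string back into a {trait: score} dict."""
    out = {}
    for trait, value in re.findall(r"([a-z_]+) (\d+)", text):
        if trait in TRAITS:  # ignore tokens that are not known traits
            out[trait] = int(value)
    return out

# Traits missing for a given prompt are simply left out of the sequence.
target = scores_to_target({"content": 3, "organization": 4, "overall": 7})
parsed = target_to_scores(target)
```

Because the scores are emitted left to right in one sequence, a decoder producing a later trait can attend to the earlier trait scores it has already generated, which is the conditioning effect the abstract describes.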
- Oceania > Australia > Victoria > Melbourne (0.04)
- North America > United States > California > Santa Clara County > Los Gatos (0.04)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
- Education > Assessment & Standards > Student Performance (0.95)
- Education > Educational Setting (0.90)
- Education > Educational Technology > Educational Software > Computer-Aided Assessment (0.50)
IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages
Gala, Jay, Chitale, Pranjal A., AK, Raghavan, Gumma, Varun, Doddapaneni, Sumanth, Kumar, Aswanth, Nawale, Janki, Sujatha, Anupama, Puduppully, Ratish, Raghavan, Vivek, Kumar, Pratyush, Khapra, Mitesh M., Dabre, Raj, Kunchukuttan, Anoop
India has a rich linguistic landscape with languages from 4 major language families spoken by over a billion people. The 22 of these languages listed in the Constitution of India (referred to as scheduled languages) are the focus of this work. Given the linguistic diversity, high-quality and accessible Machine Translation (MT) systems are essential in a country like India. Prior to this work, there was (i) no parallel training data spanning all 22 languages, (ii) no robust benchmarks covering all these languages and containing content relevant to India, and (iii) no existing translation models which support all the 22 scheduled languages of India. In this work, we aim to address this gap by focusing on the missing pieces required for enabling wide, easy, and open access to good machine translation systems for all 22 scheduled Indian languages. We identify four key areas of improvement: curating and creating larger training datasets, creating diverse and high-quality benchmarks, training multilingual models, and releasing models with open access. Our first contribution is the release of the Bharat Parallel Corpus Collection (BPCC), the largest publicly available parallel corpora for Indic languages. BPCC contains a total of 230M bitext pairs, of which a total of 126M were newly added, including 644K manually translated sentence pairs created as part of this work. Our second contribution is the release of the first n-way parallel benchmark covering all 22 Indian languages, featuring diverse domains, Indian-origin content, and source-original test sets. Next, we present IndicTrans2, the first model to support all 22 languages, surpassing existing models on multiple existing and new benchmarks created as a part of this work. Lastly, to promote accessibility and collaboration, we release our models and associated data with permissive licenses at https://github.com/AI4Bharat/IndicTrans2.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.27)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.13)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
- Research Report > New Finding (1.00)
- Overview (1.00)
- Law (1.00)
- Education (1.00)
- Government > Regional Government > Asia Government > India Government (0.66)
- Consumer Products & Services > Travel (0.45)
APE-then-QE: Correcting then Filtering Pseudo Parallel Corpora for MT Training Data Creation
Batheja, Akshay, Deoghare, Sourabh, Kanojia, Diptesh, Bhattacharyya, Pushpak
Automatic Post-Editing (APE) is the task of automatically identifying and correcting errors in Machine Translation (MT) outputs. We propose a repair-filter-use methodology that uses an APE system to correct errors on the target side of the MT training data. We select the sentence pairs from the original and corrected sentence pairs based on the quality scores computed using a Quality Estimation (QE) model. To the best of our knowledge, this is a novel adaptation of APE and QE to extract a quality parallel corpus from the pseudo-parallel corpus. By training with this filtered corpus, we observe an improvement in the Machine Translation system's performance by 5.64 and 9.91 BLEU points, for English-Marathi and Marathi-English, over the baseline model. The baseline model is the one that is trained on the whole pseudo-parallel corpus. Our work is not limited by the characteristics of the English or Marathi languages; it is language-pair-agnostic, given the necessary QE and APE data.
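The repair and filter steps described above can be sketched as follows. This is only one plausible reading of the abstract: the `ape` and `qe` callables stand in for trained APE and QE models, and both the "keep the higher-scoring version" rule and the `threshold` cut-off are assumptions, not details confirmed by the source.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)

def repair_filter(
    pairs: List[Pair],
    ape: Callable[[str, str], str],   # hypothetical: returns a corrected target
    qe: Callable[[str, str], float],  # hypothetical: returns a quality score
    threshold: float,                 # hypothetical cut-off on the QE score
) -> List[Pair]:
    kept = []
    for src, tgt in pairs:
        fixed = ape(src, tgt)  # repair: post-edit the target side
        # Select between the original and the corrected target by QE score.
        best = max((tgt, fixed), key=lambda t: qe(src, t))
        # Filter: keep the pair only if the chosen version is good enough.
        if qe(src, best) >= threshold:
            kept.append((src, best))
    return kept

# Toy stand-ins so the sketch runs without any trained model.
toy_ape = lambda s, t: t.replace("teh", "the")
toy_qe = lambda s, t: 0.0 if not t else (0.4 if "teh" in t else 0.9)

corpus = [("a", "teh cat"), ("b", "the dog"), ("c", "")]
filtered = repair_filter(corpus, toy_ape, toy_qe, threshold=0.5)
```

The "use" step would then train the MT system on `filtered` instead of the whole pseudo-parallel corpus.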
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.05)
- Asia > India (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
"A Little is Enough": Few-Shot Quality Estimation based Corpus Filtering improves Machine Translation
Batheja, Akshay, Bhattacharyya, Pushpak
Quality Estimation (QE) is the task of evaluating the quality of a translation when a reference translation is not available. The goal of QE aligns with the task of corpus filtering, where we assign a quality score to each sentence pair in the pseudo-parallel corpus. We propose a Quality Estimation based Filtering approach to extract high-quality parallel data from the pseudo-parallel corpus. To the best of our knowledge, this is a novel adaptation of the QE framework to extract a quality parallel corpus from the pseudo-parallel corpus. By training with this filtered corpus, we observe an improvement in the Machine Translation (MT) system's performance by up to 1.8 BLEU points, for English-Marathi, Chinese-English, and Hindi-Bengali language pairs, over the baseline model. The baseline model is the one that is trained on the whole pseudo-parallel corpus. Our few-shot QE model, transferred from the English-Marathi QE model and fine-tuned on only 500 Hindi-Bengali training instances, shows an improvement of up to 0.6 BLEU points for the Hindi-Bengali language pair, compared to the baseline model. This demonstrates the promise of transfer learning in the setting under discussion. QE systems typically require on the order of 7K-25K training instances. Our Hindi-Bengali QE model is trained on only 500 instances, about 1/40th of the normal requirement, and achieves comparable performance. All the scripts and datasets utilized in this study will be publicly available.
Prompt- and Trait Relation-aware Cross-prompt Essay Trait Scoring
Do, Heejin, Kim, Yunsu, Lee, Gary Geunbae
Automated essay scoring (AES) aims to score essays written for a given prompt, which defines the writing topic. Most existing AES systems assume that the essays to be graded share a prompt with the training essays, and they assign only a holistic score. However, such settings conflict with real educational situations: pre-graded essays for a particular prompt are often lacking, and detailed trait scores for sub-rubrics are required. Thus, predicting various trait scores of unseen-prompt essays (called cross-prompt essay trait scoring) is a remaining challenge of AES. In this paper, we propose a robust model: a prompt- and trait relation-aware cross-prompt essay trait scorer. We encode a prompt-aware essay representation using essay-prompt attention and a topic-coherence feature extracted by a topic-modeling mechanism without access to labeled data; therefore, our model considers the prompt adherence of an essay, even in a cross-prompt setting. To facilitate multi-trait scoring, we design a trait-similarity loss that encapsulates the correlations of traits. Experiments prove the efficacy of our model, showing state-of-the-art results for all prompts and traits. Significant improvements in low-resource prompts and inferior traits further indicate our model's strength.
- North America > United States > New York (0.04)
- Asia > Middle East > Jordan (0.04)
Denoising-based UNMT is more robust to word-order divergence than MASS-based UNMT
Banerjee, Tamali, Murthy, Rudra V, Bhattacharyya, Pushpak
We aim to investigate whether unsupervised neural machine translation (UNMT) approaches with self-supervised pre-training are robust to word-order divergence between language pairs. We do this by comparing two models pre-trained with the same self-supervised pre-training objective. The first model is trained on language pairs with different word orders, and the second model is trained on the same language pairs with the source language reordered to match the word order of the target language. Ideally, UNMT approaches that are robust to word-order divergence should exhibit no visible performance difference between the two configurations. In this paper, we investigate two such self-supervised pre-training based UNMT approaches, namely Masked Sequence-to-Sequence Pre-Training (MASS), which does not have shuffling noise, and Denoising AutoEncoder (DAE), which has shuffling noise. We experiment with five English→Indic language pairs (en-hi, en-bn, en-gu, en-kn, and en-ta), where the word order of the source language is SVO (Subject-Verb-Object) and the word order of the target languages is SOV (Subject-Object-Verb). We observed that for these language pairs, the DAE-based UNMT approach consistently outperforms MASS in terms of translation accuracy. Moreover, bridging the word-order gap using reordering improves the translation accuracy of MASS-based UNMT models, while it cannot improve the translation accuracy of DAE-based UNMT models. This observation indicates that DAE-based UNMT is more robust to word-order divergence than MASS-based UNMT. The word-shuffling noise in the DAE approach could be the reason for its robustness to word-order divergence.
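For readers unfamiliar with shuffling noise, here is a minimal sketch of one common formulation of local word shuffling used in denoising objectives: each token index is jittered by a random offset and the tokens are re-sorted, so no token moves more than `k` positions. The function name, the window size `k`, and the choice of this particular formulation are assumptions; the abstract does not state which noise function or parameters the DAE models use.

```python
import random

def local_shuffle(tokens, k=3, rng=None):
    """Permute tokens so that each one moves at most k positions."""
    rng = rng or random.Random()
    # Jitter each index by an offset in [0, k + 1), then sort by the jittered key.
    keys = [i + rng.random() * (k + 1) for i in range(len(tokens))]
    order = sorted(range(len(tokens)), key=keys.__getitem__)
    return [tokens[i] for i in order]

sentence = "the model sees a locally scrambled version of the input".split()
noisy = local_shuffle(sentence, k=2, rng=random.Random(0))
```

A denoising autoencoder is trained to reconstruct `sentence` from `noisy`, so it repeatedly practices recovering word order, which fits the abstract's suggestion that this noise may explain the robustness to word-order divergence.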
- Asia > India (0.06)
- Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
Scientists use AI to identify nature of thousands of new cosmic objects
New Delhi: Scientists have used machine learning, a branch of artificial intelligence (AI), to identify the nature of thousands of new cosmic objects such as stars, black holes and pulsars. The researchers at Tata Institute of Fundamental Research (TIFR), Mumbai, and Indian Institute of Space Science and Technology (IIST), Thiruvananthapuram, applied machine learning techniques to hundreds of thousands of space objects observed in X-ray wavelengths (between 0.03 and 3 nanometres) with NASA's Chandra space observatory. The study, published in the journal Monthly Notices of the Royal Astronomical Society, applied the technique to about 2,77,000 X-ray objects, the nature of most of which was unknown. A classification of the nature of unknown objects is equivalent to the discovery of objects of specific classes, the researchers said. This research has thus led to a reliable discovery of many thousands of cosmic objects of classes such as black holes, neutron stars, white dwarfs and stars, and opened up an enormous opportunity for the astronomy community for further detailed studies of many interesting new objects, they said.
Concentration inequalities for correlated network-valued processes with applications to community estimation and changepoint analysis
Chatterjee, Sayak, Chatterjee, Shirshendu, Mukherjee, Soumendu Sundar, Nath, Anirban, Bhattacharyya, Sharmodeep
Network-valued time series are currently a common form of network data. However, the study of the aggregate behavior of network sequences generated from network-valued stochastic processes is relatively rare. Most of the existing research focuses on the simple setup where the networks are independent (or conditionally independent) across time, and all edges are updated synchronously at each time step. In this paper, we study the concentration properties of the aggregated adjacency matrix and the corresponding Laplacian matrix associated with network sequences generated from lazy network-valued stochastic processes, where edges update asynchronously, and each edge follows a lazy stochastic process for its updates independent of the other edges. We demonstrate the usefulness of these concentration results in proving consistency of standard estimators in community estimation and changepoint estimation problems. We also conduct a simulation study to demonstrate the effect of the laziness parameter, which controls the extent of temporal correlation, on the accuracy of community and changepoint estimation.
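As an illustration of the kind of process described above, the sketch below simulates edges that update asynchronously and returns the time-averaged (aggregated) adjacency matrix. The update rule is an assumed toy model, not the paper's definition: at each step every edge independently keeps its previous state with probability `laziness` and is otherwise redrawn as Bernoulli(`p`). All names and parameters are hypothetical.

```python
import random

def simulate_lazy_edges(n, steps, p, laziness, seed=0):
    """Return the time-averaged adjacency matrix of a lazy edge process.

    Assumed model (not taken from the paper): each edge keeps its previous
    state with probability `laziness`, else it is redrawn as Bernoulli(p).
    """
    rng = random.Random(seed)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    state = {e: rng.random() < p for e in pairs}  # initial undirected graph
    total = {e: 0 for e in pairs}
    for _ in range(steps):
        for e in pairs:
            if rng.random() >= laziness:  # edge wakes up and resamples
                state[e] = rng.random() < p
            total[e] += state[e]
    avg = [[0.0] * n for _ in range(n)]
    for (i, j), count in total.items():
        avg[i][j] = avg[j][i] = count / steps  # symmetric, zero diagonal
    return avg

A = simulate_lazy_edges(n=4, steps=50, p=0.3, laziness=0.8)
```

Under this toy rule, a `laziness` near 1 makes consecutive snapshots nearly identical, so averaging over time adds little new information, while a `laziness` of 0 gives independent snapshots.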
- Asia > India > West Bengal > Kolkata (0.04)
- North America > United States > Oregon (0.04)
- North America > United States > New York (0.04)
- Information Technology > Data Science (1.00)
- Information Technology > Communications (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.74)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)