AITopics | Doshi, Meet

How effective is Multi-source pivoting for Translation of Low Resource Indian Languages?

Gaikwad, Pranav, Doshi, Meet, Dabre, Raj, Bhattacharyya, Pushpak

arXiv.org Artificial Intelligence, Jun-19-2024

Machine Translation (MT) between linguistically dissimilar languages is challenging, especially due to the scarcity of parallel corpora. Prior works suggest that pivoting through a high-resource language can help translation into a related low-resource language. However, existing works tend to discard the source sentence when pivoting. Taking the case of English to Indian language MT, this paper explores the 'multi-source translation' approach with pivoting, using both source and pivot sentences to improve translation. We conducted extensive experiments with various multi-source techniques for translating English to Konkani, Manipuri, Sanskrit, and Bodo, using Hindi, Marathi, and Bengali as pivot languages. We find that multi-source pivoting yields marginal improvements over the state of the art, contrary to previous claims, but these improvements can be enhanced with synthetic target language data. We believe multi-source pivoting is a promising direction for low-resource translation.
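The core idea in the abstract above, feeding the model both the source sentence and its pivot translation rather than the pivot alone, could be sketched as simple input concatenation. This is only an illustration under stated assumptions, not the authors' implementation: the separator token `<pivot>`, the names `MultiSourceExample`, `build_multi_source_input`, and `build_pivot_only_input`, and the example sentence pair are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical separator token; the abstract does not specify an input format.
PIVOT_SEP = "<pivot>"


@dataclass
class MultiSourceExample:
    source: str  # English source sentence
    pivot: str   # the same sentence translated into a pivot language (e.g. Hindi)


def build_multi_source_input(example: MultiSourceExample) -> str:
    """Keep both source and pivot in a single model input string."""
    return f"{example.source.strip()} {PIVOT_SEP} {example.pivot.strip()}"


def build_pivot_only_input(example: MultiSourceExample) -> str:
    """Plain pivoting for comparison: the source sentence is discarded."""
    return example.pivot.strip()


example = MultiSourceExample(source="This is a book. ", pivot="यह एक किताब है।")
print(build_multi_source_input(example))
print(build_pivot_only_input(example))
```

Concatenation is just one way to combine the two inputs; the abstract mentions "various multi-source techniques" without listing them.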

artificial intelligence, machine translation, natural language

arXiv: 2406.13332

Country:

Europe (1.00)
North America > United States > Texas (0.14)
North America > United States > California (0.14)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Do Not Worry if You Do Not Have Data: Building Pretrained Language Models Using Translationese

Doshi, Meet, Dabre, Raj, Bhattacharyya, Pushpak

arXiv.org Artificial Intelligence, Mar-21-2024

In this paper, we explore the utility of Translationese as synthetic data created using machine translation for pre-training language models (LMs). Pre-training requires vast amounts of monolingual data, which is mostly unavailable for languages other than English. Recently, there has been a growing interest in using synthetic data to address this data scarcity. We take the case of English and Indic languages and translate web-crawled monolingual documents (clean) into the target language. Then, we train language models containing 28M and 85M parameters on this translationese data (synthetic). We show that their downstream performance is only 3.56% poorer on natural language understanding (NLU) tasks and 1.51% poorer on natural language generation (NLG) tasks than that of LMs pre-trained on clean data. Further, we propose the use of lightweight TinyLMs pre-trained on clean data to filter synthetic data efficiently, which significantly improves the performance of our models. We also find that LMs trained on synthetic data strongly benefit from extended pre-training on a tiny fraction (10%) of clean data. We release the data we collected and created as a part of this work, IndicMonoDoc, the largest collection of monolingual document-level corpora, which we hope will help bridge the gap between English and non-English performance for large language models.
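The filtering step mentioned in the abstract above, where a small LM trained on clean data screens synthetic documents, might look roughly like the following sketch. Several things here are assumptions rather than details from the paper: the abstract does not say how documents are scored, so perplexity with a fixed threshold is assumed; a smoothed unigram model stands in for a TinyLM; and the names `UnigramLM` and `filter_synthetic` and the toy documents are hypothetical.

```python
import math
from collections import Counter


class UnigramLM:
    """Toy stand-in for a TinyLM: add-one smoothed unigram model fit on clean text."""

    def __init__(self, clean_docs):
        self.counts = Counter(tok for doc in clean_docs for tok in doc.split())
        self.total = sum(self.counts.values())
        # One extra vocabulary slot reserves probability mass for unseen tokens.
        self.vocab_size = len(self.counts) + 1

    def perplexity(self, doc):
        tokens = doc.split()
        if not tokens:
            return float("inf")
        log_prob = sum(
            math.log((self.counts[tok] + 1) / (self.total + self.vocab_size))
            for tok in tokens
        )
        return math.exp(-log_prob / len(tokens))


def filter_synthetic(lm, synthetic_docs, max_perplexity):
    """Keep synthetic documents the clean-data LM finds plausible (assumed criterion)."""
    return [doc for doc in synthetic_docs if lm.perplexity(doc) <= max_perplexity]


clean = ["the cat sat on the mat", "the dog sat on the rug"]
synthetic = ["the cat sat on the rug", "zxq vbn qwe rty"]

lm = UnigramLM(clean)
kept = filter_synthetic(lm, synthetic, max_perplexity=10.0)
print(kept)  # only the first synthetic document passes the threshold
```

In this toy setup the threshold of 10.0 is arbitrary; in practice it would need tuning against held-out clean data.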

artificial intelligence, computational linguistic, natural language

arXiv: 2403.13638

Country:

Asia > Middle East > UAE (0.14)
North America > United States > Texas (0.14)
North America > United States > Pennsylvania (0.14)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

PUB: A Pragmatics Understanding Benchmark for Assessing LLMs' Pragmatics Capabilities

Sravanthi, Settaluri Lakshmi, Doshi, Meet, Kalyan, Tankala Pavan, Murthy, Rudra, Bhattacharyya, Pushpak, Dabre, Raj

arXiv.org Artificial Intelligence, Jan-13-2024

LLMs have demonstrated remarkable capability for understanding semantics, but they often struggle with understanding pragmatics. To demonstrate this fact, we release a Pragmatics Understanding Benchmark (PUB) dataset consisting of fourteen tasks in four pragmatics phenomena, namely, Implicature, Presupposition, Reference, and Deixis. We curated high-quality test sets for each task, consisting of Multiple Choice Question Answers (MCQA). PUB includes a total of 28k data points, 6.1k of which have been created by us, and the rest are adapted from existing datasets. We evaluated nine models varying in the number of parameters and type of training. Our study indicates that fine-tuning for instruction-following and chat significantly enhances the pragmatics capabilities of smaller language models. However, for larger models, the base versions perform comparably with their chat-adapted counterparts. Additionally, there is a noticeable performance gap between human capabilities and model capabilities. Furthermore, unlike the consistent performance of humans across various tasks, the models demonstrate variability in their proficiency, with performance levels fluctuating due to different hints and the complexities of tasks within the same dataset. Overall, the benchmark aims to provide a comprehensive evaluation of LLMs' ability to handle real-world language tasks that require pragmatic reasoning.
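Because the benchmark described above is framed as multiple-choice question answering, scoring a model reduces to per-task accuracy. The sketch below shows one way that could be computed. It is not PUB's evaluation code: the `MCQAItem` layout, the `accuracy_by_task` helper, the `always_first` baseline, and the three sample items are hypothetical and are not drawn from the dataset.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class MCQAItem:
    task: str           # e.g. a pragmatics phenomenon such as "implicature"
    question: str
    options: List[str]
    answer_index: int   # index of the correct option


def accuracy_by_task(
    items: List[MCQAItem], predict: Callable[[MCQAItem], int]
) -> Dict[str, float]:
    """Fraction of items answered correctly, grouped by task."""
    correct: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for item in items:
        total[item.task] = total.get(item.task, 0) + 1
        if predict(item) == item.answer_index:
            correct[item.task] = correct.get(item.task, 0) + 1
    return {task: correct.get(task, 0) / n for task, n in total.items()}


def always_first(item: MCQAItem) -> int:
    """Trivial baseline standing in for a model: always choose option 0."""
    return 0


# Hypothetical items for illustration only.
items = [
    MCQAItem("implicature", "A: 'Coming to the party?' B: 'I have to work.' Is B coming?", ["No", "Yes"], 0),
    MCQAItem("implicature", "A: 'Is it cold out?' B: 'Take a coat.' Is it cold?", ["No", "Yes"], 1),
    MCQAItem("presupposition", "'Sam stopped smoking.' Did Sam smoke before?", ["No", "Yes"], 1),
]

scores = accuracy_by_task(items, always_first)
print(scores)
```

Reporting accuracy per task rather than one pooled number makes the task-to-task variability mentioned in the abstract visible.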

large language model, machine learning, natural language

arXiv: 2401.07078

Country:

Europe (0.93)
Asia (0.68)
North America > United States > Louisiana (0.14)

Genre: Research Report (1.00)

Industry: Education (0.48)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
