AITopics | indicbart

RoundTripOCR: A Data Generation Technique for Enhancing Post-OCR Error Correction in Low-Resource Devanagari Languages

Kashid, Harshvivek, Bhattacharyya, Pushpak

arXiv.org Artificial Intelligence, Dec-14-2024

Optical Character Recognition (OCR) technology has revolutionized the digitization of printed text, enabling efficient data extraction and analysis across various domains. Just like Machine Translation systems, OCR systems are prone to errors. In this work, we address the challenge of data generation and post-OCR error correction, specifically for low-resource languages. We propose an approach for synthetic data generation for Devanagari languages, RoundTripOCR, that tackles the scarcity of the post-OCR Error Correction datasets for low-resource languages. We release post-OCR text correction datasets for Hindi, Marathi, Bodo, Nepali, Konkani and Sanskrit. We also present a novel approach for OCR error correction by leveraging techniques from machine translation. Our method involves translating erroneous OCR output into a corrected form by treating the OCR errors as mistranslations in a parallel text corpus, employing pre-trained transformer models to learn the mapping from erroneous to correct text pairs, effectively correcting OCR errors.
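
The round-trip idea described in the abstract above can be sketched as follows: clean text is passed through rendering and OCR, and each (OCR output, clean text) pair becomes one example in a parallel corpus for a translation-style corrector. This is a minimal sketch under stated assumptions, not the authors' pipeline. The names `toy_ocr`, `round_trip_pairs`, and the confusion table are hypothetical, and `toy_ocr` merely stands in for a real "render to image, then OCR" step.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical confusion table of visually similar Devanagari characters.
# Illustrative only; it is not taken from the paper.
CONFUSIONS = {"ब": "व", "घ": "ध", "म": "भ"}


def toy_ocr(text: str) -> str:
    """Stand-in for 'render to image, then run OCR': swaps confusable characters."""
    return "".join(CONFUSIONS.get(ch, ch) for ch in text)


def round_trip_pairs(
    sentences: Iterable[str], ocr: Callable[[str], str]
) -> List[Tuple[str, str]]:
    """Return (ocr_output, clean_text) pairs, i.e. source/target for a corrector."""
    return [(ocr(clean), clean) for clean in sentences]


pairs = round_trip_pairs(["बस", "घर", "कल"], toy_ocr)
for noisy, clean in pairs:
    print(noisy, "->", clean)
```

In a real setting the `ocr` argument would wrap an actual rendering and recognition step; passing it as a callable keeps the pair-building logic independent of any particular OCR engine.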

computational linguistic, machine learning, natural language, (21 more...)

2412.15248

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
(12 more...)

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

L3Cube-MahaSum: A Comprehensive Dataset and BART Models for Abstractive Text Summarization in Marathi

Deshmukh, Pranita, Kulkarni, Nikita, Kulkarni, Sanhita, Manghani, Kareena, Joshi, Raviraj

arXiv.org Artificial Intelligence, Oct-11-2024

We present the MahaSUM dataset, a large-scale collection of diverse news articles in Marathi, designed to facilitate the training and evaluation of models for abstractive summarization tasks in Indic languages. The dataset, containing 25k samples, was created by scraping articles from a wide range of online news sources and manually verifying the abstract summaries. Additionally, we train an IndicBART model, a variant of the BART model tailored for Indic languages, using the MahaSUM dataset. We evaluate the performance of our trained models on the task of abstractive summarization and demonstrate their effectiveness in producing high-quality summaries in Marathi. Our work contributes to the advancement of natural language processing research in Indic languages and provides a valuable resource for future research in this area using state-of-the-art models.

dataset, marathi, summarization, (13 more...)

2410.09184

Country:

Asia > India > Maharashtra > Pune (0.05)
Asia > India > Tamil Nadu > Chennai (0.04)

Genre:

Research Report > New Finding (0.46)
Research Report > Promising Solution (0.34)

Industry:

Education (0.46)
Media > News (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Automatic Data Retrieval for Cross Lingual Summarization

Bhatnagar, Nikhilesh, Urlana, Ashok, Mujadia, Vandan, Mishra, Pruthwik, Sharma, Dipti Misra

arXiv.org Artificial Intelligence, Dec-22-2023

Cross-lingual summarization involves summarizing text written in one language into a different one. There is a body of research addressing cross-lingual summarization from English to other European languages. In this work, we aim to perform cross-lingual summarization from English to Hindi. We propose that pairing up the coverage of newsworthy events in textual and video formats can be helpful for data acquisition for cross-lingual summarization. We analyze the data and propose methods to match articles to video descriptions that serve as document and summary pairs. We also outline filtering methods over reasonable thresholds to ensure the correctness of the summaries. Further, we make available 28,583 mono and cross-lingual article-summary pairs at https://github.com/tingc9/Cross-Sum-News-Aligned. We also build and analyze multiple baselines on the collected data and report error analysis.
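
The matching and threshold filtering mentioned in the abstract above might look roughly like the sketch below. The abstract does not say which similarity measure or cut-off the authors use, so everything here is an assumption: token-level Jaccard similarity, a hypothetical `threshold` of 0.2, and texts in the same language (a cross-lingual setup would need translation or multilingual embeddings first). `jaccard` and `match_pairs` are illustrative names.

```python
from typing import Dict, List, Tuple


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two whitespace-tokenized texts."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def match_pairs(
    articles: Dict[str, str],
    descriptions: Dict[str, str],
    threshold: float = 0.2,  # hypothetical cut-off, not taken from the paper
) -> List[Tuple[str, str, float]]:
    """For each description, keep its best-scoring article if it clears the threshold."""
    matches = []
    for d_id, desc in descriptions.items():
        best_id, best_score = None, 0.0
        for a_id, art in articles.items():
            score = jaccard(art, desc)
            if score > best_score:
                best_id, best_score = a_id, score
        if best_id is not None and best_score >= threshold:
            matches.append((best_id, d_id, best_score))
    return matches


articles = {
    "a1": "floods hit the coastal city after heavy rain",
    "a2": "team wins the cricket final",
}
descriptions = {
    "v1": "heavy rain floods coastal city",
    "v2": "recipe for mango pickle",
}
matches = match_pairs(articles, descriptions)
print(matches)
```

Raising the threshold trades recall for precision: fewer article-description pairs survive, but those that do are more likely to describe the same event.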

computational linguistic, cross-lingual summarization, summarization, (11 more...)

2312.14542

Country:

Asia > Sri Lanka (0.05)
Asia > India > Maharashtra > Mumbai (0.05)
North America > Dominican Republic (0.04)
(8 more...)

Genre: Research Report (0.50)

Industry: Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Sāmayik: A Benchmark and Dataset for English-Sanskrit Translation

Maheshwari, Ayush, Gupta, Ashim, Krishna, Amrith, Ramakrishnan, Ganesh, Kumar, G. Anil, Singla, Jitin

arXiv.org Artificial Intelligence, May-23-2023

Sanskrit is a low-resource language with a rich heritage. Digitized Sanskrit corpora that reflect contemporary usage, especially in prose, are heavily under-represented at present. Currently, no such English-Sanskrit parallel dataset is publicly available. We release a dataset, Sāmayik, of more than 42,000 parallel English-Sanskrit sentences from four different corpora, which aims to bridge this gap. Moreover, we also release benchmarks adapted from existing multilingual pretrained models for Sanskrit-English translation. We include training splits from our contemporary dataset and the Sanskrit-English parallel sentences from the training split of Itihāsa, a previously released classical-era machine translation dataset containing Sanskrit.

artificial intelligence, natural language, sanskrit, (17 more...)

2305.14004

Country:

Asia > India > Uttarakhand > Roorkee (0.05)
North America > United States > Utah (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
(4 more...)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.49)

Summarizing Indian Languages using Multilingual Transformers based Models

Taunk, Dhaval, Varma, Vasudeva

arXiv.org Artificial Intelligence, Mar-29-2023

With the advent of multilingual models such as mBART, mT5, and IndicBART, summarization in low-resource Indian languages is receiving a lot of attention nowadays. However, the number of available datasets is still low. In this work, we (Team HakunaMatata) study how these multilingual models perform on summarization datasets that have Indian languages as both source and target text. We experimented with the IndicBART and mT5 models and report ROUGE-1, ROUGE-2, ROUGE-3 and ROUGE-4 scores as performance metrics.
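
For readers unfamiliar with the ROUGE-N scores named in the abstract above, the sketch below computes them from clipped n-gram overlap between a candidate summary and a reference. It is a simplified illustration: it assumes whitespace tokenization and reports F1, whereas the abstract does not state which variant (recall, precision, or F1) or which tokenizer the authors used. `ngrams` and `rouge_n` are illustrative names, not a library API.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate: str, reference: str, n: int) -> float:
    """ROUGE-N F1 from clipped n-gram overlap (whitespace tokenization)."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())  # Counter '&' takes the minimum count
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat"
candidate = "the cat sat on a mat"
scores = {n: rouge_n(candidate, reference, n) for n in (1, 2, 3, 4)}
for n, score in scores.items():
    print(f"ROUGE-{n}: {score:.3f}")
```

Because longer n-grams are harder to match, the scores fall as n grows; a single substituted word in the candidate breaks every n-gram that spans it.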

large language model, machine learning, natural language, (19 more...)

2303.16657

Country:

Europe > Ireland > Leinster > County Dublin > Dublin (0.05)
North America > United States > Washington > King County > Seattle (0.04)
North America > Canada (0.04)
(5 more...)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.51)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.41)

IndicNLG Benchmark: Multilingual Datasets for Diverse NLG Tasks in Indic Languages

Kumar, Aman, Shrotriya, Himani, Sahu, Prachi, Dabre, Raj, Puduppully, Ratish, Kunchukuttan, Anoop, Mishra, Amogh, Khapra, Mitesh M., Kumar, Pratyush

arXiv.org Artificial Intelligence, Oct-26-2022

Natural Language Generation (NLG) for non-English languages is hampered by the scarcity of datasets in these languages. In this paper, we present the IndicNLG Benchmark, a collection of datasets for benchmarking NLG for 11 Indic languages. We focus on five diverse tasks, namely, biography generation using Wikipedia infoboxes, news headline generation, sentence summarization, paraphrase generation, and question generation. We describe the created datasets and use them to benchmark the performance of several monolingual and multilingual baselines that leverage pre-trained sequence-to-sequence models. Our results exhibit the strong performance of multilingual language-specific pre-trained models, and the utility of models trained on our dataset for other related NLG tasks. Our dataset creation methods can be easily applied to modest-resource languages as they involve simple steps such as scraping news articles and Wikipedia infoboxes, light cleaning, and pivoting through machine translation data. To the best of our knowledge, the IndicNLG Benchmark is the first NLG benchmark for Indic languages and the most diverse multilingual NLG dataset, with approximately 8M examples across 5 tasks and 11 languages. The datasets and models are publicly available at https://ai4bharat.iitm.ac.in/indicnlg-suite.

artificial intelligence, machine learning, natural language, (16 more...)

2203.05437

Country:

North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.14)
Asia > India > Jharkhand (0.05)
Europe > France (0.04)
(15 more...)

Genre: Research Report > New Finding (0.34)

Industry:

Leisure & Entertainment > Sports > Football (0.93)
Government (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.66)
Information Technology > Artificial Intelligence > Natural Language > Generation (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

IndicBART: A Pre-trained Model for Natural Language Generation of Indic Languages

Dabre, Raj, Shrotriya, Himani, Kunchukuttan, Anoop, Puduppully, Ratish, Khapra, Mitesh M., Kumar, Pratyush

arXiv.org Artificial Intelligence, Sep-7-2021

In this paper we present IndicBART, a multilingual, sequence-to-sequence pre-trained model focusing on 11 Indic languages and English. Different from existing pre-trained models, IndicBART utilizes the orthographic similarity between Indic scripts to improve transfer learning between similar Indic languages. We evaluate IndicBART on two NLG tasks: Neural Machine Translation (NMT) and extreme summarization. Our experiments on NMT for 12 language pairs and extreme summarization for 7 languages using multilingual fine-tuning show that IndicBART is competitive with or better than mBART50 despite containing significantly fewer parameters. Our analyses focus on identifying the impact of script unification (to Devanagari), corpora size as well as multilingualism on the final performance. The IndicBART model is available under the MIT license at https://indicnlp.ai4bharat.org/indic-bart.
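
The script unification mentioned in the abstract above relies on the fact that the Unicode blocks for the major Indic scripts share a common layout, so a character can often be mapped to Devanagari by shifting its code point to the same offset in the Devanagari block. The sketch below illustrates only that idea. It is an approximation, not the conversion code used for the released model: `to_devanagari` and `SCRIPT_STARTS` are illustrative names, and a plain offset shift is lossy for script-specific characters that have no Devanagari counterpart.

```python
# Start of the 128-code-point Unicode block for each script.
DEVANAGARI_START = 0x0900
SCRIPT_STARTS = {
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}


def to_devanagari(text: str, script: str) -> str:
    """Shift characters of `script` to the same offset in the Devanagari block."""
    start = SCRIPT_STARTS[script]
    out = []
    for ch in text:
        cp = ord(ch)
        if start <= cp < start + 0x80:
            out.append(chr(cp - start + DEVANAGARI_START))
        else:
            out.append(ch)  # spaces, ASCII, and other scripts pass through
    return "".join(out)


unified = to_devanagari("বাংলা", "bengali")
print(unified)
```

Mapping every language into one script lets related words in different languages share subword vocabulary entries, which is the kind of transfer between similar languages the abstract refers to.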

computational linguistic, indicbart, translation, (13 more...)

2109.02903

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
Asia > China > Hong Kong (0.04)
(14 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)