AITopics | Mizoram

Collaborating Authors

Mizoram

Health Sentinel: An AI Pipeline For Real-time Disease Outbreak Detection

Pant, Devesh, Grandhe, Rishi Raj, Samaria, Vipin, Paul, Mukul, Kumar, Sudhir, Khanna, Saransh, Agrawal, Jatin, Kalra, Jushaan Singh, VSSG, Akhil, Khalikar, Satish V, Garg, Vipin, Chauhan, Himanshu, Verma, Pranay, Khandelwal, Neha, Dhavala, Soma S, Mathew, Minesh

arXiv.org Artificial IntelligenceNov-26-2025

Early detection of disease outbreaks is crucial to ensure timely intervention by the health authorities. Due to the challenges associated with traditional indicator-based surveillance, monitoring informal sources such as online media has become increasingly popular. However, owing to the number of online articles getting published everyday, manual screening of the articles is impractical. To address this, we propose Health Sentinel. It is a multi-stage information extraction pipeline that uses a combination of ML and non-ML methods to extract events-structured information concerning disease outbreaks or other unusual health events-from online articles. The extracted events are made available to the Media Scanning and Verification Cell (MSVC) at the National Centre for Disease Control (NCDC), Delhi for analysis, interpretation and further dissemination to local agencies for timely intervention. From April 2022 till date, Health Sentinel has processed over 300 million news articles and identified over 95,000 unique health events across India of which over 3,500 events were shortlisted by the public health experts at NCDC as potential outbreaks.

information, large language model, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2506.19548

Country:

Asia > North Korea (0.04)
Asia > India > Chhattisgarh (0.04)
North America > United States > Iowa (0.04)
(5 more...)

Genre: Research Report (0.64)

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)
Health & Medicine > Public Health (1.00)
Health & Medicine > Epidemiology (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(2 more...)

Add feedback

IndicVisionBench: Benchmarking Cultural and Multilingual Understanding in VLMs

Faraz, Ali, Akash, null, Khan, Shaharukh, Kolla, Raja, Patidar, Akshat, Goswami, Suranjan, Ravi, Abhinav, Khatri, Chandra, Agarwal, Shubham

arXiv.org Artificial IntelligenceNov-10-2025

Vision-language models (VLMs) have demonstrated impressive generalization across multimodal tasks, yet most evaluation benchmarks remain Western-centric, leaving open questions about their performance in culturally diverse and multilingual settings. To address this gap, we introduce IndicVisionBench, the first large-scale benchmark centered on the Indian subcontinent. Our final benchmark consists of a total of 5K images and 37K+ QA pairs across 13 culturally grounded topics. In addition, we release a paired parallel corpus of annotations across 10 Indic languages, creating a unique resource for analyzing cultural and linguistic biases in VLMs. We evaluate a broad spectrum of 8 models, from proprietary closed-source systems to open-weights medium and large-scale models. Our experiments reveal substantial performance gaps, underscoring the limitations of current VLMs in culturally diverse contexts. By centering cultural diversity and multilinguality, IndicVisionBench establishes a reproducible evaluation framework that paves the way for more inclusive multimodal research. Vision-language models (VLMs) (Bai et al., 2023; Chen et al., 2024; Lu et al., 2024; Wang et al., 2024b; Laurenc on et al., 2024; Tong et al., 2024; Xue et al., 2024) have demonstrated strong performance across a variety of multimodal tasks. However, existing benchmarks (Antol et al., 2015; Fu et al., 2023; Goyal et al., 2017) remain heavily Western-centric, limiting our understanding of how these models generalize to culturally diverse and multilingual settings. While some recent efforts partially cover this diversity (Romero et al., 2024; Nayak et al., 2024; V ayani et al., 2025), a systematic, large-scale benchmark capturing India-specific cultural concepts across multiple languages is still lacking. To address this gap, we introduce IndicVisionBench, a culturally grounded evaluation benchmark tailored for the Indian subcontinent. To the best of our knowledge, this is the first large-scale benchmark explicitly designed to assess VLMs in the context of Indian culture and languages. We use states as a proxy for cultural groups following prior works (Adilazuarda et al., 2024; Nayak et al., 2024).

arxiv preprint arxiv, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2511.04727

Country:

Europe > Denmark > Capital Region > Copenhagen (0.04)
Asia > India > Tamil Nadu (0.04)
Asia > India > Nagaland (0.04)
(25 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.99)

Add feedback

DRISHTIKON: A Multimodal Multilingual Benchmark for Testing Language Models' Understanding on Indian Culture

Maji, Arijit, Kumar, Raghvendra, Ghosh, Akash, Anushka, null, Shah, Nemil, Borah, Abhilekh, Shah, Vanshika, Mishra, Nishant, Saha, Sriparna

arXiv.org Artificial IntelligenceSep-24-2025

We introduce DRISHTIKON, a first-of-its-kind multimodal and multilingual benchmark centered exclusively on Indian culture, designed to evaluate the cultural understanding of generative AI systems. Unlike existing benchmarks with a generic or global scope, DRISHTIKON offers deep, fine-grained coverage across India's diverse regions, spanning 15 languages, covering all states and union territories, and incorporating over 64,000 aligned text-image pairs. The dataset captures rich cultural themes including festivals, attire, cuisines, art forms, and historical heritage amongst many more. We evaluate a wide range of vision-language models (VLMs), including open-source small and large models, proprietary systems, reasoning-specialized VLMs, and Indic-focused models, across zero-shot and chain-of-thought settings. Our results expose key limitations in current models' ability to reason over culturally grounded, multimodal inputs, particularly for low-resource languages and less-documented traditions. DRISHTIKON fills a vital gap in inclusive AI research, offering a robust testbed to advance culturally aware, multimodally competent language technologies.

arxiv preprint arxiv, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2509.19274

Country:

Europe > Austria > Vienna (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > India > Jharkhand (0.04)
(24 more...)

Genre:

Overview (1.00)
Research Report > New Finding (0.48)

Industry: Leisure & Entertainment (0.92)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Better To Ask in English? Evaluating Factual Accuracy of Multilingual LLMs in English and Low-Resource Languages

Rohera, Pritika, Ginimav, Chaitrali, Sawant, Gayatri, Joshi, Raviraj

arXiv.org Artificial IntelligenceSep-16-2025

Multilingual Large Language Models (LLMs) have demonstrated significant effectiveness across various languages, particularly in high-resource languages such as English. However, their performance in terms of factual accuracy across other low-resource languages, especially Indic languages, remains an area of investigation. In this study, we assess the factual accuracy of LLMs - GPT-4o, Gemma-2-9B, Gemma-2-2B, and Llama-3.1-8B - by comparing their performance in English and Indic languages using the IndicQuest dataset, which contains question-answer pairs in English and 19 Indic languages. By asking the same questions in English and their respective Indic translations, we analyze whether the models are more reliable for regional context questions in Indic languages or when operating in English. Our findings reveal that LLMs often perform better in English, even for questions rooted in Indic contexts. Notably, we observe a higher tendency for hallucination in responses generated in low-resource Indic languages, highlighting challenges in the multilingual understanding capabilities of current LLMs.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2504.20022

Country:

Asia > Thailand > Bangkok > Bangkok (0.04)
Asia > India > West Bengal (0.04)
Asia > India > Uttarakhand (0.04)
(21 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.59)

Add feedback

Wavelet-SARIMA-Transformer: A Hybrid Model for Rainfall Forecasting

Saikia, Junmoni, Goswami, Kuldeep, Kakaty, Sarat C.

arXiv.org Artificial IntelligenceSep-16-2025

This study develops and evaluates a novel hybridWavelet SARIMA Transformer, WST framework to forecast using monthly rainfall across five meteorological subdivisions of Northeast India over the 1971 to 2023 period. The approach employs the Maximal Overlap Discrete Wavelet Transform, MODWT with four wavelet families such as, Haar, Daubechies, Symlet, Coiflet etc. to achieve shift invariant, multiresolution decomposition of the rainfall series. Linear and seasonal components are modeled using Seasonal ARIMA, SARIMA, while nonlinear components are modeled by a Transformer network, and forecasts are reconstructed via inverse MODWT. Comprehensive validation using an 80 is to 20 train test split and multiple performance indices such as, RMSE, MAE, SMAPE, Willmotts d, Skill Score, Percent Bias, Explained Variance, and Legates McCabes E1 demonstrates the superiority of the Haar-based hybrid model, WHST. Across all subdivisions, WHST consistently achieved lower forecast errors, stronger agreement with observed rainfall, and unbiased predictions compared with stand alone SARIMA, stand-alone Transformer, and two-stage wavelet hybrids. Residual adequacy was confirmed through the Ljung Box test, while Taylor diagrams provided an integrated assessment of correlation, variance fidelity, and RMSE, further reinforcing the robustness of the proposed approach. The results highlight the effectiveness of integrating multiresolution signal decomposition with complementary linear and deep learning models for hydroclimatic forecasting. Beyond rainfall, the proposed WST framework offers a scalable methodology for forecasting complex environmental time series, with direct implications for flood risk management, water resources planning, and climate adaptation strategies in data-sparse and climate-sensitive regions.

forecasting, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2509.11903

Country:

North America > Trinidad and Tobago > Trinidad > Arima > Arima (0.25)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
Asia > India > West Bengal (0.05)
(12 more...)

Genre: Research Report > Experimental Study (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.94)
Information Technology > Data Science > Data Quality > Data Transformation (0.87)

Add feedback

Farm-Level, In-Season Crop Identification for India

Deshpande, Ishan, Reehal, Amandeep Kaur, Nath, Chandan, Singh, Renu, Patel, Aayush, Jayagopal, Aishwarya, Singh, Gaurav, Aggarwal, Gaurav, Agarwal, Amit, Bele, Prathmesh, Reddy, Sridhar, Warrier, Tanya, Singh, Kinjal, Tendulkar, Ashish, Outon, Luis Pazos, Saxena, Nikita, Dondzik, Agata, Tewari, Dinesh, Garg, Shruti, Singh, Avneet, Dhand, Harsh, Rajan, Vaibhav, Talekar, Alok

arXiv.org Artificial IntelligenceJul-8-2025

Accurate, timely, and farm-level crop type information is paramount for national food security, agricultural policy formulation, and economic planning, particularly in agriculturally significant nations like India. While remote sensing and machine learning have become vital tools for crop monitoring, existing approaches often grapple with challenges such as limited geographical scalability, restricted crop type coverage, the complexities of mixed-pixel and heterogeneous landscapes, and crucially, the robust in-season identification essential for proactive decision-making. We present a framework designed to address the critical data gaps for targeted data driven decision making which generates farm-level, in-season, multi-crop identification at national scale (India) using deep learning. Our methodology leverages the strengths of Sentinel-1 and Sentinel-2 satellite imagery, integrated with national-scale farm boundary data. The model successfully identifies 12 major crops (which collectively account for nearly 90% of India's total cultivated area showing an agreement with national crop census 2023-24 of 94% in winter, and 75% in monsoon season). Our approach incorporates an automated season detection algorithm, which estimates crop sowing and harvest periods. This allows for reliable crop identification as early as two months into the growing season and facilitates rigorous in-season performance evaluation. Furthermore, we have engineered a highly scalable inference pipeline, culminating in what is, to our knowledge, the first pan-India, in-season, farm-level crop type data product. The system's effectiveness and scalability are demonstrated through robust validation against national agricultural statistics, showcasing its potential to deliver actionable, data-driven insights for transformative agricultural monitoring and management across India.

classification, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2507.02972

Country:

North America > United States (0.28)
North America > Canada > Ontario (0.14)
South America > Peru (0.04)
(16 more...)

Genre: Research Report (0.41)

Industry:

Food & Agriculture > Agriculture (1.00)
Energy > Renewable > Geothermal > Geothermal Energy Exploration and Development > Geophysical Analysis & Survey (0.57)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.94)

Add feedback

FairI Tales: Evaluation of Fairness in Indian Contexts with a Focus on Bias and Stereotypes

Nawale, Janki Atul, Khan, Mohammed Safi Ur Rahman, D, Janani, Gupta, Mansi, Pruthi, Danish, Khapra, Mitesh M.

arXiv.org Artificial IntelligenceJul-1-2025

Existing studies on fairness are largely Western-focused, making them inadequate for culturally diverse countries such as India. To address this gap, we introduce INDIC-BIAS, a comprehensive India-centric benchmark designed to evaluate fairness of LLMs across 85 identity groups encompassing diverse castes, religions, regions, and tribes. We first consult domain experts to curate over 1,800 socio-cultural topics spanning behaviors and situations, where biases and stereotypes are likely to emerge. Grounded in these topics, we generate and manually validate 20,000 real-world scenario templates to probe LLMs for fairness. We structure these templates into three evaluation tasks: plausibility, judgment, and generation. Our evaluation of 14 popular LLMs on these tasks reveals strong negative biases against marginalized identities, with models frequently reinforcing common stereotypes. Additionally, we find that models struggle to mitigate bias even when explicitly asked to rationalize their decision. Our evaluation provides evidence of both allocative and representational harms that current LLMs could cause towards Indian identities, calling for a more cautious usage in practical applications. We release INDIC-BIAS as an open-source benchmark to advance research on benchmarking and mitigating biases and stereotypes in the Indian context.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2506.23111

Country:

Asia > India > Bihar (0.14)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
Asia > India > Uttar Pradesh (0.04)
(37 more...)

Genre: Research Report > New Finding (0.67)

Industry:

Law > Civil Rights & Constitutional Law (1.00)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Education > Educational Setting (0.92)
Government > Regional Government > Asia Government > India Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Add feedback

Tone recognition in low-resource languages of North-East India: peeling the layers of SSL-based speech models

Gogoi, Parismita, Kalita, Sishir, Lalhminghlui, Wendy, Terhiija, Viyazonuo, Tzudir, Moakala, Sarmah, Priyankoo, Prasanna, S. R. M.

arXiv.org Artificial IntelligenceJun-5-2025

This study explores the use of self-supervised learning (SSL) models for tone recognition in three low-resource languages from North Eastern India: Angami, Ao, and Mizo. We evaluate four Wav2vec2.0 base models that were pre-trained on both tonal and non-tonal languages. We analyze tone-wise performance across the layers for all three languages and compare the different models. Our results show that tone recognition works best for Mizo and worst for Angami. The middle layers of the SSL models are the most important for tone recognition, regardless of the pre-training language, i.e. tonal or non-tonal. We have also found that the tone inventory, tone types, and dialectal variations affect tone recognition. These findings provide useful insights into the strengths and weaknesses of SSL-based embeddings for tonal languages and highlight the potential for improving tone recognition in low-resource settings. The source code is available at GitHub 1 .

artificial intelligence, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2506.03606

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
Asia > India > Nagaland > Kohima (0.04)
North America > Canada > Ontario > Toronto (0.04)
(10 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.46)

Add feedback

Parallel Corpora for Machine Translation in Low-resource Indic Languages: A Comprehensive Review

Raja, Rahul, Vats, Arpita

arXiv.org Artificial IntelligenceMar-2-2025

Parallel corpora play an important role in training machine translation (MT) models, particularly for low-resource languages where high-quality bilingual data is scarce. This review provides a comprehensive overview of available parallel corpora for Indic languages, which span diverse linguistic families, scripts, and regional variations. We categorize these corpora into text-to-text, code-switched, and various categories of multimodal datasets, highlighting their significance in the development of robust multilingual MT systems. Beyond resource enumeration, we critically examine the challenges faced in corpus creation, including linguistic diversity, script variation, data scarcity, and the prevalence of informal textual content.We also discuss and evaluate these corpora in various terms such as alignment quality and domain representativeness. Furthermore, we address open challenges such as data imbalance across Indic languages, the trade-off between quality and quantity, and the impact of noisy, informal, and dialectal data on MT performance. Finally, we outline future directions, including leveraging cross-lingual transfer learning, expanding multilingual datasets, and integrating multimodal resources to enhance translation quality. To the best of our knowledge, this paper presents the first comprehensive review of parallel corpora specifically tailored for low-resource Indic languages in the context of machine translation.

corpora, dataset, translation, (14 more...)

arXiv.org Artificial Intelligence

2503.04797

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
North America > United States > California > San Francisco County > San Francisco (0.14)
Asia > Indonesia > Bali (0.04)
(30 more...)

Genre: Overview (1.00)

Industry:

Education (0.67)
Government (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

Mitigating Hallucinated Translations in Large Language Models with Hallucination-focused Preference Optimization

Tang, Zilu, Chatterjee, Rajen, Garg, Sarthak

arXiv.org Artificial IntelligenceJan-28-2025

Machine Translation (MT) is undergoing a paradigm shift, with systems based on fine-tuned large language models (LLM) becoming increasingly competitive with traditional encoder-decoder models trained specifically for translation tasks. However, LLM-based systems are at a higher risk of generating hallucinations, which can severely undermine user's trust and safety. Most prior research on hallucination mitigation focuses on traditional MT models, with solutions that involve post-hoc mitigation - detecting hallucinated translations and re-translating them. While effective, this approach introduces additional complexity in deploying extra tools in production and also increases latency. To address these limitations, we propose a method that intrinsically learns to mitigate hallucinations during the model training phase. Specifically, we introduce a data creation framework to generate hallucination focused preference datasets. Fine-tuning LLMs on these preference datasets reduces the hallucination rate by an average of 96% across five language pairs, while preserving overall translation quality. In a zero-shot setting our approach reduces hallucinations by 89% on an average across three unseen target languages.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2501.17295

Country:

North America > Canada > Ontario > Toronto (0.04)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
Asia > Middle East > Jordan (0.04)
(18 more...)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback