 Meghalaya


FairI Tales: Evaluation of Fairness in Indian Contexts with a Focus on Bias and Stereotypes

arXiv.org Artificial Intelligence

Existing studies on fairness are largely Western-focused, making them inadequate for culturally diverse countries such as India. To address this gap, we introduce INDIC-BIAS, a comprehensive India-centric benchmark designed to evaluate fairness of LLMs across 85 identity groups encompassing diverse castes, religions, regions, and tribes. We first consult domain experts to curate over 1,800 socio-cultural topics spanning behaviors and situations, where biases and stereotypes are likely to emerge. Grounded in these topics, we generate and manually validate 20,000 real-world scenario templates to probe LLMs for fairness. We structure these templates into three evaluation tasks: plausibility, judgment, and generation. Our evaluation of 14 popular LLMs on these tasks reveals strong negative biases against marginalized identities, with models frequently reinforcing common stereotypes. Additionally, we find that models struggle to mitigate bias even when explicitly asked to rationalize their decision. Our evaluation provides evidence of both allocative and representational harms that current LLMs could cause towards Indian identities, calling for a more cautious usage in practical applications. We release INDIC-BIAS as an open-source benchmark to advance research on benchmarking and mitigating biases and stereotypes in the Indian context.


Generate, Discriminate, Evolve: Enhancing Context Faithfulness via Fine-Grained Sentence-Level Self-Evolution

arXiv.org Artificial Intelligence

Improving context faithfulness in large language models is essential for developing trustworthy retrieval augmented generation systems and mitigating hallucinations, especially in long-form question answering (LFQA) tasks or scenarios involving knowledge conflicts. Existing methods either intervene in LLMs only at inference without addressing their inherent limitations or overlook the potential for self-improvement. In this paper, we introduce GenDiE (Generate, Discriminate, Evolve), a novel self-evolving framework that enhances context faithfulness through fine-grained sentence-level optimization. GenDiE combines both generative and discriminative training, equipping LLMs with self-generation and self-scoring capabilities to facilitate iterative self-evolution. This supports both data construction for model alignment and score-guided search during inference. Furthermore, by treating each sentence in a response as an independent optimization unit, GenDiE effectively addresses the limitations of previous approaches that optimize at the holistic answer level, which may miss unfaithful details. Experiments on ASQA (in-domain LFQA) and ConFiQA (out-of-domain counterfactual QA) datasets demonstrate that GenDiE surpasses various baselines in both faithfulness and correctness, and exhibits robust performance for domain adaptation.
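
The score-guided, sentence-level search mentioned in the abstract can be pictured with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not give GenDiE's actual search procedure, and the names `sentence_level_search`, `toy_propose`, and `toy_score` are hypothetical stand-ins for the model's self-generation and self-scoring steps. The sketch simply picks, one sentence at a time, the candidate that a scoring function rates as most faithful to the context.

```python
from typing import Callable, List


def sentence_level_search(
    context: str,
    propose: Callable[[str, List[str]], List[str]],
    score: Callable[[str, List[str], str], float],
    max_sentences: int = 5,
) -> List[str]:
    """Greedily build an answer sentence by sentence.

    `propose` returns candidate next sentences given the context and the
    answer so far; `score` rates one candidate. Both are hypothetical
    placeholders for a model's generation and scoring heads.
    """
    answer: List[str] = []
    for _ in range(max_sentences):
        candidates = propose(context, answer)
        if not candidates:
            break  # nothing left to add
        # Each sentence is selected on its own score, not the whole answer's.
        best = max(candidates, key=lambda s: score(context, answer, s))
        answer.append(best)
    return answer


def toy_propose(context: str, answer: List[str]) -> List[str]:
    """Toy generator: a fixed pool of candidates for each sentence position."""
    pool = [
        ["Paris is the capital of France.", "Lyon is the capital of France."],
        ["It lies on the Seine.", "It lies on the Thames."],
    ]
    return pool[len(answer)] if len(answer) < len(pool) else []


def toy_score(context: str, answer: List[str], sentence: str) -> float:
    """Toy faithfulness score: fraction of sentence words found in the context."""
    ctx = set(context.lower().replace(".", "").split())
    words = sentence.lower().replace(".", "").split()
    return sum(w in ctx for w in words) / len(words)


context = "Paris is the capital of France and lies on the Seine."
result = sentence_level_search(context, toy_propose, toy_score)
```

With the toy scorer, the candidates that contradict the context ("Lyon", "Thames") score lower and are discarded at their own sentence position, which is the kind of fine-grained filtering an answer-level score could miss.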


Injecting Explainability and Lightweight Design into Weakly Supervised Video Anomaly Detection Systems

arXiv.org Artificial Intelligence

Weakly Supervised Monitoring Anomaly Detection (WSMAD) utilizes weak supervision learning to identify anomalies, a critical task for smart city monitoring. However, existing multimodal approaches often fail to meet the real-time and interpretability requirements of edge devices due to their complexity. This paper presents TCVADS (Two-stage Cross-modal Video Anomaly Detection System), which leverages knowledge distillation and cross-modal contrastive learning to enable efficient, accurate, and interpretable anomaly detection on edge devices. TCVADS operates in two stages: coarse-grained rapid classification and fine-grained detailed analysis. In the first stage, TCVADS extracts features from video frames and inputs them into a time series analysis module, which acts as the teacher model. Insights are then transferred via knowledge distillation to a simplified convolutional network (student model) for binary classification. Upon detecting an anomaly, the second stage is triggered, employing a fine-grained multi-class classification model. This stage uses CLIP for cross-modal contrastive learning with text and images, enhancing interpretability and achieving refined classification through specially designed triplet textual relationships. Experimental results demonstrate that TCVADS significantly outperforms existing methods in model performance, detection efficiency, and interpretability, offering valuable contributions to smart city monitoring applications.
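
The teacher-to-student transfer in the first stage can be illustrated with the widely used soft-target distillation loss. This is a sketch under stated assumptions: the abstract does not say which distillation objective TCVADS uses, and the `temperature` and `alpha` values, as well as the function names, are illustrative choices rather than the authors' settings.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(
    student_logits: List[float],
    teacher_logits: List[float],
    label: int,
    temperature: float = 2.0,  # assumed value, not from the paper
    alpha: float = 0.5,        # assumed weighting, not from the paper
) -> float:
    """Weighted sum of a soft-target term and a hard-label term."""
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    # The T^2 factor keeps the soft-term gradients on a comparable scale.
    soft_term = (temperature ** 2) * kl_divergence(soft_teacher, soft_student)
    # Cross-entropy of the student's unscaled prediction with the true label.
    hard_term = -math.log(softmax(student_logits)[label])
    return alpha * soft_term + (1 - alpha) * hard_term


# Binary case (normal vs. anomalous), matching the coarse first stage.
agreeing = distillation_loss([2.0, -1.0], [2.0, -1.0], label=0)
disagreeing = distillation_loss([-1.0, 2.0], [2.0, -1.0], label=0)
```

A student whose logits match the teacher's incurs no soft-target penalty, so only the hard-label term remains; a student that disagrees with both the teacher and the label is penalized by both terms.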


A Breadth-First Catalog of Text Processing, Speech Processing and Multimodal Research in South Asian Languages

arXiv.org Artificial Intelligence

We review the recent literature (January 2022 to October 2024) in South Asian languages on text-based language processing, multimodal models, and speech processing, and provide a spotlight analysis focused on 21 low-resource South Asian languages, namely Saraiki, Assamese, Balochi, Bhojpuri, Bodo, Burmese, Chhattisgarhi, Dhivehi, Gujarati, Kannada, Kashmiri, Konkani, Khasi, Malayalam, Meitei, Nepali, Odia, Pashto, Rajasthani, Sindhi, and Telugu. We identify trends, challenges, and future research directions, using a step-wise approach that incorporates relevance classification and clustering based on large language models (LLMs). Our goal is to provide a breadth-first overview of the recent developments in South Asian language technologies to NLP researchers interested in working with South Asian languages.


Explaining the Unexplained: Revealing Hidden Correlations for Better Interpretability

arXiv.org Artificial Intelligence

Thanks to the rapid advancement of computer hardware, deep learning has made significant progress in the application of unstructured data, such as images (Cao & Chen, 2025) and text (Li et al., 2024). Specifically, the success of representation learning (Wang & Lian, 2025; Zhang et al., 2025) has gradually replaced the earlier approaches of transforming unstructured data into structured formats. The key to the success of representation learning lies in leveraging a large number of parameters for backpropagation, enabling the model to adapt to data with non-normal distributions. Although models based on backpropagation neural networks (Yang et al., 2019; Banerjee et al., 2023) have achieved significant technical advancements, their application in many sensitive domains, such as medicine (Zhang et al., 2025) and industrial inspection (Rathee et al., 2021), still faces considerable challenges due to the difficulty in understanding the basis of their decision-making. Explainable Artificial Intelligence (XAI) aims to reveal the inner mechanisms of neural network decisions, thereby making these models more reliable for applications in sensitive domains. In recent years, several studies (Li et al., 2025; Jing et al., 2025; Liu et al., 2024; Guan et al., 2024) have focused on injecting explainability into deep learning models and using various visualization techniques to explain the decisions of these "black box" models. While these models have achieved a certain level of interpretability, two pressing issues remain (Huang & Marques, 2023; Huang & Marques, 2024): first, whether the correlations between different attributes are correctly evaluated, and second, whether the model's decision-making pathway truly aligns with human reasoning, even when the model's understanding appears consistent with user expectations.


From Explicit Rules to Implicit Reasoning in an Interpretable Violence Monitoring System

arXiv.org Artificial Intelligence

Recently, research based on pre-trained models has demonstrated outstanding performance in violence surveillance tasks. However, most of them were black-box systems which faced challenges regarding explainability during training and inference processes. An important question is how to incorporate explicit knowledge into these implicit models, thereby designing expert-driven and interpretable violence surveillance systems. This paper proposes a new paradigm for weakly supervised violence monitoring (WSVM) called Rule base Violence Monitoring (RuleVM). The proposed RuleVM uses a dual-branch structure with different designs for images and text. One of the branches is called the implicit branch, which uses only visual features for coarse-grained binary classification. In this branch, image feature extraction is divided into two channels: one responsible for extracting scene frames and the other focusing on extracting actions. The other branch is called the explicit branch, which utilizes language-image alignment to perform fine-grained classification. For the language channel design in the explicit branch, the proposed RuleVM uses the state-of-the-art YOLOWorld model to detect objects in video frames, and association rules are identified through data mining methods as descriptions of the video. Leveraging the dual-branch architecture, RuleVM achieves interpretable coarse-grained and fine-grained violence surveillance. Extensive experiments were conducted on two commonly used benchmarks, and the results show that RuleVM achieved the best performance in both coarse-grained and fine-grained monitoring, significantly outperforming existing state-of-the-art methods. Moreover, interpretability experiments uncovered some interesting rules, such as the observation that as the number of people increases, the risk level of violent behavior also rises.
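
To make the association-rule step concrete, the sketch below computes the two standard rule statistics, support and confidence, over sets of detected objects. It is only an illustration of the general data mining notion: the abstract does not state which mining algorithm or thresholds RuleVM uses, and the function name and the toy frame contents are hypothetical.

```python
from typing import FrozenSet, List, Set, Tuple


def rule_stats(
    transactions: List[Set[str]],
    antecedent: FrozenSet[str],
    consequent: FrozenSet[str],
) -> Tuple[float, float]:
    """Return (support, confidence) of the rule antecedent -> consequent.

    support    = share of transactions containing antecedent and consequent
    confidence = share of antecedent-containing transactions that also
                 contain the consequent (0.0 if the antecedent never occurs)
    """
    if not transactions:
        return 0.0, 0.0
    with_antecedent = [t for t in transactions if antecedent <= t]
    with_both = [t for t in with_antecedent if consequent <= t]
    support = len(with_both) / len(transactions)
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence


# Toy data: each set stands for the object labels detected in one frame.
frames = [
    {"person", "knife"},
    {"person"},
    {"person", "knife"},
    {"car"},
]
support, confidence = rule_stats(frames, frozenset({"knife"}), frozenset({"person"}))
```

In this toy data the rule "knife implies person" holds in half of all frames and in every frame that contains a knife, so a high-confidence rule can still have moderate support.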


SPRING Lab IITM's submission to Low Resource Indic Language Translation Shared Task

arXiv.org Artificial Intelligence

We develop a robust translation model for four low-resource Indic languages: Khasi, Mizo, Manipuri, and Assamese. Our approach includes a comprehensive pipeline from data collection and preprocessing to training and evaluation, leveraging data from WMT task datasets, BPCC, PMIndia, and OpenLanguageData. To address the scarcity of bilingual data, we use back-translation techniques on monolingual datasets for Mizo and Khasi, significantly expanding our training corpus. We fine-tune the pre-trained NLLB 3.3B model for Assamese, Mizo, and Manipuri, achieving improved performance over the baseline. For Khasi, which is not supported by the NLLB model, we introduce special tokens and train the model on our Khasi corpus. Our training involves masked language modelling, followed by fine-tuning for English-to-Indic and Indic-to-English translations.
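
The back-translation step can be sketched as follows. This is a minimal illustration of the general technique, not the authors' pipeline: `reverse_translate` is a hypothetical callable standing in for whatever target-to-source model is used, and the empty-output filter is an assumed cleanup step.

```python
from typing import Callable, Iterable, List, Tuple


def back_translate(
    monolingual_target: Iterable[str],
    reverse_translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Build synthetic (source, target) pairs from target-side monolingual text.

    Each genuine target-language sentence is translated back into the source
    language; the machine-made source is then paired with the human-written
    target so that the target side of the training data stays clean.
    """
    pairs: List[Tuple[str, str]] = []
    for target_sentence in monolingual_target:
        synthetic_source = reverse_translate(target_sentence)
        if synthetic_source.strip():  # drop empty outputs (assumed filter)
            pairs.append((synthetic_source, target_sentence))
    return pairs


# Toy stand-in for a translation model: upper-casing instead of translating.
synthetic_pairs = back_translate(["sentence one", "sentence two"], str.upper)
```

The synthetic pairs would then be mixed with the available bilingual data when training the source-to-target direction.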


Agricultural Landscape Understanding At Country-Scale

arXiv.org Artificial Intelligence

The global food system is facing unprecedented challenges. In 2023, 2.4 billion people experienced moderate to severe food insecurity [1], a crisis precipitated by anthropogenic climate change and evolving dietary preferences. Furthermore, the food system itself significantly contributes to the climate crisis, with food loss and waste accounting for 2.4 gigatonnes of carbon dioxide equivalent emissions per year (GT CO2e/yr) [2], and the production, mismanagement, and misapplication of agricultural inputs such as fertilizers and manure generating an additional 2.5 GT CO2e/yr [3]. To sustain a projected global population of 9.6 billion by 2050, the Food and Agriculture Organization (FAO) estimates that food production must increase by at least 60% [1]. However, this also presents an opportunity: transitioning to sustainable agricultural practices can transform the sector from a net source of greenhouse gas emissions to a vital carbon sink.


Investigating Large Language Models for Complex Word Identification in Multilingual and Multidomain Setups

arXiv.org Artificial Intelligence

Complex Word Identification (CWI) is an essential step in the lexical simplification task and has recently become a task on its own. Some variations of this binary classification task have emerged, such as lexical complexity prediction (LCP) and complexity evaluation of multi-word expressions (MWE). Large language models (LLMs) recently became popular in the Natural Language Processing community because of their versatility and capability to solve unseen tasks in zero/few-shot settings. Our work investigates LLM usage, specifically open-source models such as Llama 2, Llama 3, and Vicuna v1.5, and closed-source models such as ChatGPT-3.5-turbo and GPT-4o, in the CWI, LCP, and MWE settings. We evaluate zero-shot, few-shot, and fine-tuning settings and show that LLMs struggle in certain conditions or achieve comparable results against existing methods. In addition, we provide some views on meta-learning combined with prompt learning. In the end, we conclude that LLMs in their current state cannot outperform, or only barely outperform, existing methods, which are usually much smaller.


LAHAJA: A Robust Multi-accent Benchmark for Evaluating Hindi ASR Systems

arXiv.org Artificial Intelligence

Hindi, one of the most widely spoken languages of India, exhibits a diverse array of accents due to its usage among individuals from diverse linguistic origins. To enable a robust evaluation of Hindi ASR systems on multiple accents, we create a benchmark, LAHAJA, which contains read and extempore speech on a diverse set of topics and use cases, with a total of 12.5 hours of Hindi audio, sourced from 132 speakers spanning 83 districts of India. We evaluate existing open-source and commercial models on LAHAJA and find their performance to be poor. We then train models using different datasets and find that our model trained on multilingual data with good speaker diversity outperforms existing models by a significant margin. We also present a fine-grained analysis which shows that the performance declines for speakers from North-East and South India, especially with content heavy in named entities and specialized terminology.