
Lost in Translation and Noise: A Deep Dive into the Failure Modes of VLMs on Real-World Tables

Singh, Anshul, Chaudhary, Rohan, Singh, Gagneet, Kumary, Abhay

arXiv.org Artificial Intelligence

The impressive performance of VLMs is largely measured on benchmarks that fail to capture the complexities of real-world scenarios. Existing datasets for tabular QA, such as WikiTableQuestions and FinQA, are overwhelmingly monolingual (English) and present tables in a digitally perfect, clean format. This creates a significant gap between research and practice. To address this, we present MirageTVQA, a new benchmark designed to evaluate VLMs on these exact dimensions. Featuring nearly 60,000 QA pairs across 24 languages, MirageTVQA challenges models with tables that are not only multilingual but also visually imperfect, incorporating realistic noise to mimic scanned documents. Our evaluation of the leading VLMs reveals two primary failure points: a severe degradation in performance (over 35% drop for the best models) when faced with visual noise, and a consistent English-first bias where reasoning abilities fail to transfer to other languages. MirageTVQA provides a benchmark for measuring and driving progress towards more robust VLMs for table reasoning. The dataset and the code are available at: https://github.com/anshulsc/MirageTVQA.
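
For readers who want a feel for what "realistic noise to mimic scanned documents" involves, the sketch below degrades a clean table image in a scanner-like way. The exact noise pipeline used by MirageTVQA is not specified in the abstract, so the transforms and parameters here (slight rotation, blur, Gaussian noise via Pillow and NumPy) are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: common "scanned document" degradations applied to a
# rendered table image. Parameters are hypothetical, not from the paper.
import numpy as np
from PIL import Image, ImageFilter

def add_scan_noise(img: Image.Image, rotation_deg: float = 1.5,
                   blur_radius: float = 0.8, noise_sigma: float = 12.0) -> Image.Image:
    """Apply scanner-like degradations to a clean table image."""
    img = img.convert("L")                                      # grayscale, like a scan
    img = img.rotate(rotation_deg, fillcolor=255, expand=True)  # slight skew
    img = img.filter(ImageFilter.GaussianBlur(blur_radius))     # optical blur
    arr = np.asarray(img, dtype=np.float32)
    arr += np.random.normal(0.0, noise_sigma, arr.shape)        # sensor noise
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

# Usage (file names are placeholders):
noisy = add_scan_noise(Image.open("clean_table.png"))
noisy.save("noisy_table.png")
```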


When CNNs Outperform Transformers and Mambas: Revisiting Deep Architectures for Dental Caries Segmentation

Ghimire, Aashish, Zeng, Jun, Paudel, Roshan, Tomar, Nikhil Kumar, Nayak, Deepak Ranjan, Nalla, Harshith Reddy, Jha, Vivek, Reynolds, Glenda, Jha, Debesh

arXiv.org Artificial Intelligence

Accurate identification and segmentation of dental caries in panoramic radiographs are critical for early diagnosis and effective treatment planning. Automated segmentation remains challenging due to low lesion contrast, morphological variability, and limited annotated data. In this study, we present the first comprehensive benchmarking of convolutional neural networks, vision transformers, and state-space Mamba architectures for automated dental caries segmentation on panoramic radiographs, using the DC1000 dataset. Twelve state-of-the-art architectures, including VMUnet, MambaUNet, VMUNetv2, RMAMamba-S, TransNetR, PVTFormer, DoubleU-Net, and ResUNet++, were trained under identical configurations. Contrary to the growing trend toward complex attention-based architectures, the CNN-based DoubleU-Net achieved the highest Dice coefficient of 0.7345, mIoU of 0.5978, and precision of 0.8145, outperforming all transformer and Mamba variants. In fact, the top three results across all performance metrics were achieved by CNN-based architectures. Mamba- and transformer-based methods, despite their theoretical advantage in global context modeling, underperformed due to limited data and weaker spatial priors. These findings underscore that architecture-task alignment matters more than model complexity in domain-specific medical image segmentation. Our code is available at: https://github.com/JunZengz/dental-caries-segmentation.
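
The reported metrics have standard definitions: Dice = 2|A∩B| / (|A| + |B|) and IoU = |A∩B| / |A∪B| over binary masks. A minimal NumPy sketch (not the authors' evaluation code) for computing both:

```python
# Standard Dice coefficient and IoU for binary segmentation masks.
import numpy as np

def dice_and_iou(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7):
    """Return (Dice, IoU) for two binary masks of the same shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    dice = (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)
    iou = (inter + eps) / (np.logical_or(pred, gt).sum() + eps)
    return dice, iou
```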


SinLlama -- A Large Language Model for Sinhala

Aravinda, H. W. K., Sirajudeen, Rashad, Karunathilake, Samith, de Silva, Nisansa, Ranathunga, Surangika, Kaur, Rishemjit

arXiv.org Artificial Intelligence

Low-resource languages such as Sinhala are often overlooked by open-source Large Language Models (LLMs). In this research, we extend an existing multilingual LLM (Llama-3-8B) to better serve Sinhala. We enhance the LLM's tokenizer with Sinhala-specific vocabulary and perform continual pre-training on a cleaned 10 million Sinhala corpus, resulting in the SinLlama model. This is the first decoder-based open-source LLM with explicit Sinhala support. When SinLlama was instruction fine-tuned for three text classification tasks, it outperformed the base and instruct variants of Llama-3-8B by a significant margin.
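
The vocabulary-extension step follows a well-known recipe. Below is a minimal sketch using the Hugging Face transformers API; the checkpoint name and the placeholder token list are assumptions for illustration, not the authors' exact code or vocabulary.

```python
# Minimal sketch of extending a Llama tokenizer with new-language vocabulary,
# assuming Hugging Face transformers (checkpoint access may require auth).
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

# In practice the new tokens would come from training a tokenizer on the
# Sinhala corpus; shown here as a hypothetical placeholder list.
sinhala_tokens = ["සිංහල", "ලංකාව"]
num_added = tokenizer.add_tokens(sinhala_tokens)
model.resize_token_embeddings(len(tokenizer))  # new embedding rows are randomly initialized

# Continual pre-training on the Sinhala corpus would then follow
# (e.g., with the transformers Trainer on a causal-LM objective).
```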


IndicVisionBench: Benchmarking Cultural and Multilingual Understanding in VLMs

Faraz, Ali, Akash, Khan, Shaharukh, Kolla, Raja, Patidar, Akshat, Goswami, Suranjan, Ravi, Abhinav, Khatri, Chandra, Agarwal, Shubham

arXiv.org Artificial Intelligence

Vision-language models (VLMs) have demonstrated impressive generalization across multimodal tasks, yet most evaluation benchmarks remain Western-centric, leaving open questions about their performance in culturally diverse and multilingual settings. To address this gap, we introduce IndicVisionBench, the first large-scale benchmark centered on the Indian subcontinent. Our final benchmark consists of a total of 5K images and 37K+ QA pairs across 13 culturally grounded topics. In addition, we release a paired parallel corpus of annotations across 10 Indic languages, creating a unique resource for analyzing cultural and linguistic biases in VLMs. We evaluate a broad spectrum of 8 models, from proprietary closed-source systems to open-weight medium- and large-scale models. Our experiments reveal substantial performance gaps, underscoring the limitations of current VLMs in culturally diverse contexts. By centering cultural diversity and multilinguality, IndicVisionBench establishes a reproducible evaluation framework that paves the way for more inclusive multimodal research. Vision-language models (VLMs) (Bai et al., 2023; Chen et al., 2024; Lu et al., 2024; Wang et al., 2024b; Laurençon et al., 2024; Tong et al., 2024; Xue et al., 2024) have demonstrated strong performance across a variety of multimodal tasks. However, existing benchmarks (Antol et al., 2015; Fu et al., 2023; Goyal et al., 2017) remain heavily Western-centric, limiting our understanding of how these models generalize to culturally diverse and multilingual settings. While some recent efforts partially cover this diversity (Romero et al., 2024; Nayak et al., 2024; Vayani et al., 2025), a systematic, large-scale benchmark capturing India-specific cultural concepts across multiple languages is still lacking. To address this gap, we introduce IndicVisionBench, a culturally grounded evaluation benchmark tailored for the Indian subcontinent. To the best of our knowledge, this is the first large-scale benchmark explicitly designed to assess VLMs in the context of Indian culture and languages. We use states as a proxy for cultural groups, following prior works (Adilazuarda et al., 2024; Nayak et al., 2024).


Review Based Entity Ranking using Fuzzy Logic Algorithmic Approach: Analysis

Kalamkar, Pratik N., Phakatkar, Anupama G.

arXiv.org Artificial Intelligence

Opinion mining, also called sentiment analysis, is the field of study that analyzes people's opinions, sentiments, evaluations, appraisals, attitudes, and emotions towards entities such as products, services, organizations, individuals, issues, events, topics, and their attributes. The holistic lexicon-based approach does not consider the strength of each opinion, i.e., whether the opinion is very strongly negative (or positive), strongly negative (or positive), moderately negative (or positive), weakly negative (or positive), or very weakly negative (or positive). In this paper, we propose an approach to rank entities based on the orientation and strength of the entities' reviews and users' queries by classifying them into granularity levels (i.e., very strong, strong, moderate, weak, and very weak). We use a fuzzy logic algorithmic approach to classify opinion words into different categories, and syntactic dependency resolution to find relations for the desired aspect words. Opinion words related to certain aspects of interest are considered to find the entity score for that aspect in the review.
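
To make the fuzzy classification idea concrete, here is a minimal sketch using triangular membership functions over a normalized opinion score in [0, 1]. The paper's actual membership functions and score ranges are not given in the abstract, so the sets below are hypothetical.

```python
# Hedged sketch: classify an opinion-strength score into the five granularity
# levels named in the abstract, using hypothetical triangular fuzzy sets.
def triangular(x: float, a: float, b: float, c: float) -> float:
    """Membership degree of x in a triangular fuzzy set with peak at b."""
    if x == b:
        return 1.0
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

# Hypothetical fuzzy sets (a, b, c) for each strength level.
STRENGTH_SETS = {
    "very weak":   (0.0, 0.0, 0.25),
    "weak":        (0.0, 0.25, 0.5),
    "moderate":    (0.25, 0.5, 0.75),
    "strong":      (0.5, 0.75, 1.0),
    "very strong": (0.75, 1.0, 1.0),
}

def classify_strength(score: float) -> str:
    """Assign the strength level with the highest membership degree."""
    return max(STRENGTH_SETS, key=lambda lvl: triangular(score, *STRENGTH_SETS[lvl]))

print(classify_strength(0.8))  # -> "strong" (membership 0.8 vs 0.2 for "very strong")
```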


Opinion Mining Based Entity Ranking using Fuzzy Logic Algorithmic Approach

Kalamkar, Pratik N., Phakatkar, A. G.

arXiv.org Artificial Intelligence

Opinions are central to almost all human activities and are key influencers of our behavior. With the growth of social networking websites and the increase in the number of e-commerce sites, a huge amount of opinions is now available on the web. Given a set of evaluative statements that contain opinions (or sentiments) about an entity, opinion mining aims to extract the attributes and components of the object that have been commented on in each statement and to determine whether the comments are positive, negative, or neutral. While much research has recently been done in the field of opinion mining, some of it dealing with ranking entities based on review or opinion sets, classifying opinions into finer granularity levels and then ranking entities has not been done before. In this paper, a method for opinion mining from statements at a deeper level of granularity is proposed. This is done using fuzzy logic reasoning, after which entities are ranked according to this information.


LLMTaxo: Leveraging Large Language Models for Constructing Taxonomy of Factual Claims from Social Media

Zhang, Haiqi, Zhu, Zhengyuan, Zhang, Zeyu, Li, Chengkai

arXiv.org Artificial Intelligence

With the rapid expansion of content on social media platforms, analyzing and comprehending online discourse has become increasingly complex. This paper introduces LLMTaxo, a novel framework leveraging large language models for the automated construction of taxonomies of factual claims from social media by generating topics at multiple levels of granularity. The resulting hierarchical structure significantly reduces redundancy and improves information accessibility. We also propose dedicated taxonomy evaluation metrics to enable comprehensive assessment. Evaluations conducted on three diverse datasets demonstrate LLMTaxo's effectiveness in producing clear, coherent, and comprehensive taxonomies. Among the evaluated models, GPT-4o mini consistently outperforms others across most metrics. The framework's flexibility and low reliance on manual intervention underscore its potential for broad applicability.
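
As a rough illustration of the multi-granularity idea (not LLMTaxo's actual implementation), the sketch below prompts an LLM to label each claim at three levels and nests the labels into a hierarchy; `call_llm` is a hypothetical helper standing in for any chat-completion API, and the three-level format is an assumption.

```python
# Hedged sketch: build a claim taxonomy by prompting an LLM for topic labels
# at several granularity levels, then nesting the labels.
from collections import defaultdict

PROMPT = (
    "Classify this factual claim into a broad topic, a mid-level subtopic, "
    "and a fine-grained subtopic, separated by ' > ':\n\nClaim: {claim}"
)

def build_taxonomy(claims, call_llm):
    """Nest broad -> mid -> fine labels; assumes the model follows the format."""
    taxonomy = defaultdict(lambda: defaultdict(set))
    for claim in claims:
        # Expected reply shape, e.g. "Health > Vaccines > mRNA vaccine side effects"
        broad, mid, fine = [t.strip() for t in call_llm(PROMPT.format(claim=claim)).split(">")]
        taxonomy[broad][mid].add(fine)
    return taxonomy
```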


QRïS: A Preemptive Novel Method for Quishing Detection Through Structural Features of QR

Akram, Muhammad Wahid, Sood, Keshav, Hassan, Muneeb Ul

arXiv.org Artificial Intelligence

Globally, individuals and organizations employ Quick Response (QR) codes for swift and convenient communication. Leveraging this, cybercriminals embed falsified and misleading information in QR codes to launch various phishing attacks, a practice termed Quishing. Many previous studies have introduced defensive approaches to preclude Quishing, such as classifying the embedded content of QR codes and labeling the codes accordingly, while other studies classify them using visual features (e.g., deep features, histogram density analysis features). However, these approaches mainly rely on black-box techniques that do not provide the interpretability and transparency needed to fully comprehend and reproduce the intrinsic decision process; they therefore have obvious limitations in trust, accountability, bias detection, and more. We propose QRïS, a pioneering method to classify QR codes through comprehensive structural analysis, which helps identify phishing QR codes beforehand. Our classification method is transparent, which makes it reproducible, scalable, and easy to comprehend. First, we generated a QR code dataset (400,000 samples) using recently published URL datasets [1], [2]. Then, unlike black-box models, we developed a simple algorithm to extract 24 structural features from the layout patterns present in QR codes. We then trained machine learning models on the harvested features and obtained accuracy of up to 83.18%. To further evaluate the effectiveness of our approach, we performed a comparative analysis of the proposed method against relevant contemporary studies. Lastly, for real-world deployment and validation, we developed a mobile app, which assures the feasibility of the proposed solution in real-world scenarios and strengthens the applicability of the study.
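
Since the abstract does not enumerate the 24 structural features, the sketch below uses a few hypothetical stand-ins (QR version inferred from matrix size, dark-module density, row transitions) to show the overall extract-then-classify pipeline with scikit-learn; `qr_matrices` and `labels` are assumed inputs.

```python
# Illustrative extract-then-classify pipeline; the features here are
# hypothetical stand-ins, not the paper's 24 structural features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def extract_features(qr_matrix: np.ndarray) -> list:
    """Hypothetical structural features from a binary QR module matrix."""
    m = qr_matrix.astype(np.int8)        # avoid boolean arithmetic issues
    n = m.shape[0]
    version = (n - 21) // 4 + 1          # QR side length -> version (21x21 is version 1)
    density = m.mean()                   # fraction of dark modules
    row_transitions = np.abs(np.diff(m, axis=1)).mean()  # horizontal light/dark changes
    return [version, density, row_transitions]

# qr_matrices (list of binary module matrices) and labels (0 = benign,
# 1 = phishing) are assumed to come from the generated dataset.
X = np.array([extract_features(m) for m in qr_matrices])
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2)
clf = RandomForestClassifier(n_estimators=200).fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))
```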


EduVidQA: Generating and Evaluating Long-form Answers to Student Questions based on Lecture Videos

Ray, Sourjyadip, Sharma, Shubham, Aditya, Somak, Goyal, Pawan

arXiv.org Artificial Intelligence

As digital platforms redefine educational paradigms, ensuring interactivity remains vital for effective learning. This paper explores using Multimodal Large Language Models (MLLMs) to automatically respond to student questions from online lectures, a novel question answering task of real-world significance. We introduce the EduVidQA dataset with 5,252 question-answer pairs (both synthetic and real-world) from 296 computer science videos covering diverse topics and difficulty levels. To understand the needs of the dataset and task evaluation, we empirically study the qualitative preferences of students, which we provide as an important contribution to this line of work. Our benchmarking experiments cover six state-of-the-art MLLMs, through which we study the effectiveness of our synthetic data for finetuning and show the challenging nature of the task. We evaluate the models using both text-based and qualitative metrics, providing a nuanced perspective on the models' performance that is paramount for future work. This work not only sets a benchmark for this important problem but also opens exciting avenues for future research in Natural Language Processing for Education.


Better To Ask in English? Evaluating Factual Accuracy of Multilingual LLMs in English and Low-Resource Languages

Rohera, Pritika, Ginimav, Chaitrali, Sawant, Gayatri, Joshi, Raviraj

arXiv.org Artificial Intelligence

Multilingual Large Language Models (LLMs) have demonstrated significant effectiveness across various languages, particularly in high-resource languages such as English. However, their performance in terms of factual accuracy across other low-resource languages, especially Indic languages, remains an area of investigation. In this study, we assess the factual accuracy of LLMs - GPT-4o, Gemma-2-9B, Gemma-2-2B, and Llama-3.1-8B - by comparing their performance in English and Indic languages using the IndicQuest dataset, which contains question-answer pairs in English and 19 Indic languages. By asking the same questions in English and their respective Indic translations, we analyze whether the models are more reliable for regional context questions in Indic languages or when operating in English. Our findings reveal that LLMs often perform better in English, even for questions rooted in Indic contexts. Notably, we observe a higher tendency for hallucination in responses generated in low-resource Indic languages, highlighting challenges in the multilingual understanding capabilities of current LLMs.