Patna
Talk, Snap, Complain: Validation-Aware Multimodal Expert Framework for Fine-Grained Customer Grievances
Singh, Rishu Kumar, Shreya, Navneet, Das, Sarmistha, Singh, Apoorva, Saha, Sriparna
Existing approaches to complaint analysis largely rely on unimodal, short-form content such as tweets or product reviews. This work advances the field by leveraging multimodal, multi-turn customer support dialogues, where users often share both textual complaints and visual evidence (e.g., screenshots, product photos) to enable fine-grained classification of complaint aspects and severity. We introduce VALOR, a Validation-Aware Learner with Expert Routing, tailored for this multimodal setting. It employs a multi-expert reasoning setup using large-scale generative models with Chain-of-Thought (CoT) prompting for nuanced decision-making. To ensure coherence between modalities, a semantic alignment score is computed and integrated into the final classification through a meta-fusion strategy. In alignment with the United Nations Sustainable Development Goals (UN SDGs), the proposed framework supports SDG 9 (Industry, Innovation and Infrastructure) by advancing AI-driven tools for robust, scalable, and context-aware service infrastructure. Further, by enabling structured analysis of complaint narratives and visual context, it contributes to SDG 12 (Responsible Consumption and Production) by promoting more responsive product design and improved accountability in consumer services. We evaluate VALOR on a curated multimodal complaint dataset annotated with fine-grained aspect and severity labels, showing that it consistently outperforms baseline models, especially in complex complaint scenarios where information is distributed across text and images. This study underscores the value of multimodal interaction and expert validation in practical complaint understanding systems. Resources related to data and codes are available here: https://github.com/sarmistha-D/VALOR
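The abstract above does not say how the semantic alignment score is computed or how the meta-fusion combines expert outputs. As a minimal illustration only, the sketch below assumes cosine similarity between a text embedding and an image embedding as the alignment score, and a convex combination of two experts' class probabilities as the fusion step. The function names, the weighting rule, and the toy vectors are all hypothetical and are not taken from VALOR.

```python
import math
from typing import List


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine similarity in [-1, 1]; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def fuse_with_alignment(text_probs: List[float],
                        image_probs: List[float],
                        alignment: float) -> List[float]:
    """Assumed fusion rule (hypothetical): trust the image expert more
    when text and image agree semantically.

    alignment = 1  -> text and image experts weighted equally (0.5 / 0.5)
    alignment = -1 -> image expert ignored, text expert used alone
    """
    image_weight = 0.5 * (alignment + 1.0) / 2.0
    text_weight = 1.0 - image_weight
    return [text_weight * t + image_weight * i
            for t, i in zip(text_probs, image_probs)]


# Toy usage: embeddings and probabilities are made-up numbers.
text_emb, image_emb = [1.0, 0.0], [1.0, 0.0]
score = cosine_similarity(text_emb, image_emb)
fused = fuse_with_alignment([0.8, 0.2], [0.2, 0.8], score)
```

Because the fusion is a convex combination, the fused vector remains a valid probability distribution whenever both inputs are.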
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Austria > Vienna (0.14)
- Asia > India > Bihar > Patna (0.04)
- (7 more...)
- Research Report > New Finding (0.46)
- Research Report > Experimental Study (0.46)
Crossing Borders: A Multimodal Challenge for Indian Poetry Translation and Image Generation
Jamil, Sofia, Charan, Kotla Sai, Saha, Sriparna, Goswami, Koustava, J, Joseph K
Indian poetry, known for its linguistic complexity and deep cultural resonance, has a rich and varied heritage spanning thousands of years. However, its layered meanings, cultural allusions, and sophisticated grammatical constructions often pose challenges for comprehension, especially for non-native speakers or readers unfamiliar with its context and language. Despite its cultural significance, existing works on poetry have largely overlooked Indian language poems. In this paper, we propose the Translation and Image Generation (TAI) framework, leveraging Large Language Models (LLMs) and Latent Diffusion Models through appropriate prompt tuning. Our framework supports the United Nations Sustainable Development Goals of Quality Education (SDG 4) and Reduced Inequalities (SDG 10) by enhancing the accessibility of culturally rich Indian-language poetry to a global audience. It includes (1) a translation module that uses an Odds Ratio Preference Alignment Algorithm to accurately translate morphologically rich poetry into English, and (2) an image generation module that employs a semantic graph to capture tokens, dependencies, and semantic relationships between metaphors and their meanings, to create visually meaningful representations of Indian poems. Our comprehensive experimental evaluation, including both human and quantitative assessments, demonstrates the superiority of TAI Diffusion in poem image generation tasks, outperforming strong baselines. To further address the scarcity of resources for Indian-language poetry, we introduce the Morphologically Rich Indian Language Poems MorphoVerse Dataset, comprising 1,570 poems across 21 low-resource Indian languages. By addressing the gap in poetry translation and visual comprehension, this work aims to broaden accessibility and enrich the reader's experience.
- Asia > China (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- Europe > Bulgaria > Varna Province > Varna (0.04)
- (2 more...)
Protein Secondary Structure Prediction Using 3D Graphs and Relation-Aware Message Passing Transformers
Varshney, Disha, Garg, Samarth, Tyagi, Sarthak, Varshney, Deeksha, Deep, Nayan, Ekbal, Asif
In this study, we tackle the challenging task of predicting secondary structures from protein primary sequences, a pivotal initial stride towards predicting tertiary structures, while yielding crucial insights into protein activity, relationships, and functions. Existing methods often utilize extensive sets of unlabeled amino acid sequences. However, these approaches neither explicitly capture nor harness the accessible protein 3D structural data, which is recognized as a decisive factor in dictating protein functions. To address this, we utilize protein residue graphs and introduce various forms of sequential or structural connections to capture enhanced spatial information. We adeptly combine Graph Neural Networks (GNNs) and Language Models (LMs), specifically utilizing a pre-trained transformer-based protein language model to encode amino acid sequences and employing message-passing mechanisms like GCN and R-GCN to capture geometric characteristics of protein structures. Employing convolution within a specific node's nearby region, including relations, we stack multiple convolutional layers to efficiently learn combined insights from the protein's spatial graph, revealing intricate interconnections and dependencies in its structure. To assess our model's performance, we employed the training dataset provided by NetSurfP-2.0, which outlines secondary structure in 3- and 8-states. Extensive experiments show that our proposed model, SSRGNet, surpasses the baseline on F1-scores.
Introduction
Proteins serve as essential components within cells and are involved in various applications, spanning from therapeutics to materials. They are composed of a sequence of amino acids that fold into distinct shapes. With the development of affordable sequencing technologies [1, 2], a substantial number of novel protein sequences have been identified in recent times.
However, annotating the functional properties of a newly discovered protein sequence is still a laborious and expensive process. Thus, there is a need for reliable and efficient computational methods to accurately predict and assign functions to proteins, thereby bridging the gap between sequence information and functional knowledge. The analysis of protein structure, particularly the tertiary structure, is highly significant for practical applications related to proteins, such as understanding their functions and designing drugs [3].
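The relation-aware message passing mentioned in the abstract above can be illustrated with a single R-GCN-style layer, in which each node adds its own transformed features to the per-relation mean of its neighbours' transformed features. The sketch below is a dependency-free illustration and not the SSRGNet implementation. The relation names `sequential` and `spatial`, the toy features, and the weight matrices are assumptions made for the example.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, x: Vector) -> Vector:
    """Multiply matrix W by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rgcn_layer(h: List[Vector],
               edges: List[Tuple[int, int, str]],
               W_rel: Dict[str, Matrix],
               W_self: Matrix) -> List[Vector]:
    """One relational graph convolution step.

    h      : node features, one vector per residue
    edges  : (source, target, relation) triples; messages flow source -> target
    W_rel  : one weight matrix per relation type
    W_self : weight matrix for the self-connection
    """
    out = [matvec(W_self, h_i) for h_i in h]

    # Number of incoming edges per (target, relation), used to average messages.
    counts: Dict[Tuple[int, str], int] = {}
    for _, tgt, rel in edges:
        counts[(tgt, rel)] = counts.get((tgt, rel), 0) + 1

    for src, tgt, rel in edges:
        msg = matvec(W_rel[rel], h[src])
        c = counts[(tgt, rel)]
        out[tgt] = [o + m / c for o, m in zip(out[tgt], msg)]

    # ReLU non-linearity.
    return [[max(0.0, v) for v in row] for row in out]


# Toy graph: three residues with 2-dimensional features (made-up values).
identity = [[1.0, 0.0], [0.0, 1.0]]
double = [[2.0, 0.0], [0.0, 2.0]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_edges = [(0, 1, "sequential"), (2, 1, "sequential"), (0, 2, "spatial")]
updated = rgcn_layer(features, toy_edges,
                     {"sequential": identity, "spatial": double}, identity)
```

Stacking several such layers lets information travel across multiple hops of the residue graph, which matches the abstract's description of stacking convolutional layers.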
- North America > United States > Indiana (0.04)
- Europe > Greece (0.04)
- Asia > India > Tamil Nadu > Chennai (0.04)
- (2 more...)
- Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
- Education > Health & Safety > School Nutrition (0.80)
SANSKRITI: A Comprehensive Benchmark for Evaluating Language Models' Knowledge of Indian Culture
Maji, Arijit, Kumar, Raghvendra, Ghosh, Akash, Anushka, Saha, Sriparna
Language Models (LMs) are indispensable tools shaping modern workflows, but their global effectiveness depends on understanding local socio-cultural contexts. To address this, we introduce SANSKRITI, a benchmark designed to evaluate language models' comprehension of India's rich cultural diversity. Comprising 21,853 meticulously curated question-answer pairs spanning 28 states and 8 union territories, SANSKRITI is the largest dataset for testing Indian cultural knowledge. It covers sixteen key attributes of Indian culture: rituals and ceremonies, history, tourism, cuisine, dance and music, costume, language, art, festivals, religion, medicine, transport, sports, nightlife, and personalities, providing a comprehensive representation of India's cultural tapestry. We evaluate leading Large Language Models (LLMs), Indic Language Models (ILMs), and Small Language Models (SLMs) on SANSKRITI, revealing significant disparities in their ability to handle culturally nuanced queries, with many models struggling in region-specific contexts. By offering an extensive, culturally rich, and diverse dataset, SANSKRITI sets a new standard for assessing and improving the cultural understanding of LMs.
Adaptive Data-Resilient Multi-Modal Hierarchical Multi-Label Book Genre Identification
Nareti, Utsav Kumar, Chattopadhyay, Soumi, Mallick, Prolay, Kumar, Suraj, Adak, Chandranath, Daga, Ayush Vikas, Wase, Adarsh, Roy, Arjab
Identifying fine-grained book genres is essential for enhancing user experience through efficient discovery, personalized recommendations, and improved reader engagement. At the same time, it provides publishers and marketers with valuable insights into consumer preferences and emerging market trends. While traditional genre classification methods predominantly rely on textual reviews or content analysis, the integration of additional modalities, such as book covers, blurbs, and metadata, offers richer contextual cues. However, the effectiveness of such multi-modal systems is often hindered by incomplete, noisy, or missing data across modalities. To address this, we propose IMAGINE (Intelligent Multi-modal Adaptive Genre Identification NEtwork), a framework designed to leverage multi-modal data while remaining robust to missing or unreliable information. IMAGINE learns modality-specific feature representations and adaptively prioritizes the most informative sources available at inference time. It further employs a hierarchical classification strategy, grounded in a curated taxonomy of book genres, to capture inter-genre relationships and support multi-label assignments reflective of real-world literary diversity. A key strength of IMAGINE is its adaptability: it maintains high predictive performance even when one modality, such as text or image, is unavailable. We also curated a large-scale hierarchical dataset that structures book genres into multiple levels of granularity, allowing for a more comprehensive evaluation. Experimental results demonstrate that IMAGINE outperformed strong baselines in various settings, with significant gains in scenarios involving incomplete modality-specific data.
On Non-interactive Evaluation of Animal Communication Translators
Paradise, Orr, Gruber, David F., Kalai, Adam Tauman
If you had an AI Whale-to-English translator, how could you validate whether or not it is working? Does one need to interact with the animals or rely on grounded observations such as temperature? We provide theoretical and proof-of-concept experimental evidence suggesting that interaction and even observations may not be necessary for sufficiently complex languages. One may be able to evaluate translators solely by their English outputs, offering potential advantages in terms of safety, ethics, and cost. This is an instance of machine translation quality evaluation (MTQE) without any reference translations available. A key challenge is identifying "hallucinations," false translations which may appear fluent and plausible. We propose using segment-by-segment translation together with the classic NLP shuffle test to evaluate translators. The idea is to translate animal communication, turn by turn, and evaluate how often the resulting translations make more sense in order than permuted. Proof-of-concept experiments on data-scarce human languages and constructed languages demonstrate the potential utility of this evaluation methodology. These human-language experiments serve solely to validate our reference-free metric under data scarcity. It is found to correlate highly with a standard evaluation based on reference translations, which are available in our experiments. We also perform a theoretical analysis suggesting that interaction may be neither necessary nor efficient in the early stages of learning to translate.
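The shuffle test described above can be sketched in a few lines: score the translated segments in their original order, score random permutations of the same segments, and report how often the original order wins. This is only an illustration of the idea, not the authors' evaluation code. The `coherence` argument is a hypothetical stand-in for whatever scorer is used in practice (for example, a language model's likelihood of the text), and `toy_coherence` exists purely to make the example runnable.

```python
import random
from typing import Callable, Sequence


def shuffle_test_score(segments: Sequence[str],
                       coherence: Callable[[Sequence[str]], float],
                       n_permutations: int = 100,
                       seed: int = 0) -> float:
    """Fraction of random permutations that the original order beats.

    Ties (including a permutation identical to the original) count as half
    a win, so a scorer that ignores order yields exactly 0.5.
    """
    rng = random.Random(seed)
    original_score = coherence(segments)
    wins = 0.0
    for _ in range(n_permutations):
        permuted = list(segments)
        rng.shuffle(permuted)
        permuted_score = coherence(permuted)
        if original_score > permuted_score:
            wins += 1.0
        elif original_score == permuted_score:
            wins += 0.5
    return wins / n_permutations


def toy_coherence(segs: Sequence[str]) -> float:
    """Hypothetical scorer: counts adjacent segments where one ends with
    the word the next one starts with."""
    return float(sum(1 for a, b in zip(segs, segs[1:])
                     if a.split()[-1] == b.split()[0]))


# Made-up "translated turns" that chain together only in the original order.
turns = ["the whale dives", "dives below ice", "ice cracks loudly",
         "loudly echoing calls", "calls reach pod"]
score = shuffle_test_score(turns, toy_coherence)
```

A score near 1.0 suggests the translations carry order-dependent meaning, while a score near 0.5 suggests the output is no more coherent in order than shuffled.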
- Europe > Austria > Vienna (0.14)
- Asia > India > Bihar > Patna (0.04)
- North America > United States > Oregon > Multnomah County > Portland (0.04)
- (18 more...)
DeHate: A Stable Diffusion-based Multimodal Approach to Mitigate Hate Speech in Images
Dalal, Dwip, Vashishtha, Gautam, Rani, Anku, Reganti, Aishwarya, Patwa, Parth, Sarique, Mohd, Gupta, Chandan, Nath, Keshav, Reddy, Viswanatha, Jain, Vinija, Chadha, Aman, Das, Amitava, Sheth, Amit, Ekbal, Asif
The rise in harmful online content not only distorts public discourse but also poses significant challenges to maintaining a healthy digital environment. In response to this, we introduce a multimodal dataset uniquely crafted for identifying hate in digital content. Central to our methodology is the innovative application of watermarked, stability-enhanced, stable diffusion techniques combined with the Digital Attention Analysis Module (DAAM). This combination is instrumental in pinpointing the hateful elements within images, generating detailed hate attention maps that are then used to blur those regions and so remove the hateful content from the image. We release this dataset as part of the DeHate shared task, which this paper also describes. Furthermore, we present DeHater, a vision-language model designed for multimodal dehatification tasks. Our approach sets a new standard in AI-driven image hate detection given textual prompts, contributing to the development of more ethical AI applications in social media.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > South Carolina (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- (5 more...)
- Research Report (1.00)
- Overview > Innovation (0.48)
- Information Technology (0.68)
- Health & Medicine > Therapeutic Area (0.46)
PGMEL: Policy Gradient-based Generative Adversarial Network for Multimodal Entity Linking
Pooja, KM, Long, Cheng, Sun, Aixin
The task of entity linking, which involves associating mentions with their respective entities in a knowledge graph, has received significant attention due to its numerous potential applications. Recently, various multimodal entity linking (MEL) techniques have been proposed, targeted to learn comprehensive embeddings by leveraging both text and vision modalities. The selection of high-quality negative samples can potentially play a crucial role in metric/representation learning. However, to the best of our knowledge, this possibility remains unexplored in existing literature within the framework of MEL. To fill this gap, we address the multimodal entity linking problem in a generative adversarial setting where the generator is responsible for generating high-quality negative samples, and the discriminator is assigned the responsibility for the metric learning tasks. Since the generator is involved in generating samples, which is a discrete process, we optimize it using policy gradient techniques and propose a policy gradient-based generative adversarial network for multimodal entity linking (PGMEL). Experimental results based on Wiki-MEL, Richpedia-MEL and WikiDiverse datasets demonstrate that PGMEL learns meaningful representation by selecting challenging negative samples and outperforms state-of-the-art methods.
The last few decades have seen unprecedented growth in data availability. However, the increasing data availability quickly becomes a liability rather than an asset due to the increased gap between data and information. Thus, information extraction (IE) techniques to retrieve knowledge/information from a large amount of data have received considerable attention recently. A knowledge graph (KG) is a structured information database that allows storing extracted information from a large amount of data for retrieval or reasoning at a later stage. Furthermore, the recent developments in IE techniques allow the automatic creation of large KGs with millions of entries from knowledge bases such as Wikipedia, DBpedia, Freebase, and YAGO [1]. Automated KG construction is a complex task that involves various intricate subtasks, including (i) named entity recognition to identify and categorize named entities, like a person or geographic locations, etc., in text, (ii) co-reference resolution to group references of the same entity, (iii) relation extraction to establish relationships between the entities, and (iv) entity linking [2], [3].
KM Pooja is with the Department of Information Technology, Indian Institute of Information Technology, Allahabad India 211012.
Cheng Long and Aixin Sun are with the School of Computer Science and Engineering, Nanyang Technological University, 50 Nanyang Ave, Singapore 639798.
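The abstract above notes that sampling a negative example is a discrete step, which is why the generator is trained with policy gradients rather than ordinary backpropagation. The sketch below shows a generic REINFORCE-style update for a softmax policy over a small candidate set. It is not the PGMEL implementation: the candidate set, `toy_rewards` (standing in for a discriminator's feedback), the learning rate, and the baseline are all hypothetical.

```python
import math
import random
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_update(logits: List[float], action: int,
                           advantage: float, lr: float) -> List[float]:
    """REINFORCE step for a softmax policy.

    The gradient of log pi(action) with respect to logit i is
    (1 if i == action else 0) - pi(i), scaled here by the advantage.
    """
    probs = softmax(logits)
    return [l + lr * advantage * ((1.0 if i == action else 0.0) - p)
            for i, (l, p) in enumerate(zip(logits, probs))]


def sample_negative(logits: List[float], rng: random.Random) -> int:
    """Sample the index of a candidate negative entity from the policy."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Toy training loop: rewards are made-up stand-ins for discriminator feedback,
# where a higher reward means a harder (more useful) negative sample.
toy_rewards = [0.1, 0.9, 0.2]
baseline = sum(toy_rewards) / len(toy_rewards)  # simple variance-reduction baseline
rng = random.Random(0)
generator_logits = [0.0, 0.0, 0.0]
for _ in range(200):
    chosen = sample_negative(generator_logits, rng)
    generator_logits = policy_gradient_update(
        generator_logits, chosen, toy_rewards[chosen] - baseline, lr=0.1)
```

In expectation the update shifts probability toward candidates with above-baseline reward, so the generator gradually favours harder negatives.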
Let's Play Across Cultures: A Large Multilingual, Multicultural Benchmark for Assessing Language Models' Understanding of Sports
Singh, Punit Kumar, Kumar, Nishant, Ghosh, Akash, Pasad, Kunal, Soni, Khushi, Jaishwal, Manisha, Saha, Sriparna, Alfarozi, Syukron Abu Ishaq, Abagissa, Asres Temam, Pasupa, Kitsuchart, Yang, Haiqin, Moreno, Jose G
Language Models (LMs) are primarily evaluated on globally popular sports, often overlooking regional and indigenous sporting traditions. To address this gap, we introduce CultSportQA, a benchmark designed to assess LMs' understanding of traditional sports across 60 countries and 6 continents, encompassing four distinct cultural categories. The dataset features 33,000 multiple-choice questions (MCQs) across text and image modalities, each of which is categorized into three key types: history-based, rule-based, and scenario-based. To evaluate model performance, we employ zero-shot, few-shot, and chain-of-thought (CoT) prompting across a diverse set of Large Language Models (LLMs), Small Language Models (SLMs), and Multimodal Large Language Models (MLMs). By providing a comprehensive multilingual and multicultural sports benchmark, CultSportQA establishes a new standard for assessing AI's ability to understand and reason about traditional sports.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > Canada > Ontario > Toronto (0.14)
- Europe > Italy (0.05)
- (15 more...)
- Research Report (0.50)
- Overview (0.46)
- Questionnaire & Opinion Survey (0.34)
- Leisure & Entertainment > Sports (0.67)
- Education (0.66)
- Leisure & Entertainment > Games (0.46)