Mishra, Abhijit
ReVision: A Dataset and Baseline VLM for Privacy-Preserving Task-Oriented Visual Instruction Rewriting
Mishra, Abhijit, Noh, Richard, Fu, Hsiang, Li, Mingda, Kim, Minji
Efficient and privacy-preserving multimodal interaction is essential as AR, VR, and modern smartphones with powerful cameras become primary interfaces for human-computer communication. The powerful large vision-language models (VLMs) that enable such interaction often rely on cloud-based processing, raising significant concerns about (1) visual privacy, since sensitive vision data must be transmitted to servers, and (2) limited real-time, on-device usability. This paper explores Visual Instruction Rewriting, a novel approach that transforms multimodal instructions into text-only commands, allowing a lightweight on-device instruction-rewriter VLM (250M parameters) to integrate seamlessly with existing conversational AI systems while keeping vision data private. To achieve this, we present a dataset of over 39,000 examples across 14 domains and develop a compact VLM, pretrained on image-captioning datasets and fine-tuned for instruction rewriting. Experimental results, evaluated through NLG metrics such as BLEU, METEOR, and ROUGE, along with semantic parsing analysis, demonstrate that even a quantized version of the model (<500MB storage footprint) achieves effective instruction rewriting, enabling privacy-focused multimodal AI applications.
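To make the rewriting idea concrete, here is a minimal, hypothetical sketch of the on-device interface: deictic references in the spoken instruction are grounded in entities detected in the image, and only the resulting text-only command leaves the device. The detector and string-level rewriter below are toy stand-ins, not the paper's 250M-parameter VLM.

```python
# Toy sketch of the Visual Instruction Rewriting interface (Python 3.9+).
# detect_entities and rewrite_instruction are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class MultimodalRequest:
    image_path: str   # captured by the AR/VR device camera
    utterance: str    # spoken instruction with deictic references

def detect_entities(image_path: str) -> list[str]:
    # Placeholder for an on-device vision model; returns object labels.
    return ["margherita pizza"]

def rewrite_instruction(req: MultimodalRequest) -> str:
    # A real system would run a compact VLM here; this toy version just
    # grounds deictic phrases in the top detected entity label, yielding
    # a self-contained, text-only command.
    entity = detect_entities(req.image_path)[0]
    text = req.utterance
    for deictic in ("these", "this", "that", "it"):
        text = text.replace(deictic, entity)
    return text

req = MultimodalRequest("menu_photo.jpg", "order two of these")
command = rewrite_instruction(req)
print(command)  # -> "order two of margherita pizza"
# Only `command` (plain text) is sent onward; the image never leaves the device.
```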
A Survey on Bridging EEG Signals and Generative AI: From Image and Text to Beyond
Shukla, Shreya, Torres, Jose, Mishra, Abhijit, Gwizdka, Jacek, Roychowdhury, Shounak
The integration of Brain-Computer Interfaces (BCIs) and Generative Artificial Intelligence (GenAI) has opened new frontiers in brain signal decoding, enabling assistive communication, neural representation learning, and multimodal integration. BCIs, particularly those leveraging Electroencephalography (EEG), provide a non-invasive means of translating neural activity into meaningful outputs. Recent advances in deep learning, including Generative Adversarial Networks (GANs) and Transformer-based Large Language Models (LLMs), have significantly improved EEG-based generation of images, text, and speech. This paper provides a literature review of the state of the art in EEG-based multimodal generation, focusing on (i) EEG-to-image generation through GANs, Variational Autoencoders (VAEs), and Diffusion Models, and (ii) EEG-to-text generation leveraging Transformer-based language models and contrastive learning methods. Additionally, we discuss the emerging domain of EEG-to-speech synthesis, an evolving multimodal frontier. We highlight key datasets, use cases, challenges, and EEG feature-encoding methods that underpin generative approaches. By providing a structured overview of EEG-based generative AI, this survey aims to equip researchers and practitioners with insights to advance neural decoding, enhance assistive technologies, and expand the frontiers of brain-computer interaction.
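The contrastive learning methods surveyed here typically align an EEG encoder with a pretrained image (or text) embedding space. The sketch below illustrates that idea with a CLIP-style symmetric InfoNCE loss; the encoder architecture, tensor shapes, and temperature are illustrative assumptions, not any specific surveyed system.

```python
# Illustrative contrastive alignment of EEG and image embeddings (PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class EEGEncoder(nn.Module):
    def __init__(self, channels=128, dim=256):
        super().__init__()
        self.rnn = nn.GRU(channels, dim, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):              # x: (batch, time, channels)
        _, h = self.rnn(x)             # h: (num_layers, batch, dim)
        return F.normalize(self.proj(h[-1]), dim=-1)

def contrastive_loss(eeg_emb, img_emb, temperature=0.07):
    # Symmetric InfoNCE: matched EEG/image pairs share the same index.
    logits = eeg_emb @ img_emb.t() / temperature
    labels = torch.arange(len(logits))
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

encoder = EEGEncoder()
eeg = torch.randn(8, 440, 128)                        # 8 EEG trials
img_emb = F.normalize(torch.randn(8, 256), dim=-1)    # e.g. CLIP image features
loss = contrastive_loss(encoder(eeg), img_emb)
loss.backward()                                       # trains the EEG encoder
```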
ETF: An Entity Tracing Framework for Hallucination Detection in Code Summaries
Maharaj, Kishan, Munigala, Vitobha, Tamilselvam, Srikanth G., Kumar, Prince, Sen, Sayandeep, Kodeswaran, Palani, Mishra, Abhijit, Bhattacharyya, Pushpak
Recent advancements in large language models (LLMs) have significantly enhanced their ability to understand both natural language and code, driving their use in tasks like natural language-to-code (NL2Code) and code summarization. However, LLMs are prone to hallucination: outputs that stray from intended meanings. Detecting hallucinations in code summarization is especially difficult due to the complex interplay between programming and natural languages. We introduce a first-of-its-kind dataset of ~10K samples, curated specifically for hallucination detection in code summarization. We further propose a novel Entity Tracing Framework (ETF) that (a) utilizes static program analysis to identify code entities from the program and (b) uses LLMs to map and verify these entities and their intents within generated code summaries. Our experimental analysis demonstrates the effectiveness of the framework, leading to an F1 score of 0.73. This approach provides an interpretable method for detecting hallucinations by grounding entities, allowing us to evaluate summary accuracy.
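A rough sketch of step (a), the static-analysis half of the framework: code entities (function names, parameters, callees) are extracted with Python's ast module and matched against the summary, leaving unmatched claims for an LLM verifier to judge. The example program and the word-level matching heuristic are illustrative, not the paper's implementation.

```python
# Extract code entities via static analysis, then ground summary words.
import ast

code = '''
def transfer(account, amount):
    validate(account)
    account.balance -= amount
'''

class EntityCollector(ast.NodeVisitor):
    def __init__(self):
        self.entities = set()

    def visit_FunctionDef(self, node):
        self.entities.add(node.name)                      # function name
        self.entities.update(a.arg for a in node.args.args)  # parameters
        self.generic_visit(node)

    def visit_Call(self, node):
        if isinstance(node.func, ast.Name):
            self.entities.add(node.func.id)               # called functions
        self.generic_visit(node)

collector = EntityCollector()
collector.visit(ast.parse(code))
print(collector.entities)   # {'transfer', 'account', 'amount', 'validate'}

summary = "transfer validates the account and logs the transaction"
grounded = {w for w in summary.split() if w in collector.entities}
print(grounded)             # {'transfer', 'account'}
# "logs the transaction" matches no extracted entity; in ETF an LLM would
# be asked to verify that claim against the code (here it is unsupported).
```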
Thought2Text: Text Generation from EEG Signal using Large Language Models (LLMs)
Mishra, Abhijit, Shukla, Shreya, Torres, Jose, Gwizdka, Jacek, Roychowdhury, Shounak
Decoding and expressing brain activity in a comprehensible form is a challenging frontier in AI. This paper presents Thought2Text, which uses instruction-tuned Large Language Models (LLMs) fine-tuned with EEG data to achieve this goal. The approach involves three stages: (1) training an EEG encoder for visual feature extraction, (2) fine-tuning LLMs on image and text data, enabling multimodal description generation, and (3) further fine-tuning on EEG embeddings to generate text directly from EEG during inference. Experiments on a public EEG dataset collected from six subjects viewing image stimuli demonstrate the efficacy of multimodal LLMs (LLaMa-v3, Mistral-v0.3, Qwen2.5), validated using traditional language generation evaluation metrics, GPT-4-based assessments, and evaluations by human experts. This approach marks a significant advancement towards portable, low-cost "thoughts-to-text" technology with potential applications in both neuroscience and natural language processing (NLP).
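Stage (3) hinges on feeding EEG-derived vectors to the LLM in place of, or alongside, token embeddings. One common way to do this, sketched below under assumed dimensions, is to project each EEG embedding to a few soft-prompt vectors in the LLM's embedding space; the projection design here is an illustration, not the paper's exact module.

```python
# Project an EEG embedding into an LLM's token-embedding space (PyTorch).
import torch
import torch.nn as nn

eeg_dim, llm_dim, n_prefix = 256, 4096, 4   # assumed sizes for illustration

class EEGToPrefix(nn.Module):
    def __init__(self):
        super().__init__()
        # Map one EEG embedding to n_prefix pseudo-token embeddings.
        self.proj = nn.Linear(eeg_dim, n_prefix * llm_dim)

    def forward(self, eeg_emb):                   # (batch, eeg_dim)
        out = self.proj(eeg_emb)
        return out.view(-1, n_prefix, llm_dim)    # (batch, n_prefix, llm_dim)

projector = EEGToPrefix()
eeg_emb = torch.randn(2, eeg_dim)                 # output of the EEG encoder
prefix = projector(eeg_emb)                       # soft-prompt vectors
token_emb = torch.randn(2, 10, llm_dim)           # embedded text prompt
inputs_embeds = torch.cat([prefix, token_emb], dim=1)
# With Hugging Face models, this can be fed as
# model(inputs_embeds=inputs_embeds, ...) during fine-tuning and inference.
print(inputs_embeds.shape)                        # torch.Size([2, 14, 4096])
```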
SentinelLMs: Encrypted Input Adaptation and Fine-tuning of Language Models for Private and Secure Inference
Mishra, Abhijit, Li, Mingda, Deo, Soham
This paper addresses the privacy and security concerns associated with deep neural language models, which serve as crucial components in various modern AI-based applications. These models are often used after being pre-trained and fine-tuned for specific tasks, with deployment on servers accessed through the internet. However, this introduces two fundamental risks: (a) the transmission of user inputs to the server via the network gives rise to interception vulnerabilities, and (b) privacy concerns emerge as organizations that deploy such models store user data with restricted context. To address this, we propose a novel method to adapt and fine-tune transformer-based language models on passkey-encrypted user-specific text. The original pre-trained language model first undergoes a quick adaptation (without any further pre-training) with a series of irreversible transformations applied to the tokenizer and token embeddings. This enables the model to perform inference on encrypted inputs while preventing reverse engineering of text from model parameters and intermediate outputs. After adaptation, models are fine-tuned on encrypted versions of existing training datasets. Experimental evaluation employing adapted versions of renowned models (e.g., BERT, RoBERTa) across established benchmark English and multilingual datasets for text classification and sequence labeling shows that encrypted models achieve performance parity with their original counterparts. This serves to safeguard performance, privacy, and security cohesively.
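One simple instance of an irreversible, passkey-keyed transformation of the tokenizer and embeddings is a secret permutation of the vocabulary, sketched below with random stand-in weights. The paper describes a series of transformations; this only demonstrates how a keyed remapping can preserve the model's computation while making raw token streams unintelligible without the passkey.

```python
# Passkey-derived vocabulary permutation with consistent embedding remap.
import hashlib
import numpy as np

vocab_size, dim = 50000, 768
embeddings = np.random.randn(vocab_size, dim)   # stand-in for real weights

def keyed_permutation(passkey: str, n: int) -> np.ndarray:
    # Seed a PRNG from the passkey so the permutation is reproducible
    # for the key holder but unguessable without the key.
    seed = int.from_bytes(hashlib.sha256(passkey.encode()).digest()[:8], "big")
    return np.random.default_rng(seed).permutation(n)

perm = keyed_permutation("user-secret-passkey", vocab_size)
inv = np.argsort(perm)                 # maps original id -> encrypted id

def encrypt_ids(token_ids):
    return [int(inv[t]) for t in token_ids]

adapted_embeddings = embeddings[perm]  # row i now holds original row perm[i]

# Consistency check: encrypted ids index the permuted table to exactly the
# vectors the original ids selected in the original table, so downstream
# layers compute identical representations.
ids = [5, 42, 1337]
assert np.allclose(adapted_embeddings[encrypt_ids(ids)], embeddings[ids])
```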
Can Open Domain Question Answering Systems Answer Visual Knowledge Questions?
Zhang, Jiawen, Mishra, Abhijit, S, Avinesh P. V., Patwardhan, Siddharth, Agarwal, Sachin
The task of Outside Knowledge Visual Question Answering (OKVQA) requires an automatic system to answer natural language questions about images using external knowledge. We observe that many visual questions, which contain deictic referential phrases referring to entities in the image, can be rewritten as "non-grounded" questions that existing text-based question answering systems can answer. This allows the reuse of existing text-based Open Domain Question Answering (QA) systems for visual question answering. In this work, we propose a potentially data-efficient approach that reuses existing systems for (a) image analysis, (b) question rewriting, and (c) text-based question answering to answer such visual questions. Given an image and a question pertaining to that image (a visual question), we first extract the entities present in the image using pre-trained object and scene classifiers. Using these detected entities, the visual questions can be rewritten so as to be answerable by open domain QA systems. We explore two rewriting strategies: (1) an unsupervised method using BERT for masking and rewriting, and (2) a weakly supervised approach that combines adaptive rewriting and reinforcement learning techniques to use the implicit feedback from the QA system. We test our strategies on the publicly available OKVQA dataset and obtain competitive performance with state-of-the-art models while using only 10% of the training data.
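Strategy (1) can be illustrated with an off-the-shelf fill-mask model: mask the deictic phrase, restrict BERT's candidate fillers to the detected entity labels, and keep the best-scoring one. The question, entity labels, and phrase matching below are toy assumptions; the snippet needs the transformers library and downloads bert-base-uncased on first use.

```python
# Unsupervised deictic-question rewriting via BERT fill-mask scoring.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

visual_question = "When was this invented?"
detected = ["television", "radio", "lamp"]     # from object/scene classifiers
masked = visual_question.replace("this", "the [MASK]")

# Restrict BERT's predictions to the detected entity labels and keep the
# highest-scoring one as the grounding for the deictic phrase.
candidates = unmasker(masked, targets=detected)
best = max(candidates, key=lambda c: c["score"])["token_str"]
rewritten = visual_question.replace("this", f"the {best}")
print(rewritten)   # e.g. "When was the television invented?"
# The rewritten question is now answerable by a text-only open-domain QA system.
```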
Template Controllable Keywords-to-Text Generation
Mishra, Abhijit, Chowdhury, Md Faisal Mahbub, Manohar, Sagar, Gutfreund, Dan, Sankaranarayanan, Karthik
This paper proposes a novel neural model for the understudied task of generating text from keywords. The model takes as input a set of unordered keywords and part-of-speech (POS) based template instructions, which makes it ideal for surface realization in any NLG setup. The framework is based on the encode-attend-decode paradigm, where keywords and templates are encoded first, and the decoder judiciously attends over the contexts derived from the encoded keywords and templates to generate the sentences. Training exploits weak supervision: the model trains on a large amount of weakly labeled data, with keywords and POS-based templates prepared through completely automatic means. Qualitative and quantitative performance analyses on publicly available test data in various domains reveal our system's superiority over baselines built using state-of-the-art neural machine translation and controllable transfer techniques. Our approach is indifferent to the order of input keywords.
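Because the weak supervision comes entirely from raw text, training pairs can be manufactured automatically: given any sentence, its content words become the unordered keyword set and its POS sequence becomes the template. The heuristics below (which tags count as content words, how keywords are normalized) are assumptions for illustration; the snippet requires nltk and its tokenizer/tagger models.

```python
# Automatic (keywords, POS template) -> sentence training-pair creation.
import nltk

for pkg in ("punkt", "averaged_perceptron_tagger"):
    nltk.download(pkg, quiet=True)

CONTENT_TAGS = ("NN", "VB", "JJ", "RB")   # nouns, verbs, adjectives, adverbs

def make_training_pair(sentence: str):
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    template = [tag for _, tag in tagged]                # POS template
    keywords = sorted({w.lower() for w, tag in tagged    # unordered keywords
                       if tag.startswith(CONTENT_TAGS)})
    return keywords, template, sentence

kw, tpl, target = make_training_pair("The quick brown fox jumps over the lazy dog")
print(kw)    # ['brown', 'dog', 'fox', 'jumps', 'lazy', 'quick']
print(tpl)   # ['DT', 'JJ', 'JJ', 'NN', 'VBZ', 'IN', 'DT', 'JJ', 'NN']
```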
Modeling Topical Coherence in Discourse without Supervision
Shrivastava, Disha, Mishra, Abhijit, Sankaranarayanan, Karthik
Coherence of text is an important attribute to be measured for both manually and automatically generated discourse, but well-defined quantitative metrics for it are still elusive. In this paper, we present a metric for scoring topical coherence of an input paragraph on a real-valued scale by analyzing its underlying topical structure. We first extract all possible topics that the sentences of a paragraph of text are related to. Coherence of this text is then measured by computing: (a) the degree of uncertainty of the topics with respect to the paragraph, and (b) the relatedness between these topics. All components of our modular framework rely only on unlabeled data and WordNet, thus making it completely unsupervised, which is an important feature for general-purpose usage of any metric. Experiments are conducted on two datasets: a publicly available dataset for essay grading (representing human discourse), and a synthetic dataset constructed by mixing content from multiple paragraphs covering diverse topics. Our evaluation shows that the measured coherence scores are positively correlated with the ground truth for both datasets. Further validation of our coherence scores is provided by human evaluation on the synthetic data, showing a significant agreement of 79.3%.
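Components (a) and (b) can be made concrete as, for example, the entropy of a topic distribution and the mean pairwise WordNet similarity between topic words. The sketch below uses exactly those illustrative choices, not the paper's precise formulation; it requires nltk's WordNet corpus.

```python
# Illustrative topic-uncertainty and topic-relatedness computations.
import itertools
import math

import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

def topic_uncertainty(topic_probs):
    # Entropy of the paragraph's topic distribution: higher entropy means
    # a more diffuse, hence less coherent, topical focus.
    return -sum(p * math.log(p) for p in topic_probs if p > 0)

def topic_relatedness(topic_words):
    # Mean pairwise path similarity between first WordNet senses.
    sims = []
    for a, b in itertools.combinations(topic_words, 2):
        sa, sb = wn.synsets(a), wn.synsets(b)
        if sa and sb:
            sims.append(sa[0].path_similarity(sb[0]) or 0.0)
    return sum(sims) / len(sims) if sims else 0.0

# Focused paragraph: few dominant, related topics.
print(topic_uncertainty([0.7, 0.2, 0.1]))           # low entropy
print(topic_relatedness(["dog", "cat", "pet"]))     # high relatedness
# Mixed paragraph: flat topic distribution, unrelated topics.
print(topic_uncertainty([0.25, 0.25, 0.25, 0.25]))  # high entropy
print(topic_relatedness(["dog", "finance", "galaxy"]))
```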
Cognition-Cognizant Sentiment Analysis With Multitask Subjectivity Summarization Based on Annotators' Gaze Behavior
Mishra, Abhijit (IBM Research AI) | Tamilselvam, Srikanth (IBM Research AI) | Dasgupta, Riddhiman (IBM Research AI) | Nagar, Seema (IBM Research AI) | Dey, Kuntal (IBM Research AI)
For document-level sentiment analysis (SA), Subjectivity Extraction, i.e., extracting the relevant subjective portions of the text that cover the overall sentiment expressed in the document, is an important step. Subjectivity Extraction, however, is a hard problem for systems, as it demands a great deal of world knowledge and reasoning. Humans, on the other hand, are good at extracting relevant subjective summaries from an opinionated document (say, a movie review) while inferring the sentiment expressed in it. This capability is manifested in their eye-movement behavior while reading: words pertaining to the subjective summary of the text attract far more attention in the form of gaze fixations and/or saccadic patterns. We propose a multi-task deep neural framework for document-level sentiment analysis that learns to predict the overall sentiment expressed in the given input document by simultaneously learning to predict human gaze behavior and auxiliary linguistic properties such as the part-of-speech and syntactic properties of words in the document. For this, we propose a multi-task learning algorithm based on a multi-layer shared LSTM augmented with task-specific classifiers. With this composite multi-task network, we obtain performance competitive with or better than state-of-the-art approaches in SA. Moreover, the availability of gaze predictions as an auxiliary output helps interpret the system better; for instance, gaze predictions reveal that the system indeed performs subjectivity extraction better, which accounts for the improvement in document-level sentiment analysis performance.
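The composite network can be pictured as one shared recurrent trunk with task-specific heads. The PyTorch sketch below, with assumed layer sizes, tag-set size, and loss weighting, shows the shape of such a model: a document-level sentiment head alongside per-token gaze and POS heads trained as auxiliary tasks.

```python
# Shared-LSTM multi-task model: sentiment + gaze + POS heads (PyTorch).
import torch
import torch.nn as nn

class MultitaskGazeSA(nn.Module):
    def __init__(self, vocab=10000, emb=100, hid=128, n_pos=17, n_sent=2):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.shared = nn.LSTM(emb, hid, batch_first=True, bidirectional=True)
        self.sentiment = nn.Linear(2 * hid, n_sent)  # document-level class
        self.gaze = nn.Linear(2 * hid, 1)            # per-token fixation time
        self.pos = nn.Linear(2 * hid, n_pos)         # per-token POS tag

    def forward(self, tokens):                       # (batch, seq)
        h, _ = self.shared(self.emb(tokens))         # (batch, seq, 2*hid)
        return (self.sentiment(h.mean(dim=1)),       # pooled document state
                self.gaze(h).squeeze(-1),
                self.pos(h))

model = MultitaskGazeSA()
tokens = torch.randint(0, 10000, (4, 25))
sent_logits, gaze_pred, pos_logits = model(tokens)
print(sent_logits.shape, gaze_pred.shape, pos_logits.shape)
# Training would minimize a weighted sum, e.g. L = L_sent + a*L_gaze + b*L_pos;
# gaze is only an auxiliary target, so no eye-tracker is needed at test time.
```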
Scanpath Complexity: Modeling Reading Effort Using Gaze Information
Mishra, Abhijit (Indian Institute of Technology Bombay) | Kanojia, Diptesh (IITB-Monash Research Academy, India) | Nagar, Seema (IBM Research, India) | Dey, Kuntal (IBM Research, India) | Bhattacharyya, Pushpak (Indian Institute of Technology Bombay)
Measuring reading effort is useful for practical purposes such as designing learning material and personalizing text-comprehension environments. We propose a quantification of reading effort by measuring the complexity of eye-movement patterns of readers, a measure we call Scanpath Complexity. Scanpath complexity is modeled as a function of various properties of gaze fixations and saccades, the basic parameters of eye-movement behavior. We demonstrate the effectiveness of our scanpath complexity measure by showing that its correlation with different measures of lexical and syntactic complexity, as well as with standard readability metrics, is better than that of popular baseline measures based on fixations alone.
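As a toy illustration of building such a function from fixations and saccades, the snippet below derives a few standard scanpath features (fixation-duration statistics, saccade amplitudes, regression rate) and combines them linearly. Both the feature set and the weights are assumptions; the paper models the actual function rather than fixing it by hand.

```python
# Toy scanpath complexity from fixation and saccade properties.
import numpy as np

# One scanpath: (word position, fixation duration in ms) per fixation.
scanpath = np.array([(1, 210), (2, 180), (4, 250), (3, 400), (5, 190)])
positions, durations = scanpath[:, 0], scanpath[:, 1]

saccade_lengths = np.abs(np.diff(positions))     # amplitude of each jump
regressions = np.sum(np.diff(positions) < 0)     # backward (re-reading) moves

features = np.array([
    durations.mean() / 1000.0,           # average fixation duration (s)
    durations.std() / 1000.0,            # variability of fixation durations
    saccade_lengths.mean(),              # mean saccade amplitude (words)
    regressions / len(saccade_lengths),  # regression rate
])
weights = np.array([1.0, 0.5, 0.3, 2.0])         # illustrative weights only
print("scanpath complexity:", float(features @ weights))
```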