instructional prompt
Speaking at the Right Level: Literacy-Controlled Counterspeech Generation with RAG-RL
Song, Xiaoying, Anik, Anirban Saha, Barua, Dibakar, Luo, Pengcheng, Ding, Junhua, Hong, Lingzi
Health misinformation spreading online poses a significant threat to public health. Researchers have explored methods for automatically generating counterspeech to health misinformation as a mitigation strategy. Existing approaches often produce uniform responses, ignoring that the health literacy level of the audience could affect the accessibility and effectiveness of counterspeech. We propose a Controlled-Literacy framework using retrieval-augmented generation (RAG) with reinforcement learning (RL) to generate tailored counterspeech adapted to different health literacy levels. In particular, we retrieve knowledge aligned with specific health literacy levels, enabling accessible and factual information to support generation. We design a reward function incorporating subjective user preferences and objective readability-based rewards to optimize counterspeech to the target health literacy level. Experiment results show that Controlled-Literacy outperforms baselines by generating more accessible and user-preferred counterspeech. This research contributes to more equitable and impactful public health communication by improving the accessibility and comprehension of counterspeech to health misinformation
A Dynamic Fusion Model for Consistent Crisis Response
Song, Xiaoying, Anik, Anirban Saha, Blanco, Eduardo, Frias-Martinez, Vanessa, Hong, Lingzi
In response to the urgent need for effective communication with crisis-affected populations, automated responses driven by language models have been proposed to assist in crisis communications. A critical yet often overlooked factor is the consistency of response style, which could affect the trust of affected individuals in responders. Despite its importance, few studies have explored methods for maintaining stylistic consistency across generated responses. To address this gap, we propose a novel metric for evaluating style consistency and introduce a fusion-based generation approach grounded in this metric. Our method employs a two-stage process: it first assesses the style of candidate responses and then optimizes and integrates them at the instance level through a fusion process. This enables the generation of high-quality responses while significantly reducing stylistic variation between instances. Experimental results across multiple datasets demonstrate that our approach consistently outperforms baselines in both response quality and stylistic uniformity.
AIP: Subverting Retrieval-Augmented Generation via Adversarial Instructional Prompt
Chaturvedi, Saket S., Bagwe, Gaurav, Zhang, Lan, Yuan, Xiaoyong
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by retrieving relevant documents from external sources to improve factual accuracy and verifiability. However, this reliance introduces new attack surfaces within the retrieval pipeline, beyond the LLM itself. While prior RAG attacks have exposed such vulnerabilities, they largely rely on manipulating user queries, which is often infeasible in practice due to fixed or protected user inputs. This narrow focus overlooks a more realistic and stealthy vector: instructional prompts, which are widely reused, publicly shared, and rarely audited. Their implicit trust makes them a compelling target for adversaries to manipulate RAG behavior covertly. We introduce a novel attack for Adversarial Instructional Prompt (AIP) that exploits adversarial instructional prompts to manipulate RAG outputs by subtly altering retrieval behavior. By shifting the attack surface to the instructional prompts, AIP reveals how trusted yet seemingly benign interface components can be weaponized to degrade system integrity. The attack is crafted to achieve three goals: (1) naturalness, to evade user detection; (2) utility, to encourage use of prompts; and (3) robustness, to remain effective across diverse query variations. We propose a diverse query generation strategy that simulates realistic linguistic variation in user queries, enabling the discovery of prompts that generalize across paraphrases and rephrasings. Building on this, a genetic algorithm-based joint optimization is developed to evolve adversarial prompts by balancing attack success, clean-task utility, and stealthiness. Experimental results show that AIP achieves up to 95.23% ASR while preserving benign functionality. These findings uncover a critical and previously overlooked vulnerability in RAG systems, emphasizing the need to reassess the shared instructional prompts.
Think or Step-by-Step? UnZIPping the Black Box in Zero-Shot Prompts
Sadr, Nikta Gohari, Madhusudan, Sangmitra, Emami, Ali
Zero-shot prompting techniques have significantly improved the performance of Large Language Models (LLMs). However, we lack a clear understanding of why zero-shot prompts are so effective. For example, in the prompt "Let's think step-by-step," is "think" or "step-by-step" more crucial to its success? Existing interpretability methods, such as gradient-based and attention-based approaches, are computationally intensive and restricted to open-source models. We introduce the ZIP score (Zero-shot Importance of Perturbation score), a versatile metric applicable to both open and closed-source models, based on systematic input word perturbations. Our experiments across four recent LLMs, seven widely-used prompts, and several tasks, reveal interesting patterns in word importance. For instance, while both 'step-by-step' and 'think' show high ZIP scores, which one is more influential depends on the model and task. We validate our method using controlled experiments and compare our results with human judgments, finding that proprietary models align more closely with human intuition regarding word significance. These findings enhance our understanding of LLM behavior and contribute to developing more effective zero-shot prompts and improved model analysis.
Red Teaming Language Model Detectors with Language Models
Shi, Zhouxing, Wang, Yihan, Yin, Fan, Chen, Xiangning, Chang, Kai-Wei, Hsieh, Cho-Jui
The prevalence and strong capability of large language models (LLMs) present significant safety and ethical risks if exploited by malicious users. To prevent the potentially deceptive usage of LLMs, recent works have proposed algorithms to detect LLM-generated text and protect LLMs. In this paper, we investigate the robustness and reliability of these LLM detectors under adversarial attacks. We study two types of attack strategies: 1) replacing certain words in an LLM's output with their synonyms given the context; 2) automatically searching for an instructional prompt to alter the writing style of the generation. In both strategies, we leverage an auxiliary LLM to generate the word replacements or the instructional prompt. Different from previous works, we consider a challenging setting where the auxiliary LLM can also be protected by a detector. Experiments reveal that our attacks effectively compromise the performance of all detectors in the study with plausible generations, underscoring the urgent need to improve the robustness of LLM-generated text detection systems.
EDM3: Event Detection as Multi-task Text Generation
Anantheswaran, Ujjwala, Gupta, Himanshu, Parmar, Mihir, Pal, Kuntal Kumar, Baral, Chitta
Event detection refers to identifying event occurrences in a text and comprises of two subtasks; event identification and classification. We present EDM3, a novel approach for Event Detection that formulates three generative tasks: identification, classification, and combined detection. We show that EDM3 helps to learn transferable knowledge that can be leveraged to perform Event Detection and its subtasks concurrently, mitigating the error propagation inherent in pipelined approaches. Unlike previous dataset- or domain-specific approaches, EDM3 utilizes the existing knowledge of language models, allowing it to be trained over any classification schema. We evaluate EDM3 on multiple event detection datasets: RAMS, WikiEvents, MAVEN, and MLEE, showing that EDM3 outperforms 1) single-task performance by 8.4% on average and 2) multi-task performance without instructional prompts by 2.4% on average. We obtain SOTA results on RAMS (71.3% vs. 65.1% F-1) and competitive performance on other datasets. We analyze our approach to demonstrate its efficacy in low-resource and multi-sentence settings. We also show the effectiveness of this approach on non-standard event configurations such as multi-word and multi-class event triggers. Overall, our results show that EDM3 is a promising approach for Event Detection that has the potential for real-world applications.