Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases
Bhardwaj, Rishabh, Poria, Soujanya
–arXiv.org Artificial Intelligence
Red-teaming has been a widely adopted way to evaluate the harmful behavior of Large Language Models (LLMs). It aims to jailbreak a model's safety behavior to make it act as a helpful agent disregarding the harmfulness of the query. Existing methods are primarily based on input text-based red-teaming such as adversarial prompts, low-resource prompts, or contextualized prompts to condition the model in a way to bypass its safety guardrails. An effective jailbreak has the potential to uncover hidden harmful information and biases in the model that are left untreated or newly introduced by its safety training. However, prompt-based attacks fail to provide such a diagnosis owing to their low attack success rate and applicability to specific models. It simply (instruction) tunes the model parameters to break its guardrails that are not deeply rooted in the model's behavior. GPT to the point where it responds with an 88% success rate to harmful queries from two safety benchmark datasets. Large Language Models (LLMs) have shown emerging zero-shot capabilities with an increase in size (Wei et al., 2022; Brown et al., 2020) i.e., beyond a point where quantitative changes lead to qualitative changes in the model. As exciting as it is to observe the significance (utility) of such models to people, an adversary can find these systems highly useful to achieve a malicious goal. Moreover, such systems tend to inherit biases from humans through the datasets used for their construction. Thus, before deploying these systems for wide public use, it is important to make them unharmful and unbiased while maintaining their generic utility.
arXiv.org Artificial Intelligence
Nov-13-2023
- Country:
- North America > United States (0.28)
- Genre:
- Research Report (0.40)
- Industry:
- Government (0.47)
- Health & Medicine (0.46)
- Information Technology > Security & Privacy (0.46)
- Law Enforcement & Public Safety (0.67)
- Technology: