Let the Models Respond: Interpreting Language Model Detoxification Through the Lens of Prompt Dependence
