Exploring the Limits of Domain-Adaptive Training for Detoxifying Large-Scale Language Models
Neural Information Processing Systems
Pre-trained language models (LMs) are known to easily generate toxic language. In this work, we systematically explore domain-adaptive training to reduce the toxicity of language models. We conduct this study along three dimensions: training corpus, model size, and parameter efficiency. For the training corpus, we demonstrate that using self-generated datasets consistently outperforms existing baselines across various model sizes on both automatic and human evaluations, even when using a 3× smaller training corpus. We then comprehensively study detoxifying LMs with parameter sizes ranging from 126M up to 530B (3× larger than GPT-3), a scale that has never been studied before.