Adding Instructions during Pretraining: Effective Way of Controlling Toxicity in Language Models

Shrimai Prabhumoye, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro

Feb-14-2023, arXiv.org Artificial Intelligence

Pretrained large language models have become indispensable for solving various natural language processing (NLP) tasks. However, safely deploying them in real-world applications is challenging because they generate toxic content. To address this challenge, we propose two novel pretraining data augmentation strategies that significantly reduce model toxicity without compromising its utility. Our two strategies are: (1) MEDA, which adds the raw toxicity score as metadata to the pretraining samples, and (2) INST, which adds instructions to those samples indicating their toxicity. Our results indicate that our best performing strategy (INST) substantially reduces the toxicity probability by up to 61% while preserving the accuracy on five benchmark NLP tasks as well as improving AUC scores on four bias detection tasks by 1.3%. We also demonstrate the generalizability of our techniques by scaling the number of training samples and the number of model parameters.

Figure 1: Overview of the proposed approaches and the baseline (BASE). We propose two new data augmentation strategies, MEDA and INST. The text in purple shows the control variables indicating the desired toxicity level of the text. The text in black is the input to the model and the text in green is the generated output using each
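To make the two augmentation strategies concrete, here is a minimal Python sketch of how a pretraining sample might be tagged. It is an illustration only: the abstract does not give the exact metadata format, instruction wording, or any score threshold, so `MEDA_TEMPLATE`, `INST_TOXIC`, `INST_NONTOXIC`, `TOXICITY_THRESHOLD`, and the `Sample` type are all hypothetical, and the toxicity score is assumed to come from some external classifier.

```python
from dataclasses import dataclass

# All templates and the threshold below are assumptions made for illustration;
# the source abstract does not specify the exact strings or cutoff.
TOXICITY_THRESHOLD = 0.5
MEDA_TEMPLATE = "toxicity: {score:.2f} | {text}"
INST_TOXIC = "This is a toxic post. Post: {text}"
INST_NONTOXIC = "This is a non-toxic post. Post: {text}"


@dataclass
class Sample:
    text: str
    toxicity: float  # assumed to be a score in [0, 1] from an external classifier


def augment_meda(sample: Sample) -> str:
    """MEDA-style augmentation: prepend the raw toxicity score as metadata."""
    return MEDA_TEMPLATE.format(score=sample.toxicity, text=sample.text)


def augment_inst(sample: Sample) -> str:
    """INST-style augmentation: prepend an instruction describing the toxicity."""
    if sample.toxicity >= TOXICITY_THRESHOLD:
        template = INST_TOXIC
    else:
        template = INST_NONTOXIC
    return template.format(text=sample.text)


if __name__ == "__main__":
    s = Sample(text="hello world", toxicity=0.03)
    print(augment_meda(s))
    print(augment_inst(s))
```

Under these assumptions, a model trained on such samples could be steered at inference time by prepending the low-toxicity control text (a low score for the MEDA-style format, or the non-toxic instruction for the INST-style format) to the prompt.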

large language model, machine learning, natural language

Country:
- North America > United States
  - New York > New York County > New York City (0.04)
- Europe
  - Italy > Calabria
    - Catanzaro Province > Catanzaro (0.04)
  - Germany > Saarland
    - Saarbrücken (0.04)

Genre:
- Research Report > New Finding (0.34)

Industry:
- Health & Medicine (0.75)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (0.66)
  - Machine Learning > Performance Analysis
    - Accuracy (0.34)
