SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators

Moskovskiy, Daniil, Sushko, Nikita, Pletenev, Sergey, Tutubalina, Elena, Panchenko, Alexander

Feb-10-2025–arXiv.org Artificial Intelligence

Existing approaches to multilingual text detoxification are hampered by the scarcity of parallel multilingual datasets. In this work, we introduce a pipeline for generating multilingual parallel detoxification data. We also introduce SynthDetoxM, a manually collected and synthetically generated multilingual parallel text detoxification dataset comprising 16,000 high-quality detoxification sentence pairs across German, French, Spanish and Russian. The data was sourced from different toxicity evaluation datasets and then rewritten with nine modern open-source LLMs in a few-shot setting. Our experiments demonstrate that models trained on the produced synthetic datasets outperform those trained on the human-annotated MultiParaDetox dataset, even in a data-limited setting. Models trained on SynthDetoxM also outperform all evaluated LLMs in a few-shot setting. We release our dataset and code to support further research in multilingual text detoxification.
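The few-shot rewriting step mentioned in the abstract can be illustrated with a minimal sketch of prompt assembly. This is not the authors' prompt or code: the function name `build_few_shot_prompt`, the instruction wording, and the `Toxic:`/`Neutral:` labels are all assumptions made for illustration, and the example pairs are placeholders.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    examples: List[Tuple[str, str]],
    toxic_text: str,
    language: str,
) -> str:
    """Assemble a few-shot detoxification prompt.

    Hypothetical helper: the instruction wording and the labels
    below are illustrative assumptions, not taken from the paper.
    """
    # Task instruction, parameterised by target language.
    header = (
        f"Rewrite the following {language} text so that it is non-toxic "
        "while preserving its meaning.\n\n"
    )
    # Each demonstration is a (toxic, neutral) pair shown to the model.
    shots = "".join(
        f"Toxic: {src}\nNeutral: {tgt}\n\n" for src, tgt in examples
    )
    # The final slot is left open for the model to complete.
    return f"{header}{shots}Toxic: {toxic_text}\nNeutral:"


if __name__ == "__main__":
    # Placeholder pairs; a real run would use parallel detoxification examples.
    demo = [("<toxic example 1>", "<neutral rewrite 1>")]
    print(build_few_shot_prompt(demo, "<new toxic sentence>", "German"))
```

The resulting string would be sent to whichever LLM is used for annotation; the model's completion after the final `Neutral:` is the candidate detoxified sentence.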

artificial intelligence, large language model, natural language



