FAME-MT Dataset: Formality Awareness Made Easy for Machine Translation Purposes
Wiśniewski, Dawid, Rostek, Zofia, Nowakowski, Artur
arXiv.org Artificial Intelligence
People use language for various purposes. Apart from sharing information, individuals may use it to express emotions or to show respect for another person. In this paper, we focus on the formality level of machine-generated translations and present FAME-MT, a dataset of 11.2 million translations between 15 European source languages and 8 European target languages, classified into formal and informal classes according to the formality of the target sentence. The dataset can be used to fine-tune machine translation models to enforce a given formality level for each of the European target languages considered. We describe the dataset creation procedure, analyze the dataset's quality to show that FAME-MT is a reliable source of language register information, and present a publicly available proof-of-concept machine translation model that uses the dataset to steer the formality level of the translation. Currently, it is the largest dataset of formality annotations, with examples in 112 European language pairs. The dataset is published online: https://github.com/laniqo-public/fame-mt/
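The figure of 112 language pairs is consistent with 15 × 8 = 120 source-target combinations minus 8 same-language pairs. The short sketch below checks that arithmetic under one assumption that the abstract does not state explicitly: every target language is also one of the source languages. The language codes are hypothetical placeholders, since the abstract does not list the languages.

```python
from itertools import product

# Hypothetical placeholder codes; the abstract does not name the languages.
SOURCE_LANGS = [f"lang{i:02d}" for i in range(1, 16)]  # 15 source languages

# Assumption: each of the 8 target languages is also a source language.
TARGET_LANGS = SOURCE_LANGS[:8]


def language_pairs(sources, targets):
    """Return all (source, target) pairs, skipping same-language pairs."""
    return [(s, t) for s, t in product(sources, targets) if s != t]


pairs = language_pairs(SOURCE_LANGS, TARGET_LANGS)
print(len(pairs))  # 15 * 8 - 8 = 112
```

If some target language were not among the sources, the count would be higher than 112, so the reported number suggests the target set is a subset of the source set.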
May-20-2024
- Country:
  - Oceania > Australia
  - North America
    - United States
      - Washington > King County
        - Seattle (0.04)
      - Pennsylvania > Philadelphia County
        - Philadelphia (0.04)
      - New York > New York County
        - New York City (0.04)
      - Louisiana > Orleans Parish
        - New Orleans (0.04)
      - California > San Diego County
        - San Diego (0.04)
    - Canada > Ontario
      - Toronto (0.04)
  - Europe
    - Switzerland (0.04)
    - Portugal > Lisbon
      - Lisbon (0.04)
    - Poland > Greater Poland Province
      - Poznań (0.04)
    - Latvia > Riga Municipality
      - Riga (0.04)
    - France > Provence-Alpes-Côte d'Azur
      - Bouches-du-Rhône > Marseille (0.04)
    - Denmark > Capital Region
      - Copenhagen (0.04)
    - Croatia > Dubrovnik-Neretva County
      - Dubrovnik (0.04)
    - Belgium > Brussels-Capital Region
      - Brussels (0.04)
  - Asia
    - North Korea (0.04)
    - Middle East > Jordan (0.04)
    - China > Hong Kong (0.04)
    - Thailand > Phuket
      - Phuket (0.04)
- Genre:
  - Research Report (1.00)
- Industry:
  - Law (0.46)
- Technology: