Hardmeier, Christian
What Are They Filtering Out? A Survey of Filtering Strategies for Harm Reduction in Pretraining Datasets
Stranisci, Marco Antonio, Hardmeier, Christian
Data filtering strategies are a crucial component in developing safe Large Language Models (LLMs), since they support the removal of harmful content from pretraining datasets. However, there is a lack of research on the actual impact of these strategies on groups vulnerable to discrimination, and their effectiveness has not yet been systematically addressed. In this paper we present a benchmark study of data filtering strategies for harm reduction, aimed at providing a systematic overview of these approaches. We survey 55 technical reports of English LMs and LLMs to identify the filtering strategies in the literature and implement an experimental setting to test their impact on vulnerable groups. Our results show that the positive impact these strategies have in reducing harmful content in documents comes with the side effect of increasing the underrepresentation of groups vulnerable to discrimination in the datasets.
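The survey does not prescribe a single filtering method, but a common family of strategies it covers is word-level blocklist filtering, where documents are dropped when they match terms from a curated list. The following is a minimal sketch of that idea; the function name, the `max_ratio` parameter, and the term list are illustrative assumptions, not the paper's implementation.

```python
import re

def blocklist_filter(documents, blocked_terms, max_ratio=0.0):
    """Keep documents whose fraction of blocked-term hits stays at or
    below max_ratio (0.0 drops a document on any match), a simple
    word-level filtering strategy used in pretraining pipelines."""
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, blocked_terms)) + r")\b",
        re.IGNORECASE,
    )
    kept = []
    for doc in documents:
        tokens = doc.split()
        hits = len(pattern.findall(doc))
        if tokens and hits / len(tokens) <= max_ratio:
            kept.append(doc)
    return kept
```

As the abstract notes, exactly this kind of keyword matching can remove documents that merely mention identity terms, which is how filtering ends up underrepresenting vulnerable groups.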
A comparison of data filtering techniques for English-Polish LLM-based machine translation in the biomedical domain
Lérida, Jorge del Pozo, Kojs, Kamil, Máté, János, Barański, Mikołaj Antoni, Hardmeier, Christian
Large Language Models (LLMs) have become state-of-the-art in Machine Translation (MT). They are often trained on massive bilingual parallel corpora scraped from the web, which contain low-quality entries and redundant information, leading to significant computational challenges. Various data filtering methods exist to reduce dataset sizes, but their effectiveness varies greatly with the specific language pair and domain. This paper evaluates the impact of commonly used data filtering techniques, such as LASER, MUSE, and LaBSE, on English-Polish translation in the biomedical domain. By filtering the UFAL Medical Corpus, we created datasets of varying sizes to fine-tune the mBART50 model, which was then evaluated with the SacreBLEU metric on the Khresmoi dataset; the quality of the translations was additionally assessed by bilingual speakers. Our results show that both LASER and MUSE can significantly reduce dataset sizes while maintaining or even enhancing performance. We recommend the use of LASER, as it consistently outperforms the other methods and provides the most fluent and natural-sounding translations.
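LASER- and LaBSE-style filtering both score each source–target pair by the similarity of multilingual sentence embeddings and keep only high-scoring pairs. A minimal sketch of that scoring loop is below; the toy character-count embedder stands in for a real multilingual encoder, and the function names and threshold are assumptions for illustration, not the paper's pipeline.

```python
import numpy as np

# Toy character-count "embedder" standing in for real LASER/LaBSE
# sentence embeddings (an assumption for this sketch).
ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def toy_embed(sentence):
    return np.array([sentence.lower().count(c) for c in ALPHABET], dtype=float)

def filter_parallel(pairs, embed_src, embed_tgt, threshold=0.8):
    """Keep (source, target) pairs whose sentence embeddings have
    cosine similarity >= threshold -- the core idea behind
    embedding-based parallel-corpus filtering."""
    kept = []
    for src, tgt in pairs:
        u, v = embed_src(src), embed_tgt(tgt)
        sim = float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
        if sim >= threshold:
            kept.append((src, tgt))
    return kept
```

Raising the threshold trades corpus size for pair quality, which is exactly the dataset-size/performance trade-off the paper measures.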
Mention Attention for Pronoun Translation
Tang, Gongbo, Hardmeier, Christian
Most pronouns are referring expressions: computers need to resolve what the pronouns refer to, and pronoun usage diverges across languages. Dealing with these divergences when translating pronouns is therefore a challenge in machine translation. Mentions are the referring candidates of pronouns and have closer relations with pronouns than general tokens do. We assume that extracting additional mention features can help pronoun translation. Therefore, we introduce an additional mention attention module in the decoder that pays extra attention to source mentions rather than non-mention tokens. The mention attention module not only extracts features from source mentions but also takes target-side context into account, which benefits pronoun translation. In addition, we introduce two mention classifiers that train the models to recognize mentions; their outputs guide the mention attention. We conduct experiments on the WMT17 English-German translation task and evaluate our models on general translation and pronoun translation using BLEU, APT, and contrastive evaluation metrics. Our proposed model outperforms the baseline Transformer in terms of both APT and BLEU scores. This confirms our hypothesis that pronoun translation can be improved by paying additional attention to source mentions, and shows that the added modules do not hurt general translation quality.
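The core mechanism can be pictured as ordinary scaled dot-product attention with non-mention source positions masked out before the softmax. This single-query NumPy sketch is a simplification under stated assumptions (one decoder query, a boolean mention mask supplied externally); the paper's module operates inside a full Transformer and is guided by learned mention classifiers.

```python
import numpy as np

def mention_attention(query, keys, values, mention_mask):
    """Single-query scaled dot-product attention restricted to source
    mentions: non-mention positions are masked to -inf before the
    softmax, so all attention mass goes to mention tokens."""
    d = query.shape[-1]
    scores = keys @ query / np.sqrt(d)        # (src_len,)
    scores = np.where(mention_mask, scores, -np.inf)
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    return weights @ values                   # mention-only context vector
```

Because masked positions receive zero weight, the returned context vector summarizes only the mention tokens, which is the extra signal the decoder uses for pronoun translation.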
A Dataset for the Detection of Dehumanizing Language
Engelmann, Paul, Trolle, Peter Brunsgaard, Hardmeier, Christian
Dehumanization can range from blatant to subtle forms of varying degrees (Bain et al., 2009), making automated, general detection difficult. Mendelsohn et al. (2020) present one of the first computational works on dehumanization through explicit feature engineering, using lexicon and word embedding based approaches to detect dehumanizing associations across several years in a New York Times corpus. Outside of this, there is little computational work on dehumanization. […] and Haslam (2006), where a sample is considered dehumanizing if it contains at least one of the following categories: negative evaluation of a target group, denial of agency, moral disgust, animal metaphors, objectification. Animal metaphors and objectification specifically relate to a human being compared to an animal or object with the intent to cause harm. Trigger Warning: This paper contains examples of hateful content that some may […]
Parallel Data Helps Neural Entity Coreference Resolution
Tang, Gongbo, Hardmeier, Christian
Coreference resolution is the task of finding expressions that refer to the same entity in a text. Coreference models are generally trained on monolingual annotated data, but annotating coreference is expensive and challenging. Hardmeier et al. (2013) have shown that parallel data contains latent anaphoric knowledge, but it has not yet been explored in end-to-end neural models. In this paper, we propose a simple yet effective model to exploit coreference knowledge from parallel data. In addition to the conventional modules that learn coreference from annotations, we introduce an unsupervised module to capture cross-lingual coreference knowledge. Our proposed cross-lingual model achieves consistent improvements, up to 1.74 percentage points, on the OntoNotes 5.0 English dataset using 9 different synthetic parallel datasets. These experimental results confirm that parallel data can provide additional coreference knowledge that benefits coreference resolution tasks.
Findings of the 2016 WMT Shared Task on Cross-lingual Pronoun Prediction
Guillou, Liane, Hardmeier, Christian, Nakov, Preslav, Stymne, Sara, Tiedemann, Jörg, Versley, Yannick, Cettolo, Mauro, Webber, Bonnie, Popescu-Belis, Andrei
We describe the design, the evaluation setup, and the results of the 2016 WMT shared task on cross-lingual pronoun prediction. This is a classification task in which participants are asked to predict which pronoun class label should replace a placeholder value in the target-language text, provided in lemmatised and PoS-tagged form. We provided four subtasks, for the English-French and English-German language pairs, in both directions. Eleven teams participated in the shared task: nine for the English-French subtask, five for French-English, nine for English-German, and six for German-English. Most of the submissions outperformed two strong language-model-based baseline systems, with systems using deep recurrent neural networks outperforming those using other architectures for most language pairs.