Hardmeier, Christian
What Are They Filtering Out? A Survey of Filtering Strategies for Harm Reduction in Pretraining Datasets
Stranisci, Marco Antonio, Hardmeier, Christian
Data filtering strategies are a crucial component in developing safe Large Language Models (LLMs), since they support the removal of harmful content from pretraining datasets. However, there is a lack of research on the actual impact of these strategies on groups vulnerable to discrimination, and their effectiveness has not yet been systematically addressed. In this paper we present a benchmark study of data filtering strategies for harm reduction, aimed at providing a systematic overview of these approaches. We survey 55 technical reports of English LMs and LLMs to identify the filtering strategies in the literature and implement an experimental setting to test their impact on vulnerable groups. Our results show that the positive impact these strategies have in reducing harmful content in documents comes with the side effect of increasing the underrepresentation of groups vulnerable to discrimination in the datasets.
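The survey does not prescribe a single filtering method, but a common family of strategies it covers is word-level blocklist filtering, where documents are dropped when they match terms from a curated list. The following is a minimal sketch of that idea; the function name, the `max_ratio` parameter, and the term list are illustrative assumptions, not the paper's implementation.

```python
import re

def blocklist_filter(documents, blocked_terms, max_ratio=0.0):
    """Keep documents whose fraction of blocked-term hits stays at or
    below max_ratio (0.0 drops a document on any match), a simple
    word-level filtering strategy used in pretraining pipelines."""
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, blocked_terms)) + r")\b",
        re.IGNORECASE,
    )
    kept = []
    for doc in documents:
        tokens = doc.split()
        hits = len(pattern.findall(doc))
        if tokens and hits / len(tokens) <= max_ratio:
            kept.append(doc)
    return kept
```

As the abstract notes, exactly this kind of keyword matching can remove documents that merely mention identity terms, which is how filtering ends up underrepresenting vulnerable groups.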
A comparison of data filtering techniques for English-Polish LLM-based machine translation in the biomedical domain
Lérida, Jorge del Pozo, Kojs, Kamil, Máté, János, Barański, Mikołaj Antoni, Hardmeier, Christian
Large Language Models (LLMs) have become state-of-the-art in Machine Translation (MT). They are often trained on massive bilingual parallel corpora scraped from the web, which contain low-quality entries and redundant information, leading to significant computational challenges. Various data filtering methods exist to reduce dataset sizes, but their effectiveness varies greatly with the specific language pair and domain. This paper evaluates the impact of commonly used data filtering techniques, such as LASER, MUSE, and LaBSE, on English-Polish translation in the biomedical domain. By filtering the UFAL Medical Corpus, we created datasets of varying sizes to fine-tune the mBART50 model, which was then evaluated with the SacreBLEU metric on the Khresmoi dataset; the quality of the translations was additionally assessed by bilingual speakers. Our results show that both LASER and MUSE can significantly reduce dataset sizes while maintaining or even enhancing performance. We recommend the use of LASER, as it consistently outperforms the other methods and provides the most fluent and natural-sounding translations.
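LASER- and LaBSE-style filtering both score each source–target pair by the similarity of multilingual sentence embeddings and keep only high-scoring pairs. A minimal sketch of that scoring loop is below; the toy character-count embedder stands in for a real multilingual encoder, and the function names and threshold are assumptions for illustration, not the paper's pipeline.

```python
import numpy as np

# Toy character-count "embedder" standing in for real LASER/LaBSE
# sentence embeddings (an assumption for this sketch).
ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def toy_embed(sentence):
    return np.array([sentence.lower().count(c) for c in ALPHABET], dtype=float)

def filter_parallel(pairs, embed_src, embed_tgt, threshold=0.8):
    """Keep (source, target) pairs whose sentence embeddings have
    cosine similarity >= threshold -- the core idea behind
    embedding-based parallel-corpus filtering."""
    kept = []
    for src, tgt in pairs:
        u, v = embed_src(src), embed_tgt(tgt)
        sim = float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
        if sim >= threshold:
            kept.append((src, tgt))
    return kept
```

Raising the threshold trades corpus size for pair quality, which is exactly the dataset-size/performance trade-off the paper measures.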
Mention Attention for Pronoun Translation
Tang, Gongbo, Hardmeier, Christian
Most pronouns are referring expressions: computers need to resolve what the pronouns refer to, and pronoun usage diverges across languages. Dealing with these divergences when translating pronouns is therefore a challenge in machine translation. Mentions are the referring candidates of pronouns and have closer relations with pronouns than general tokens do. We assume that extracting additional mention features can help pronoun translation. Therefore, we introduce an additional mention attention module in the decoder that pays extra attention to source mentions rather than non-mention tokens. The mention attention module not only extracts features from source mentions but also takes target-side context into account, which benefits pronoun translation. In addition, we introduce two mention classifiers that train the models to recognize mentions; their outputs guide the mention attention. We conduct experiments on the WMT17 English-German translation task and evaluate our models on general translation and pronoun translation using BLEU, APT, and contrastive evaluation metrics. Our proposed model outperforms the baseline Transformer in terms of both APT and BLEU scores. This confirms our hypothesis that pronoun translation can be improved by paying additional attention to source mentions, and shows that the added modules do not hurt general translation quality.
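The core mechanism can be pictured as ordinary scaled dot-product attention with non-mention source positions masked out before the softmax. This single-query NumPy sketch is a simplification under stated assumptions (one decoder query, a boolean mention mask supplied externally); the paper's module operates inside a full Transformer and is guided by learned mention classifiers.

```python
import numpy as np

def mention_attention(query, keys, values, mention_mask):
    """Single-query scaled dot-product attention restricted to source
    mentions: non-mention positions are masked to -inf before the
    softmax, so all attention mass goes to mention tokens."""
    d = query.shape[-1]
    scores = keys @ query / np.sqrt(d)        # (src_len,)
    scores = np.where(mention_mask, scores, -np.inf)
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    return weights @ values                   # mention-only context vector
```

Because masked positions receive zero weight, the returned context vector summarizes only the mention tokens, which is the extra signal the decoder uses for pronoun translation.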
A Dataset for the Detection of Dehumanizing Language
Engelmann, Paul, Trolle, Peter Brunsgaard, Hardmeier, Christian
Dehumanization can range from blatant to subtle forms of varying degrees (Bain et al., 2009), making automated, general detection difficult. Mendelsohn et al. (2020) present one of the first computational works on dehumanization through explicit feature engineering, using lexicon and word embedding based approaches to detect dehumanizing associations across several years in a New York Times corpus. Outside of this, there is little computational work on dehumanization. […] and Haslam (2006), where a sample is considered dehumanizing if it contains at least one of the following categories: negative evaluation of a target group, denial of agency, moral disgust, animal metaphors, objectification. Animal metaphors and objectification specifically relate to a human being compared to an animal or object with the intent to cause harm. Trigger Warning: This paper contains examples of hateful content that some may […]
Parallel Data Helps Neural Entity Coreference Resolution
Tang, Gongbo, Hardmeier, Christian
Coreference resolution is the task of finding expressions that refer to the same entity in a text. Coreference models are generally trained on monolingual annotated data, but annotating coreference is expensive and challenging. Hardmeier et al. (2013) have shown that parallel data contains latent anaphoric knowledge, but it has not yet been explored in end-to-end neural models. In this paper, we propose a simple yet effective model to exploit coreference knowledge from parallel data. In addition to the conventional modules that learn coreference from annotations, we introduce an unsupervised module to capture cross-lingual coreference knowledge. Our proposed cross-lingual model achieves consistent improvements, up to 1.74 percentage points, on the OntoNotes 5.0 English dataset using 9 different synthetic parallel datasets. These experimental results confirm that parallel data can provide additional coreference knowledge that benefits coreference resolution tasks.
Findings of the 2016 WMT Shared Task on Cross-lingual Pronoun Prediction
Guillou, Liane, Hardmeier, Christian, Nakov, Preslav, Stymne, Sara, Tiedemann, Jörg, Versley, Yannick, Cettolo, Mauro, Webber, Bonnie, Popescu-Belis, Andrei
We describe the design, the evaluation setup, and the results of the 2016 WMT shared task on cross-lingual pronoun prediction. This is a classification task in which participants are asked to predict which pronoun class label should replace a placeholder value in the target-language text, provided in lemmatised and PoS-tagged form. We provided four subtasks, for the English-French and English-German language pairs, in both directions. Eleven teams participated in the shared task: nine for the English-French subtask, five for French-English, nine for English-German, and six for German-English. Most of the submissions outperformed two strong language-model-based baseline systems, with systems using deep recurrent neural networks outperforming those using other architectures for most language pairs.