Rios, Annette
German also Hallucinates! Inconsistency Detection in News Summaries with the Absinth Dataset
Mascarell, Laura, Chalumattu, Ribin, Rios, Annette
The advent of Large Language Models (LLMs) has led to remarkable progress on a wide range of natural language processing tasks. Despite these advances, LLMs still hallucinate information in their output, which is a major issue in automatic text summarization: the generated summary must be consistent with the content of the source document. Previous research addresses the challenging task of detecting hallucinations in the output (i.e., inconsistency detection) in order to evaluate the faithfulness of the generated summaries. However, these works primarily focus on English, and recent multilingual approaches lack German data. This work presents absinth, a manually annotated dataset for hallucination detection in German news summarization, and explores the capabilities of novel open-source LLMs on this task in both fine-tuning and in-context learning settings. We release the absinth dataset as open source to foster further research on hallucination detection in German.
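To make the in-context learning setting concrete, below is a minimal sketch of prompting a language model to label a single summary sentence against its source article. The three-way label set, the prompt wording, and the stand-in model are illustrative assumptions, not the paper's exact setup.

```python
# Hedged sketch of in-context inconsistency detection. The label scheme
# (Faithful / Intrinsic / Extrinsic) and prompt are assumptions for
# illustration; gpt2 is a stand-in -- in practice one would use a
# German-capable, instruction-tuned LLM.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # placeholder model

def build_prompt(article: str, summary_sentence: str) -> str:
    """Assemble a zero-shot classification prompt for one summary sentence."""
    return (
        "Article:\n" + article + "\n\n"
        "Summary sentence:\n" + summary_sentence + "\n\n"
        "Is the summary sentence consistent with the article? "
        "Answer with one label: Faithful, Intrinsic, or Extrinsic.\nLabel:"
    )

article = "Der Stadtrat stimmte am Montag für den neuen Haushalt."
sentence = "Der Stadtrat lehnte den Haushalt am Dienstag ab."
prompt = build_prompt(article, sentence)
output = generator(prompt, max_new_tokens=5)[0]["generated_text"]
print(output[len(prompt):].strip())  # the model's predicted label
```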
Considerations for meaningful sign language machine translation based on glosses
Müller, Mathias, Jiang, Zifan, Moryossef, Amit, Rios, Annette, Ebling, Sarah
Automatic sign language processing is gaining popularity in Natural Language Processing (NLP) research (Yin et al., 2021). In machine translation (MT) in particular, sign language translation based on glosses is a prominent approach. In this paper, we review recent works on neural gloss translation. We find that limitations of glosses in general and limitations of specific datasets are not discussed in a transparent manner and that there is no common standard for evaluation. To address these issues, we put forward concrete recommendations for future research on gloss translation. Our suggestions advocate awareness of the inherent limitations of gloss-based approaches, realistic datasets, stronger baselines and convincing evaluation.
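As a concrete illustration of the evaluation point, the sketch below scores gloss-to-text translation output with sacreBLEU, reporting the metric signature that makes scores comparable across papers. The gloss and German sentences are invented examples, not drawn from any real dataset, and this is not the paper's own evaluation pipeline.

```python
# Illustrative sketch only: evaluating gloss-to-text output with sacreBLEU.
# All strings below are invented; uppercase is a common gloss convention.
from sacrebleu.metrics import BLEU  # pip install sacrebleu

glosses = ["MORGEN REGEN", "ICH HAUS GEHEN"]
hypotheses = ["Morgen regnet es.", "Ich gehe nach Hause."]
references = [["Morgen wird es regnen.", "Ich gehe nach Hause."]]  # one ref set

for gloss, hyp in zip(glosses, hypotheses):
    print(f"{gloss} -> {hyp}")

bleu = BLEU()
result = bleu.corpus_score(hypotheses, references)
print(result)                 # e.g. "BLEU = ..."
print(bleu.get_signature())   # signature for reproducible reporting
```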
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
Caswell, Isaac, Kreutzer, Julia, Wang, Lisa, Wahab, Ahsan, van Esch, Daan, Ulzii-Orshikh, Nasanbayar, Tapo, Allahsera, Subramani, Nishant, Sokolov, Artem, Sikasote, Claytone, Setyawan, Monang, Sarin, Supheakmungkol, Samb, Sokhar, Sagot, Benoît, Rivera, Clara, Rios, Annette, Papadimitriou, Isabel, Osei, Salomey, Suárez, Pedro Javier Ortiz, Orife, Iroro, Ogueji, Kelechi, Niyongabo, Rubungo Andre, Nguyen, Toan Q., Müller, Mathias, Müller, André, Muhammad, Shamsuddeen Hassan, Muhammad, Nanda, Mnyakeni, Ayanda, Mirzakhalov, Jamshidbek, Matangira, Tapiwanashe, Leong, Colin, Lawson, Nze, Kudugunta, Sneha, Jernite, Yacine, Jenny, Mathias, Firat, Orhan, Dossou, Bonaventure F. P., Dlamini, Sakhile, de Silva, Nisansa, Ballı, Sakine Çabuk, Biderman, Stella, Battisti, Alessia, Baruwa, Ahmed, Bapna, Ankur, Baljekar, Pallavi, Azime, Israel Abebe, Awokoya, Ayodele, Ataman, Duygu, Ahia, Orevaoghene, Ahia, Oghenefego, Agrawal, Sweta, Adeyemi, Mofetoluwa
With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. However, to date there has been no systematic analysis of the quality of these publicly available datasets, or of whether the datasets actually contain content in the languages they claim to represent. In this work, we manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4), and audit the correctness of language codes in a sixth (JW300). We find that lower-resource corpora have systematic issues: at least 15 corpora are completely erroneous, and in a significant fraction fewer than 50% of the sentences are of acceptable quality. Similarly, we find 82 corpora that are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-speakers of the languages in question, and we supplement the human judgements with automatic analyses. Inspired by our analysis, we recommend techniques to evaluate and improve multilingual corpora and discuss the risks that come with low-quality data releases.
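One way such an automatic analysis can look in practice is to sample lines from a corpus and check them with off-the-shelf language identification. The sketch below uses the langdetect package as a stand-in; it is an assumption for illustration, not the audit pipeline from the paper.

```python
# Hedged sketch of a simple corpus audit: sample lines from a corpus that
# claims to be in language `expected` and measure how many pass automatic
# language ID. langdetect is a stand-in for whatever LID system one prefers.
import random
from langdetect import detect, DetectorFactory  # pip install langdetect
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # make langdetect deterministic

def audit_sample(lines, expected, sample_size=100, seed=0):
    """Return the fraction of sampled lines whose detected language matches."""
    random.seed(seed)
    sample = random.sample(lines, min(sample_size, len(lines)))
    hits = 0
    for line in sample:
        try:
            if detect(line) == expected:
                hits += 1
        except LangDetectException:  # empty/undecidable lines count as misses
            pass
    return hits / len(sample)

corpus = ["Dies ist ein deutscher Satz.", "This line is English.", "12345"]
print(f"match rate: {audit_sample(corpus, 'de'):.0%}")
```

A low match rate on such a sample would flag a corpus for closer manual inspection, which is roughly the division of labor between automatic and human checks that the abstract describes.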