AITopics | Navigli, Roberto

Collaborating Authors

Navigli, Roberto

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering

Molfese, Francesco Maria, Moroni, Luca, Gioffrè, Luca, Scirè, Alessandro, Conia, Simone, Navigli, Roberto

arXiv.org Artificial IntelligenceMar-19-2025

One of the most widely used tasks to evaluate Large Language Models (LLMs) is Multiple-Choice Question Answering (MCQA). While open-ended question answering tasks are more challenging to evaluate, MCQA tasks are, in principle, easier to assess, as the model's answer is thought to be simple to extract and is directly compared to a set of predefined choices. However, recent studies have started to question the reliability of MCQA evaluation, showing that multiple factors can significantly impact the reported performance of LLMs, especially when the model generates free-form text before selecting one of the answer choices. In this work, we shed light on the inconsistencies of MCQA evaluation strategies, which can lead to inaccurate and misleading model comparisons. We systematically analyze whether existing answer extraction methods are aligned with human judgment, and how they are influenced by answer constraints in the prompt across different domains. Our experiments demonstrate that traditional evaluation strategies often underestimate LLM capabilities, while LLM-based answer extractors are prone to systematic errors. Moreover, we reveal a fundamental trade-off between including format constraints in the prompt to simplify answer extraction and allowing models to generate free-form text to improve reasoning. Our findings call for standardized evaluation methodologies and highlight the need for more reliable and consistent MCQA evaluation practices.

artificial intelligence, large language model, natural language, (17 more...)

arXiv.org Artificial Intelligence

2503.14996

Country:

Europe (0.68)
Asia > China (0.28)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report > New Finding (1.00)

Industry: Education (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

LLMs Lost in Translation: M-ALERT uncovers Cross-Linguistic Safety Gaps

Friedrich, Felix, Tedeschi, Simone, Schramowski, Patrick, Brack, Manuel, Navigli, Roberto, Nguyen, Huu, Li, Bo, Kersting, Kristian

arXiv.org Artificial IntelligenceDec-19-2024

Building safe Large Language Models (LLMs) across multiple languages is essential in ensuring both safe access and linguistic diversity. To this end, we introduce M-ALERT, a multilingual benchmark that evaluates the safety of LLMs in five languages: English, French, German, Italian, and Spanish. M-ALERT includes 15k high-quality prompts per language, totaling 75k, following the detailed ALERT taxonomy. Our extensive experiments on 10 state-of-the-art LLMs highlight the importance of language-specific safety analysis, revealing that models often exhibit significant inconsistencies in safety across languages and categories. For instance, Llama3.2 shows high unsafety in the category crime_tax for Italian but remains safe in other languages. Similar differences can be observed across all models. In contrast, certain categories, such as substance_cannabis and crime_propaganda, consistently trigger unsafe responses across models and languages. These findings underscore the need for robust multilingual safety practices in LLMs to ensure safe and responsible usage across diverse user communities.

large language model, natural language, qwen2, (17 more...)

arXiv.org Artificial Intelligence

2412.15035

Country:

North America > United States (0.93)
Europe (0.92)

Genre: Research Report > New Finding (0.87)

Industry: Government > Regional Government > North America Government > United States Government (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Word Sense Linking: Disambiguating Outside the Sandbox

Bejgu, Andrei Stefan, Barba, Edoardo, Procopio, Luigi, Fernández-Castro, Alberte, Navigli, Roberto

arXiv.org Artificial IntelligenceDec-12-2024

Word Sense Disambiguation (WSD) is the task of associating a word in a given context with its most suitable meaning among a set of possible candidates. While the task has recently witnessed renewed interest, with systems achieving performances above the estimated inter-annotator agreement, at the time of writing it still struggles to find downstream applications. We argue that one of the reasons behind this is the difficulty of applying WSD to plain text. Indeed, in the standard formulation, models work under the assumptions that a) all the spans to disambiguate have already been identified, and b) all the possible candidate senses of each span are provided, both of which are requirements that are far from trivial. In this work, we present a new task called Word Sense Linking (WSL) where, given an input text and a reference sense inventory, systems have to both identify which spans to disambiguate and then link them to their most suitable meaning.We put forward a transformer-based architecture for the task and thoroughly evaluate both its performance and those of state-of-the-art WSD systems scaled to WSL, iteratively relaxing the assumptions of WSD. We hope that our work will foster easier integration of lexical semantics into downstream applications.

artificial intelligence, computational linguistic, natural language, (17 more...)

arXiv.org Artificial Intelligence

doi: 10.18653/v1/2024.findings-acl.851

2412.0937

Country:

Europe (1.00)
Asia (0.68)
North America > United States > Minnesota (0.28)

Genre: Research Report > New Finding (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.70)

Add feedback

Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-Oasis

Scirè, Alessandro, Bejgu, Andrei Stefan, Tedeschi, Simone, Ghonim, Karim, Martelli, Federico, Navigli, Roberto

arXiv.org Artificial IntelligenceDec-2-2024

After the introduction of Large Language Models (LLMs), there have been substantial improvements in the performance of Natural Language Generation (NLG) tasks, including Text Summarization and Machine Translation. However, LLMs still produce outputs containing hallucinations, that is, content not grounded in factual information. Therefore, developing methods to assess the factuality of LLMs has become urgent. Indeed, resources for factuality evaluation have recently emerged. Although challenging, these resources face one or more of the following limitations: (i) they are tailored to a specific task or domain; (ii) they are limited in size, thereby preventing the training of new factuality evaluators; (iii) they are designed for simpler verification tasks, such as claim verification. To address these issues, we introduce LLM-Oasis, to the best of our knowledge the largest resource for training end-to-end factuality evaluators. LLM-Oasis is constructed by extracting claims from Wikipedia, falsifying a subset of these claims, and generating pairs of factual and unfactual texts. We then rely on human annotators to both validate the quality of our dataset and to create a gold standard test set for benchmarking factuality evaluation systems. Our experiments demonstrate that LLM-Oasis presents a significant challenge for state-of-the-art LLMs, with GPT-4o achieving up to 60% accuracy in our proposed end-to-end factuality evaluation task, highlighting its potential to drive future research in the field.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2411.19655

Country:

Asia (0.93)
North America > United States > Louisiana (0.14)
Europe > Middle East > Malta (0.14)

Genre:

Workflow (1.00)
Research Report (1.00)

Industry: Government > Military (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Multimodal Large Language Models and Tunings: Vision, Language, Sensors, Audio, and Beyond

Han, Soyeon Caren, Cao, Feiqi, Poon, Josiah, Navigli, Roberto

arXiv.org Artificial IntelligenceOct-7-2024

This tutorial explores recent advancements in multimodal pretrained and large models, capable of integrating and processing diverse data forms such as text, images, audio, and video. Participants will gain an understanding of the foundational concepts of multimodality, the evolution of multimodal research, and the key technical challenges addressed by these models. We will cover the latest multimodal datasets and pretrained models, including those beyond vision and language. Additionally, the tutorial will delve into the intricacies of multimodal large models and instruction tuning strategies to optimise performance for specific tasks. Hands-on laboratories will offer practical experience with state-of-the-art multimodal models, demonstrating real-world applications like visual storytelling and visual question answering. This tutorial aims to equip researchers, practitioners, and newcomers with the knowledge and skills to leverage multimodal AI. ACM Multimedia 2024 is the ideal venue for this tutorial, aligning perfectly with our goal of understanding multimodal pretrained and large language models, and their tuning mechanisms.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2410.05608

Country: Oceania > Australia (0.34)

Genre: Instructional Material > Course Syllabus & Notes (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.71)

Add feedback

ZEBRA: Zero-Shot Example-Based Retrieval Augmentation for Commonsense Question Answering

Molfese, Francesco Maria, Conia, Simone, Orlando, Riccardo, Navigli, Roberto

arXiv.org Artificial IntelligenceOct-7-2024

Current Large Language Models (LLMs) have shown strong reasoning capabilities in commonsense question answering benchmarks, but the process underlying their success remains largely opaque. As a consequence, recent approaches have equipped LLMs with mechanisms for knowledge retrieval, reasoning and introspection, not only to improve their capabilities but also to enhance the interpretability of their outputs. However, these methods require additional training, hand-crafted templates or human-written explanations. To address these issues, we introduce ZEBRA, a zero-shot question answering framework that combines retrieval, case-based reasoning and introspection and dispenses with the need for additional training of the LLM. Given an input question, ZEBRA retrieves relevant question-knowledge pairs from a knowledge base and generates new knowledge by reasoning over the relationships in these pairs. This generated knowledge is then used to answer the input question, improving the model's performance and interpretability. We evaluate our approach across 8 well-established commonsense reasoning benchmarks, demonstrating that ZEBRA consistently outperforms strong LLMs and previous knowledge integration approaches, achieving an average accuracy improvement of up to 4.5 points.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2410.05077

Country:

Europe (0.93)
Asia (0.67)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

Add feedback

Beyond Correlation: Interpretable Evaluation of Machine Translation Metrics

Perrella, Stefano, Proietti, Lorenzo, Cabot, Pere-Lluís Huguet, Barba, Edoardo, Navigli, Roberto

arXiv.org Artificial IntelligenceOct-7-2024

Machine Translation (MT) evaluation metrics assess translation quality automatically. Recently, researchers have employed MT metrics for various new use cases, such as data filtering and translation re-ranking. However, most MT metrics return assessments as scalar scores that are difficult to interpret, posing a challenge to making informed design choices. Moreover, MT metrics' capabilities have historically been evaluated using correlation with human judgment, which, despite its efficacy, falls short of providing intuitive insights into metric performance, especially in terms of new metric use cases. To address these issues, we introduce an interpretable evaluation framework for MT metrics. Within this framework, we evaluate metrics in two scenarios that serve as proxies for the data filtering and translation re-ranking use cases. Furthermore, by measuring the performance of MT metrics using Precision, Recall, and F-score, we offer clearer insights into their capabilities than correlation with human judgments. Finally, we raise concerns regarding the reliability of manually curated data following the Direct Assessments+Scalar Quality Metrics (DA+SQM) guidelines, reporting a notably low agreement with Multidimensional Quality Metrics (MQM) annotations.

artificial intelligence, machine translation, natural language, (14 more...)

arXiv.org Artificial Intelligence

2410.05183

Country:

Europe (1.00)
Asia > Middle East > UAE (0.14)
North America > United States > Massachusetts (0.14)
North America > Mexico > Mexico City (0.14)

Genre: Research Report > New Finding (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

Guardians of the Machine Translation Meta-Evaluation: Sentinel Metrics Fall In!

Perrella, Stefano, Proietti, Lorenzo, Scirè, Alessandro, Barba, Edoardo, Navigli, Roberto

arXiv.org Artificial IntelligenceAug-25-2024

Annually, at the Conference of Machine Translation (WMT), the Metrics Shared Task organizers conduct the meta-evaluation of Machine Translation (MT) metrics, ranking them according to their correlation with human judgments. Their results guide researchers toward enhancing the next generation of metrics and MT systems. With the recent introduction of neural metrics, the field has witnessed notable advancements. Nevertheless, the inherent opacity of these metrics has posed substantial challenges to the meta-evaluation process. This work highlights two issues with the meta-evaluation framework currently employed in WMT, and assesses their impact on the metrics rankings. To do this, we introduce the concept of sentinel metrics, which are designed explicitly to scrutinize the meta-evaluation process's accuracy, robustness, and fairness. By employing sentinel metrics, we aim to validate our findings, and shed light on and monitor the potential biases or inconsistencies in the rankings. We discover that the present meta-evaluation framework favors two categories of metrics: i) those explicitly trained to mimic human quality assessments, and ii) continuous metrics. Finally, we raise concerns regarding the evaluation capabilities of state-of-the-art metrics, emphasizing that they might be basing their assessments on spurious correlations found in their training data.

machine learning, metric, natural language, (15 more...)

arXiv.org Artificial Intelligence

2408.13831

Country:

Europe (1.00)
North America > United States (0.67)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming

Tedeschi, Simone, Friedrich, Felix, Schramowski, Patrick, Kersting, Kristian, Navigli, Roberto, Nguyen, Huu, Li, Bo

arXiv.org Artificial IntelligenceJun-24-2024

When building Large Language Models (LLMs), it is paramount to bear safety in mind and protect them with guardrails. Indeed, LLMs should never generate content promoting or normalizing harmful, illegal, or unethical behavior that may contribute to harm to individuals or society. This principle applies to both normal and adversarial use. In response, we introduce ALERT, a large-scale benchmark to assess safety based on a novel fine-grained risk taxonomy. It is designed to evaluate the safety of LLMs through red teaming methodologies and consists of more than 45k instructions categorized using our novel taxonomy. By subjecting LLMs to adversarial testing scenarios, ALERT aims to identify vulnerabilities, inform improvements, and enhance the overall safety of the language models. Furthermore, the fine-grained taxonomy enables researchers to perform an in-depth evaluation that also helps one to assess the alignment with various policies. In our experiments, we extensively evaluate 10 popular open- and closed-source LLMs and demonstrate that many of them still struggle to attain reasonable levels of safety.

category, large language model, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2404.08676

Country:

North America > United States (0.67)
Europe (0.46)

Genre: Research Report > New Finding (0.48)

Industry:

Law > Criminal Law (1.00)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Health & Medicine (1.00)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Aurora-M: The First Open Source Multilingual Language Model Red-teamed according to the U.S. Executive Order

Nakamura, Taishi, Mishra, Mayank, Tedeschi, Simone, Chai, Yekun, Stillerman, Jason T, Friedrich, Felix, Yadav, Prateek, Laud, Tanmay, Chien, Vu Minh, Zhuo, Terry Yue, Misra, Diganta, Bogin, Ben, Vu, Xuan-Son, Karpinska, Marzena, Dantuluri, Arnav Varma, Kusa, Wojciech, Furlanello, Tommaso, Yokota, Rio, Muennighoff, Niklas, Pai, Suhas, Adewumi, Tosin, Laippala, Veronika, Yao, Xiaozhe, Junior, Adalberto, Ariyak, Alpay, Drozd, Aleksandr, Clive, Jordan, Gupta, Kshitij, Chen, Liangyu, Sun, Qi, Tsui, Ken, Persaud, Noah, Fahmy, Nour, Chen, Tianlong, Bansal, Mohit, Monti, Nicolo, Dang, Tai, Luo, Ziyang, Bui, Tien-Tung, Navigli, Roberto, Mehta, Virendra, Blumberg, Matthew, May, Victor, Nguyen, Huu, Pyysalo, Sampo

arXiv.org Artificial IntelligenceApr-23-2024

Pretrained language models underpin several AI applications, but their high computational cost for training limits accessibility. Initiatives such as BLOOM and StarCoder aim to democratize access to pretrained models for collaborative community development. However, such existing models face challenges: limited multilingual capabilities, continual pretraining causing catastrophic forgetting, whereas pretraining from scratch is computationally expensive, and compliance with AI safety and development laws. This paper presents Aurora-M, a 15B parameter multilingual open-source model trained on English, Finnish, Hindi, Japanese, Vietnamese, and code. Continually pretrained from StarCoderPlus on 435 billion additional tokens, Aurora-M surpasses 2 trillion tokens in total training token count. It is the first open-source multilingual model fine-tuned on human-reviewed safety instructions, thus aligning its development not only with conventional red-teaming considerations, but also with the specific concerns articulated in the Biden-Harris Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. Aurora-M is rigorously evaluated across various tasks and languages, demonstrating robustness against catastrophic forgetting and outperforming alternatives in multilingual settings, particularly in safety evaluations. To promote responsible open-source LLM development, Aurora-M and its variants are released at https://huggingface.co/collections/aurora-m/aurora-m-models-65fdfdff62471e09812f5407 .

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2404.00399

Country:

Europe (1.00)
Asia (1.00)
North America > United States > Massachusetts (0.14)

Genre: Research Report (0.64)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)
Health & Medicine (1.00)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Add feedback