natural language generation



Exploring the Influence of Relevant Knowledge for Natural Language Generation Interpretability

Martínez-Murillo, Iván, Moreda, Paloma, Lloret, Elena

arXiv.org Artificial Intelligence

This paper explores the influence of external knowledge integration in Natural Language Generation (NLG), focusing on a commonsense generation task. We extend the CommonGen dataset by creating KITGI, a benchmark that pairs input concept sets with retrieved semantic relations from ConceptNet and includes manually annotated outputs. Using the T5-Large model, we compare sentence generation under two conditions: with full external knowledge and with filtered knowledge where highly relevant relations were deliberately removed. Our interpretability benchmark follows a three-stage method: (1) identifying and removing key knowledge, (2) regenerating sentences, and (3) manually assessing outputs for commonsense plausibility and concept coverage. Results show that sentences generated with full knowledge achieved 91% correctness across both criteria, while filtering reduced performance drastically to 6%. These findings demonstrate that relevant external knowledge is critical for maintaining both coherence and concept coverage in NLG. This work highlights the importance of designing interpretable, knowledge-enhanced NLG systems and calls for evaluation frameworks that capture the underlying reasoning beyond surface-level metrics.
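Stage (1) of the method above, removing the most relevant retrieved relations before regeneration, can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the relation triples and the concept-overlap relevance score are assumptions standing in for the paper's actual ConceptNet retrieval and relevance criteria.

```python
# Hypothetical sketch of stage (1): rank ConceptNet-style (head, relation, tail)
# triples by how many input concepts they touch, then drop the top-k most
# relevant ones to produce the "filtered knowledge" condition.

def relevance(triple, concepts):
    """Score a triple by the number of input concepts appearing as head or tail."""
    head, _, tail = triple
    return sum(c in (head, tail) for c in concepts)

def filter_top_relations(triples, concepts, k=2):
    """Remove the k most relevant triples; the remainder feeds regeneration."""
    ranked = sorted(triples, key=lambda t: relevance(t, concepts), reverse=True)
    return ranked[k:]

concepts = {"dog", "frisbee", "catch"}
triples = [
    ("dog", "CapableOf", "catch"),
    ("frisbee", "UsedFor", "catch"),
    ("dog", "IsA", "animal"),
    ("park", "UsedFor", "recreation"),
]
kept = filter_top_relations(triples, concepts, k=2)
```

Under this toy scoring, the two triples linking `dog`, `frisbee`, and `catch` are removed, while loosely related background triples survive, mimicking the paper's "highly relevant relations deliberately removed" condition.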


The QCET Taxonomy of Standard Quality Criterion Names and Definitions for the Evaluation of NLP Systems

Belz, Anya, Mille, Simon, Thomson, Craig

arXiv.org Artificial Intelligence

Prior work has shown that two NLP evaluation experiments that report results for the same quality criterion name (e.g. Fluency) do not necessarily evaluate the same aspect of quality, and the comparability implied by the name can be misleading. Not knowing when two evaluations are comparable in this sense means we currently lack the ability to draw reliable conclusions about system quality on the basis of multiple, independently conducted evaluations. This in turn hampers the ability of the field to progress scientifically as a whole, a pervasive issue in NLP since its beginning (Sparck Jones, 1981). It is hard to see how the issue of unclear comparability can be fully addressed other than by the creation of a standard set of quality criterion names and definitions that the several hundred quality criterion names actually in use in the field can be mapped to, and grounded in. Taking a strictly descriptive approach, the QCET Quality Criteria for Evaluation Taxonomy derives a standard set of quality criterion names and definitions from three surveys of evaluations reported in NLP, and structures them into a hierarchy where each parent node captures common aspects of its child nodes. We present QCET and the resources it consists of, and discuss its three main uses in (i) establishing comparability of existing evaluations, (ii) guiding the design of new evaluations, and (iii) assessing regulatory compliance.


Rule-Based Moral Principles for Explaining Uncertainty in Natural Language Generation

Atf, Zahra, Lewis, Peter R

arXiv.org Artificial Intelligence

As large language models (LLMs) are increasingly used in high-stakes applications, the challenge of explaining uncertainty in natural language generation has become both a technical and moral imperative. Traditional approaches rely on probabilistic methods that are often opaque, difficult to interpret, and misaligned with human expectations of transparency and accountability. In response to these limitations, this paper introduces a novel framework based on rule-based moral principles--simple, human-inspired ethical guidelines--for responding to uncertainty in LLM-generated text. Drawing on insights from experimental moral psychology and virtue ethics, we define a set of symbolic behavioral rules such as precaution, deference, and responsibility to guide system responses under conditions of epistemic or aleatoric uncertainty. These rules are implemented declaratively and are designed to generate adaptive, context-sensitive explanations even in the absence of precise confidence metrics. The moral principles are encoded as symbolic rules within a lightweight Prolog-based engine, where each uncertainty tag (low, medium, high) activates an ethically aligned system action along with an automatically generated, plain-language rationale. We evaluate the framework through scenario-based simulations that benchmark rule coverage, assess fairness implications, and analyze trust calibration. An interpretive explanation module is integrated to reveal both the assigned uncertainty level and its underlying justification in a transparent and accessible way. We illustrate the framework through hypothetical yet plausible use cases in clinical and legal domains, demonstrating how rule-based moral reasoning can enhance user trust, promote fairness, and improve the interpretability of AI-generated language.
By offering a lightweight, philosophically grounded alternative to probabilistic uncertainty modeling, our approach paves the way for more ethical, human-aligned, and socially responsible natural language generation.
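The tag-to-action mechanism described above can be sketched declaratively. The rule names (precaution, deference, responsibility) and the uncertainty tags come from the abstract; the specific tag-to-principle assignments and rationale wording below are illustrative assumptions, and the paper's actual engine is Prolog-based rather than Python.

```python
# Minimal sketch of a rule table mapping uncertainty tags to ethically
# aligned actions with plain-language rationales. The mapping itself is
# hypothetical; only the tag and principle names appear in the abstract.

RULES = {
    "low":    ("responsibility", "answer directly, noting the answer may be revised"),
    "medium": ("precaution", "answer with an explicit caveat about limited confidence"),
    "high":   ("deference", "defer to a human expert rather than answering"),
}

def respond(uncertainty_tag):
    """Return (principle, rationale) for a given uncertainty tag."""
    if uncertainty_tag not in RULES:
        raise ValueError(f"unknown uncertainty tag: {uncertainty_tag}")
    principle, action = RULES[uncertainty_tag]
    rationale = f"Applying the {principle} principle: {action}."
    return principle, rationale
```

Keeping the rules in a single declarative table is what makes the behavior inspectable: the explanation module can show the user exactly which rule fired and why, with no probabilistic machinery involved.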


An End-to-End System for Culturally-Attuned Driving Feedback using a Dual-Component NLG Engine

Thompson, Iniakpokeikiye Peter, Yi, Dewei, Reiter, Ehud

arXiv.org Artificial Intelligence

This paper presents an end-to-end mobile system that delivers culturally-attuned safe driving feedback to drivers in Nigeria, a low-resource environment with significant infrastructural challenges. The core of the system is a novel dual-component Natural Language Generation (NLG) engine that provides both legally-grounded safety tips and persuasive, theory-driven behavioural reports. We describe the complete system architecture, including an automatic trip detection service, on-device behaviour analysis, and a sophisticated NLG pipeline that leverages a two-step reflection process to ensure high-quality feedback. The system also integrates a specialized machine learning model for detecting alcohol-influenced driving, a key local safety issue. The architecture is engineered for robustness against intermittent connectivity and noisy sensor data. A pilot deployment with 90 drivers demonstrates the viability of our approach, and initial results on detected unsafe behaviours are presented. This work provides a framework for applying data-to-text and AI systems to achieve social good.



My Life in Artificial Intelligence: People, anecdotes, and some lessons learnt

van Deemter, Kees

arXiv.org Artificial Intelligence

In this very personal workography, I relate my 40-year experiences as a researcher and educator in and around Artificial Intelligence (AI), more specifically Natural Language Processing. I describe how curiosity, and the circumstances of the day, led me to work in both industry and academia, and in various countries, including The Netherlands (Amsterdam, Eindhoven, and Utrecht), the USA (Stanford), England (Brighton), Scotland (Aberdeen), and China (Beijing and Harbin). People and anecdotes play a large role in my story; the history of AI forms its backdrop. I focus on things that might be of interest to (even) younger colleagues, given the choices they face in their own work and life at a time when AI is finally emerging from the shadows.


Natural Language Generation in Healthcare: A Review of Methods and Applications

Lyu, Mengxian, Li, Xiaohan, Chen, Ziyi, Pan, Jinqian, Peng, Cheng, Talankar, Sankalp, Wu, Yonghui

arXiv.org Artificial Intelligence

Natural language generation (NLG) is the key technology to achieve generative artificial intelligence (AI). With the breakthroughs in large language models (LLMs), NLG has been widely used in various medical applications, demonstrating the potential to enhance clinical workflows, support clinical decision-making, and improve clinical documentation. Heterogeneous and diverse medical data modalities, such as medical text, images, and knowledge bases, are utilized in NLG. Researchers have proposed many generative models and applied them in a number of healthcare applications. There is a need for a comprehensive review of NLG methods and applications in the medical domain. In this study, we systematically reviewed 113 scientific publications from a total of 3,988 NLG-related articles identified using a literature search, focusing on data modality, model architecture, clinical applications, and evaluation methods. Following PRISMA (Preferred Reporting Items for Systematic reviews and Meta-Analyses) guidelines, we categorize key methods, identify clinical applications, and assess their capabilities, limitations, and emerging challenges. This timely review covers the key NLG technologies and medical applications and provides valuable insights for future studies to leverage NLG to transform medical discovery and healthcare.


JaccDiv: A Metric and Benchmark for Quantifying Diversity of Generated Marketing Text in the Music Industry

Afzal, Anum, Mercier, Alexandre, Matthes, Florian

arXiv.org Artificial Intelligence

Online platforms are increasingly interested in using Data-to-Text technologies to generate content and help their users. Unfortunately, traditional generative methods often fall into repetitive patterns, resulting in monotonous galleries of texts after only a few iterations. In this paper, we investigate LLM-based data-to-text approaches to automatically generate marketing texts that are of sufficient quality and diverse enough for broad adoption. We leverage Language Models such as T5, GPT-3.5, GPT-4, and LLaMa2 in conjunction with fine-tuning, few-shot, and zero-shot approaches to set a baseline for diverse marketing texts. We also introduce JaccDiv, a metric to evaluate the diversity of a set of texts. This research extends its relevance beyond the music industry, proving beneficial in various fields where repetitive automated content generation is prevalent.


Anyprefer: An Agentic Framework for Preference Data Synthesis

Zhou, Yiyang, Wang, Zhaoyang, Wang, Tianle, Xing, Shangyu, Xia, Peng, Li, Bo, Zheng, Kaiyuan, Zhang, Zijian, Chen, Zhaorun, Zheng, Wenhao, Zhang, Xuchao, Bansal, Chetan, Zhang, Weitong, Wei, Ying, Bansal, Mohit, Yao, Huaxiu

arXiv.org Artificial Intelligence

High-quality preference data is essential for aligning foundation models with human values through preference learning. However, manual annotation of such data is often time-consuming and costly. Recent methods often adopt a self-rewarding approach, where the target model generates and annotates its own preference data, but this can lead to inaccuracies since the reward model shares weights with the target model, thereby amplifying inherent biases. To address these issues, we propose Anyprefer, a framework designed to synthesize high-quality preference data for aligning the target model. Anyprefer frames the data synthesis process as a cooperative two-player Markov Game in which the target model and the judge model collaborate. Here, a series of external tools are introduced to assist the judge model in accurately rewarding the target model's responses, mitigating biases in the rewarding process. In addition, a feedback mechanism is introduced to optimize prompts for both models, enhancing collaboration and improving data quality. The synthesized data is compiled into a new preference dataset, Anyprefer-V1, consisting of 58K high-quality preference pairs. Extensive experiments show that Anyprefer significantly improves model alignment performance across four main applications, covering 21 datasets, achieving average improvements of 18.55% in five natural language generation datasets, 3.66% in nine vision-language understanding datasets, 30.05% in three medical image analysis datasets, and 16.00% in four visuo-motor control tasks.