
Collaborating Authors

Riccardi, Giuseppe


LLMs as Repositories of Factual Knowledge: Limitations and Solutions

arXiv.org Artificial Intelligence

LLMs' sources of knowledge are data snapshots containing factual information about entities, collected at different timestamps and from different media types (e.g. wikis, social media, etc.). Such unstructured knowledge is subject to change as it is updated over time. Equally important are the inconsistencies and inaccuracies occurring across different information sources. Consequently, the model's knowledge about an entity may be perturbed while training over the sequence of snapshots or at inference time, resulting in inconsistent and inaccurate model performance. In this work, we study the appropriateness of Large Language Models (LLMs) as repositories of factual knowledge. We consider twenty-four state-of-the-art LLMs that are either closed-, partially (weights), or fully (weights and training data) open-source. We evaluate their reliability in responding to time-sensitive factual questions in terms of accuracy and consistency when prompts are perturbed. We further evaluate the effectiveness of state-of-the-art methods to improve LLMs' accuracy and consistency. We then propose "ENtity-Aware Fine-tuning" (ENAF), a soft neurosymbolic approach that provides a structured representation of entities during fine-tuning to improve the model's performance.
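The consistency evaluation above hinges on asking the same factual question in several perturbed forms. A minimal sketch of one such check, assuming a hypothetical ask_model inference call and an illustrative majority-vote consistency score, not the paper's exact protocol:

```python
# Sketch of a prompt-perturbation consistency check. `ask_model` is a
# hypothetical stand-in for any LLM inference API; the paraphrases and the
# majority-vote measure are assumptions for illustration.
from collections import Counter

def ask_model(prompt: str) -> str:
    """Hypothetical LLM call; replace with an actual inference API."""
    raise NotImplementedError

def consistency(entity: str, paraphrases: list[str]) -> float:
    """Fraction of paraphrased prompts that yield the modal answer."""
    answers = [ask_model(p.format(entity=entity)) for p in paraphrases]
    modal_count = Counter(answers).most_common(1)[0][1]
    return modal_count / len(answers)

paraphrases = [
    "Who is the CEO of {entity}?",
    "Name the current CEO of {entity}.",
    "{entity}'s chief executive officer is",
]
# score = consistency("OpenAI", paraphrases)  # 1.0 = fully consistent
```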


Will LLMs Replace the Encoder-Only Models in Temporal Relation Classification?

arXiv.org Artificial Intelligence

The automatic detection of temporal relations among events has mainly been investigated with encoder-only models such as RoBERTa. Large Language Models (LLMs) have recently shown promising performance in temporal reasoning tasks such as temporal question answering. Nevertheless, recent studies have tested only closed-source LLMs on detecting temporal relations, limiting the interpretability of those results. In this work, we investigate LLMs' performance and decision process in the Temporal Relation Classification task. First, we assess the performance of seven open- and closed-source LLMs, experimenting with in-context learning and lightweight fine-tuning approaches. Results show that LLMs with in-context learning significantly underperform smaller encoder-only models based on RoBERTa. Then, we delve into the possible reasons for this gap by applying explainability methods. The outcome suggests that LLMs are limited on this task by their autoregressive nature, which causes them to focus mostly on the last part of the sequence. Additionally, we evaluate the word embeddings of the two model families to better understand their pre-training differences. The code and the fine-tuned models are available on GitHub.
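One of the evaluated settings is in-context learning. A minimal sketch of what a few-shot prompt for Temporal Relation Classification might look like; the label set (a MATRES-style scheme) and the examples are illustrative assumptions, not the paper's actual prompt:

```python
# Sketch of a few-shot (in-context learning) prompt for Temporal Relation
# Classification. Labels and demonstrations are illustrative assumptions.
FEW_SHOT = """Classify the temporal relation between the two marked events.
Labels: BEFORE, AFTER, EQUAL, VAGUE.

Sentence: She [locked] the door and then [left].
Relation: BEFORE

Sentence: The alarm [rang] while he [slept].
Relation: VAGUE
"""

def build_prompt(sentence_with_marked_events: str) -> str:
    """Append the test instance to the few-shot prefix."""
    return f"{FEW_SHOT}\nSentence: {sentence_with_marked_events}\nRelation:"

print(build_prompt("He [resigned] after the board [voted]."))
```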


DyKnow: Dynamically Verifying Time-Sensitive Factual Knowledge in LLMs

arXiv.org Artificial Intelligence

LLMs acquire knowledge from massive data snapshots collected at different timestamps. Their knowledge is then commonly evaluated using static benchmarks. However, factual knowledge is generally subject to time-sensitive changes, and static benchmarks cannot address those cases. We present an approach to dynamically evaluate the knowledge in LLMs and its time-sensitivity against Wikidata, a publicly available, up-to-date knowledge graph. We evaluate the time-sensitive knowledge in twenty-four private and open-source LLMs, as well as the effectiveness of four editing methods in updating the outdated facts. Our results show that 1) outdatedness is a critical problem across state-of-the-art LLMs; 2) LLMs output inconsistent answers when prompted with slight variations of the question prompt; and 3) the performance of state-of-the-art knowledge editing algorithms is very limited, as they cannot reduce the cases of outdatedness and output inconsistency.
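The dynamic evaluation relies on retrieving the current value of a time-sensitive fact from Wikidata and comparing it with the model's answer. A minimal sketch against the public SPARQL endpoint; the example query (Italy's head of government) and the plain string comparison are illustrative assumptions:

```python
# Sketch of checking one time-sensitive fact against Wikidata's public
# SPARQL endpoint. Q38 = Italy, P6 = head of government.
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

def current_fact(query: str) -> str:
    """Run a SPARQL query and return the first result's English label."""
    resp = requests.get(
        SPARQL_ENDPOINT,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "dyknow-sketch/0.1"},
        timeout=30,
    )
    resp.raise_for_status()
    bindings = resp.json()["results"]["bindings"]
    return bindings[0]["label"]["value"]

QUERY = """
SELECT ?label WHERE {
  wd:Q38 wdt:P6 ?head .            # Italy -> current head of government
  ?head rdfs:label ?label .
  FILTER(LANG(?label) = "en")
}
"""
# up_to_date = current_fact(QUERY)
# is_outdated = llm_answer.strip().lower() != up_to_date.lower()
```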


Are LLMs Robust for Spoken Dialogues?

arXiv.org Artificial Intelligence

Large Pre-Trained Language Models have demonstrated state-of-the-art performance in different downstream tasks, including dialogue state tracking and end-to-end response generation. Nevertheless, most of the publicly available datasets and benchmarks on task-oriented dialogues (TODs) focus on written conversations. Consequently, the robustness of the developed models to spoken interactions is unknown. In this work, we evaluate the performance of LLMs for spoken task-oriented dialogues on the DSTC11 test sets. Due to the lack of proper spoken dialogue datasets, we automatically transcribed a development set of spoken dialogues with a state-of-the-art ASR engine. We characterized the ASR error types and their distributions and simulated these errors in a large dataset of dialogues. We report the intrinsic (perplexity) and extrinsic (human evaluation) performance of fine-tuned GPT-2 and T5 models on the two subtasks of response generation and dialogue state tracking, respectively. The results show that LLMs are not robust to spoken noise by default; however, fine-tuning or training such models on a proper dataset of spoken TODs can result in more robust performance.
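A minimal sketch of the error-simulation idea: estimate per-word error-type probabilities from ASR output, then inject matching noise into written dialogues. The distribution and confusion pairs below are invented placeholders, not the statistics estimated in the paper:

```python
# Sketch of injecting ASR-style noise into written dialogues. The error
# distribution and confusion table are hypothetical placeholders.
import random

# Hypothetical per-word error distribution (correct/substitute/delete).
ERROR_DIST = {"correct": 0.85, "substitute": 0.10, "delete": 0.05}
# Hypothetical phonetically plausible confusion pairs.
CONFUSIONS = {"two": "to", "for": "four", "there": "their"}

def simulate_asr(utterance: str, rng: random.Random) -> str:
    noisy = []
    for word in utterance.split():
        op = rng.choices(list(ERROR_DIST), weights=ERROR_DIST.values())[0]
        if op == "delete":
            continue  # simulate a deletion error
        if op == "substitute":
            noisy.append(CONFUSIONS.get(word.lower(), word))
        else:
            noisy.append(word)
    return " ".join(noisy)

rng = random.Random(0)
print(simulate_asr("I booked it for two nights", rng))
```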


Understanding Emotion Valence is a Joint Deep Learning Task

arXiv.org Artificial Intelligence

The valence analysis of speakers' utterances or written posts helps to understand the activation and variations of the emotional state throughout a conversation. More recently, the concept of Emotion Carriers (ECs) has been introduced to explain the emotion felt by the speaker and its manifestations. In this work, we investigate the natural inter-dependency of valence and ECs via a multi-task learning approach. We experiment with Pre-trained Language Models (PLMs) in single-task, two-step, and joint settings for the valence and EC prediction tasks. We compare and evaluate the performance of generative (GPT-2) and discriminative (BERT) architectures in each setting. We observe that providing the ground-truth label of one task improves the models' prediction performance on the other task. We further observe that the discriminative model achieves the best trade-off between the valence and EC prediction tasks in the joint prediction setting. As a result, we attain a single model that performs both tasks, thus saving computation resources at training and inference time.
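A minimal sketch of the joint setting with a discriminative backbone: one shared encoder with an utterance-level valence head and a token-level EC head, assuming a Hugging Face BERT model; the label counts are illustrative assumptions:

```python
# Sketch of a multi-task model: shared BERT encoder, two task heads.
from torch import nn
from transformers import AutoModel

class JointValenceEC(nn.Module):
    def __init__(self, model_name="bert-base-uncased",
                 n_valence=3, n_ec_tags=3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        # Utterance-level valence head (uses the [CLS] representation).
        self.valence_head = nn.Linear(hidden, n_valence)
        # Token-level Emotion Carrier head (e.g. BIO-style tagging).
        self.ec_head = nn.Linear(hidden, n_ec_tags)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        valence_logits = self.valence_head(out.last_hidden_state[:, 0])
        ec_logits = self.ec_head(out.last_hidden_state)
        return valence_logits, ec_logits

# Joint training would sum the two task losses, e.g.
# loss = ce(valence_logits, valence_y) + ce(ec_logits.transpose(1, 2), ec_y)
```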


What's New? Identifying the Unfolding of New Events in Narratives

arXiv.org Artificial Intelligence

Narratives are a rich source of events unfolding over time and context. Automatic understanding of these events provides a summarised comprehension of the narrative for further computation (such as reasoning). In this paper, we study the Information Status (IS) of the events and propose a novel and challenging task: the automatic identification of new events in a narrative. We define an event as a triplet of subject, predicate, and object. The event is categorized as new with respect to the discourse context and according to whether it can be inferred through commonsense reasoning. Using human annotators, we annotated a publicly available corpus of narratives with new events at the sentence level. We present the annotation protocol and study the quality of the annotation and the difficulty of the task. We publish the annotated dataset, annotation materials, and machine learning baseline models for the task of new event extraction for narrative understanding.
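A minimal sketch of the event representation described above; the field names for the Information Status labels are illustrative assumptions:

```python
# Sketch of the (subject, predicate, object) event triplet plus its
# Information Status labels. Field names are hypothetical.
from dataclasses import dataclass

@dataclass
class Event:
    subject: str
    predicate: str
    obj: str          # named "obj" to avoid shadowing the builtin "object"
    is_new: bool      # new with respect to the discourse context so far
    inferable: bool   # derivable via commonsense reasoning

e = Event(subject="the knight", predicate="drew", obj="his sword",
          is_new=True, inferable=False)
```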


Let's Give a Voice to Conversational Agents in Virtual Reality

arXiv.org Artificial Intelligence

The dialogue experience with conversational agents can be greatly enhanced with multimodal and immersive interactions in virtual reality. In this work, we present an open-source architecture with the goal of simplifying the development of conversational agents operating in virtual environments. The architecture offers the possibility of plugging in conversational agents of different domains and adding custom or cloud-based Speech-To-Text and Text-To-Speech models to make the interaction voice-based. Using this architecture, we present two conversational prototypes in the digital health domain, developed in Unity for both non-immersive displays and VR headsets.
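The released architecture is Unity-based; the Python sketch below only illustrates the plug-in pattern it describes, with agents and speech modules behind small interfaces so that custom or cloud-based STT/TTS can be swapped in. All class and method names are assumptions for illustration:

```python
# Sketch of the pluggable agent/STT/TTS pattern; names are hypothetical.
from abc import ABC, abstractmethod

class SpeechToText(ABC):
    @abstractmethod
    def transcribe(self, audio: bytes) -> str: ...

class TextToSpeech(ABC):
    @abstractmethod
    def synthesize(self, text: str) -> bytes: ...

class ConversationalAgent(ABC):
    @abstractmethod
    def respond(self, user_text: str) -> str: ...

def voice_turn(agent: ConversationalAgent, stt: SpeechToText,
               tts: TextToSpeech, audio_in: bytes) -> bytes:
    """One voice-based dialogue turn: audio in, audio out."""
    return tts.synthesize(agent.respond(stt.transcribe(audio_in)))
```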


Response Generation in Longitudinal Dialogues: Which Knowledge Representation Helps?

arXiv.org Artificial Intelligence

Longitudinal Dialogues (LDs) are the most challenging type of conversation for human-machine dialogue systems. LDs include the recollections of events, personal thoughts, and emotions specific to each individual in a sparse sequence of dialogue sessions. Dialogue systems designed for LDs should interact with the users uniquely over multiple sessions and long periods of time (e.g. weeks), and engage them in personal dialogues to elaborate on their feelings, thoughts, and real-life events. In this paper, we study the task of response generation in LDs. We evaluate whether general-purpose Pre-trained Language Models (PLMs) are appropriate for this purpose. We fine-tune two PLMs, GePpeTto (GPT-2) and iT5, using a dataset of LDs. We experiment with different representations of the personal knowledge extracted from LDs for grounded response generation, including a graph representation of the mentioned events and participants. We evaluate the performance of the models via automatic metrics and the contribution of the knowledge via the Integrated Gradients technique. We categorize the natural language generation errors via human evaluations of contextualization, appropriateness, and user engagement.
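A minimal sketch of one of the evaluated knowledge representations: the mentioned events and participants as a graph (here with networkx), linearized into the generation prompt. The relation names and linearization format are illustrative assumptions:

```python
# Sketch of a graph of events/participants linearized as grounding text.
import networkx as nx

g = nx.DiGraph()
g.add_edge("user", "job interview", relation="participant_in")
g.add_edge("job interview", "anxiety", relation="caused")

def linearize(graph: nx.DiGraph) -> str:
    """Flatten the graph into triples a PLM can consume as grounding."""
    return " ; ".join(
        f"{u} {d['relation']} {v}" for u, v, d in graph.edges(data=True)
    )

print(linearize(g))  # "user participant_in job interview ; ..."
```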


Evaluation of Interpretability for Deep Learning algorithms in EEG Emotion Recognition: A case study in Autism

arXiv.org Artificial Intelligence

Current Explainable Artificial Intelligence (XAI) models have shown an evident and quantified lack of reliability in measuring feature relevance when statistically entangled features are used to train deep classifiers. There has been an increase in the application of Deep Learning in clinical trials to predict the early diagnosis of neurodevelopmental disorders such as Autism Spectrum Disorder (ASD). However, the inclusion of more reliable saliency maps to obtain more trustworthy and interpretable metrics from neural activity features is still insufficiently mature for practical applications in diagnostics or clinical trials. Moreover, in ASD research the use of deep classifiers that predict viewed facial emotions from neural measures is relatively unexplored. Therefore, in this study we propose the evaluation of a Convolutional Neural Network (CNN) for electroencephalography (EEG)-based facial emotion recognition, complemented with a novel RemOve-And-Retrain (ROAR) methodology to recover the highly relevant features used by the classifier. Specifically, we compare well-known relevance maps such as Layer-Wise Relevance Propagation (LRP), PatternNet, Pattern-Attribution, and Smooth-Grad Squared. This study is the first to consolidate a more transparent feature-relevance calculation for successful EEG-based facial emotion recognition using a within-subject-trained CNN in typically developing and ASD individuals.
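A minimal sketch of the ROAR loop: mask the most relevant input features according to a saliency map, retrain from scratch, and track the accuracy drop. The train_and_eval callback and the mean-value masking are hypothetical stand-ins for the CNN training and relevance-map code:

```python
# Sketch of a RemOve-And-Retrain (ROAR) curve. `saliency` has X's shape;
# higher values mean more relevant per the XAI method being evaluated.
import numpy as np

def roar_curve(X: np.ndarray, y: np.ndarray, saliency: np.ndarray,
               train_and_eval, fractions=(0.1, 0.3, 0.5, 0.7, 0.9)):
    accs = []
    for frac in fractions:
        k = int(frac * X[0].size)
        X_masked = X.copy()
        for i in range(len(X)):
            # Indices of the k most relevant features for sample i.
            top = np.argsort(saliency[i].ravel())[-k:]
            # Replace them with an uninformative value (here the mean).
            X_masked[i].flat[top] = X[i].mean()
        # Retrain from scratch on the masked data and record accuracy.
        accs.append(train_and_eval(X_masked, y))
    return list(fractions), accs
```

A faithful relevance map should produce a steep accuracy drop as the fraction of removed features grows; a flat curve suggests the map did not identify truly informative features.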


Detecting Emotion Carriers by Combining Acoustic and Lexical Representations

arXiv.org Artificial Intelligence

Personal narratives (PNs) - spoken or written - are recollections of facts, people, events, and thoughts from one's own experience. Emotion recognition and sentiment analysis tasks are usually defined at the utterance or document level. In this work, however, we focus on Emotion Carriers (ECs), defined as the segments (of speech or text) that best explain the emotional state of the narrator ("loss of father", "made me choose"). Once extracted, such ECs can provide a richer representation of the user state to improve natural language understanding and dialogue modeling. Previous work has shown that ECs can be identified using lexical features. However, spoken narratives should provide a richer description of the context and of the users' emotional state. In this paper, we leverage word-based acoustic and textual embeddings as well as early and late fusion techniques for the detection of ECs in spoken narratives. For the acoustic word-level representations, we use Residual Neural Networks (ResNets) pretrained on separate speech emotion corpora and fine-tuned to detect ECs. Experiments with different fusion and system combination strategies show that late fusion leads to significant improvements on this task.
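A minimal sketch of late fusion for EC detection: combine per-word scores from separate acoustic and lexical taggers. Weighted averaging of softmax scores is one common late-fusion choice and an assumption here, not necessarily the paper's combination strategy:

```python
# Sketch of late fusion over per-word acoustic and lexical tag scores.
import torch

def late_fusion(acoustic_logits: torch.Tensor,
                lexical_logits: torch.Tensor,
                alpha: float = 0.5) -> torch.Tensor:
    """Both tensors: (n_words, n_tags). Returns a fused tag per word."""
    fused = (alpha * acoustic_logits.softmax(-1)
             + (1 - alpha) * lexical_logits.softmax(-1))
    return fused.argmax(-1)

a = torch.randn(6, 2)     # e.g. ResNet-based acoustic word scores
l = torch.randn(6, 2)     # e.g. lexical (embedding-based) word scores
print(late_fusion(a, l))  # one 0/1 tag per word: EC or not
```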