miscommunication
The Decrypto Benchmark for Multi-Agent Reasoning and Theory of Mind
Lupu, Andrei, Willi, Timon, Foerster, Jakob
As Large Language Models (LLMs) gain agentic abilities, they will have to navigate complex multi-agent scenarios, interacting with human users and other agents in cooperative and competitive settings. This will require new reasoning skills, chief amongst them being theory of mind (ToM), or the ability to reason about the "mental" states of other agents. However, ToM and other multi-agent abilities in LLMs are poorly understood, since existing benchmarks suffer from narrow scope, data leakage, saturation, and lack of interactivity. We thus propose Decrypto, a game-based benchmark for multi-agent reasoning and ToM drawing inspiration from cognitive science, computational pragmatics and multi-agent reinforcement learning. It is designed to be as easy as possible in all other dimensions, eliminating confounding factors commonly found in other benchmarks. To our knowledge, it is also the first platform for designing interactive ToM experiments. We validate the benchmark design through comprehensive empirical evaluations of frontier LLMs, robustness studies, and human-AI cross-play experiments. We find that LLM game-playing abilities lag behind humans and simple word-embedding baselines. We then create variants of two classic cognitive science experiments within Decrypto to evaluate three key ToM abilities. Surprisingly, we find that state-of-the-art reasoning models are significantly worse at those tasks than their older counterparts. This demonstrates that Decrypto addresses a crucial gap in current reasoning and ToM evaluations, and paves the path towards better artificial agents.
Why Robots Are Bad at Detecting Their Mistakes: Limitations of Miscommunication Detection in Human-Robot Dialogue
Janssens, Ruben, De Bock, Jens, Labat, Sofie, Verhelst, Eva, Hoste, Veronique, Belpaeme, Tony
Detecting miscommunication in human-robot interaction is a critical function for maintaining user engagement and trust. While humans effortlessly detect communication errors in conversations through both verbal and non-verbal cues, robots face significant challenges in interpreting nonverbal feedback, despite advances in computer vision for recognizing affective expressions. This research evaluates the effectiveness of machine learning models in detecting miscommunications in robot dialogue. Using a multi-modal dataset of 240 human-robot conversations, where four distinct types of conversational failures were systematically introduced, we assess the performance of state-of-the-art computer vision models. After each conversational turn, users provided feedback on whether they perceived an error, enabling an analysis of the models' ability to accurately detect robot mistakes. Despite using state-of-the-art models, the performance barely exceeds random chance in identifying miscommunication, while on a dataset with more expressive emotional content, they successfully identified confused states. To explore the underlying cause, we asked human raters to do the same. They could also only identify around half of the induced miscommunications, similarly to our model. These results uncover a fundamental limitation in identifying robot miscommunications in dialogue: even when users perceive the induced miscommunication as such, they often do not communicate this to their robotic conversation partner. This knowledge can shape expectations of the performance of computer vision models and can help researchers to design better human-robot conversations by deliberately eliciting feedback where needed. In dialogue, individuals do more than merely interpret their interlocutors' words; they also seek feedback regarding the ongoing interaction.
CoPrUS: Consistency Preserving Utterance Synthesis towards more realistic benchmark dialogues
Steindl, Sebastian, Schäfer, Ulrich, Ludwig, Bernd
Large-scale Wizard-Of-Oz dialogue datasets have enabled the training of deep learning-based dialogue systems. While they are successful as benchmark datasets, they lack certain types of utterances, which would make them more realistic. In this work, we investigate the creation of synthetic communication errors in an automatic pipeline. Based on linguistic theory, we propose and follow a simple error taxonomy. We focus on three types of miscommunications that could happen in real-world dialogues but are underrepresented in the benchmark dataset: misunderstandings, non-understandings and vaguely related questions. Our two-step approach uses a state-of-the-art Large Language Model (LLM) to first create the error and secondly the repairing utterance. We perform Language Model-based evaluation to ensure the quality of the generated utterances. We apply the method to the MultiWOZ dataset and evaluate it both qualitatively and empirically as well as with human judges. Our results indicate that current LLMs can aid in adding post-hoc miscommunications to benchmark datasets as a form of data augmentation. We publish the resulting dataset, in which nearly 1900 dialogues have been modified, as CoPrUS-MultiWOZ to facilitate future work on dialogue systems.
No that's not what I meant: Handling Third Position Repair in Conversational Question Answering
Balaraman, Vevake, Eshghi, Arash, Konstas, Ioannis, Papaioannou, Ioannis
The ability to handle miscommunication is crucial to robust and faithful conversational AI. People usually deal with miscommunication immediately as they detect it, using highly systematic interactional mechanisms called repair. One important type of repair is Third Position Repair (TPR) whereby a speaker is initially misunderstood but then corrects the misunderstanding as it becomes apparent after the addressee's erroneous response. Here, we collect and publicly release Repair-QA, the first large dataset of TPRs in a conversational question answering (QA) setting. The data is comprised of the TPR turns, corresponding dialogue contexts, and candidate repairs of the original turn for execution of TPRs. We demonstrate the usefulness of the data by training and evaluating strong baseline models for executing TPRs. For stand-alone TPR execution, we perform both automatic and human evaluations on a fine-tuned T5 model, as well as OpenAI's GPT-3 LLMs. Additionally, we extrinsically evaluate the LLMs' TPR processing capabilities in the downstream conversational QA task. The results indicate poor out-of-the-box performance on TPRs by the GPT-3 models, which then significantly improves when exposed to Repair-QA.
Why Implementing RPA in your Revenue Cycle is Crucial?
Are you tired of spending countless hours on mundane, repetitive tasks that drain your energy and hinder your productivity? Do you wish there was a way to streamline your operations and reduce errors while freeing up your time to focus on growing your business? Look no further than Robotic Process Automation (RPA)! RPA is a technology that uses software robots to automate tedious and time-consuming tasks, freeing up valuable resources and improving overall efficiency. As a Healthcare Revenue Cycle Business Owner, you can benefit from RPA in several ways.
Trump Crony Proves Widespread Voter Fraud Doesn't Exist
Did voter fraud swing New Hampshire away from Donald Trump in the 2016 election? Absolutely not, according to an exhaustive investigation conducted by the state's attorney general and secretary of state, which, counter to Trump's persistent allegations, turned up no evidence of "serious voter fraud." Instead, the inquiry provided further evidence that the tools Republicans use to detect voter fraud are fatally flawed, churning out a huge number of false positives. And while the New Hampshire investigation ultimately debunked Trump's paranoia, it came perilously close to disenfranchising thousands of lawful voters. Republicans have seized upon New Hampshire as the putative epicenter of American voter fraud for two reasons.
How Not To Lie With Statistics
"What is truth?" and "What is a lie?" are questions that have drawn the attention of philosophers, theologians, legal scholars and intellectuals of many kinds for centuries. I am not a scholar or intellectual, merely a hardhat statistician working in marketing research and what is vaguely called data science. Regardless of what we do for a living, however, all of us are consumers of statistics at work and in our daily lives. "Statistics" can refer to figures or mathematical models, and either can be used to deceive us, are often misinterpreted or can be flat out wrong. Deception in various forms can be found in nature, and pet owners may have noticed that it is not exclusively a human trait.
Emoji meanings vary hugely between platforms, meaning characters can lead to vast miscommunication, study finds
Towards Overcoming Miscommunication in Situated Dialogue by Asking Questions
Marge, Matthew (Carnegie Mellon University) | Rudnicky, Alexander I. (Carnegie Mellon University)
Situated dialogue is prominent in the robot navigation task, where a human gives route instructions (i.e., a sequence of navigation commands) to an agent. We propose an approach for situated dialogue agents whereby they use strategies such as asking questions to repair or recover from unclear instructions, namely those that an agent misunderstands or considers ambiguous. Most immediately in this work we study examples from existing human-human dialogue corpora and relate them to our proposed approach.
Detecting, Repairing, and Preventing Human-Machine Miscommunication
The next portion of the workshop was devoted to different approaches to preventing and repairing miscommunication. These sessions represent a progression from work that clarifies the model of miscommunication to work that describes the strategies used to repair miscommunication. The last session was the presentation of work involving deployed systems using speech as a mode of interaction. ... between different parts of their discourse or between the discourse model and the domain model. Research related to achieving robust interaction is an important problem ... Early work concerned the correction of spelling or grammatical errors in a user's utterance so ... I review the most significant issues ... The approaches were constrained by their impact on overall system performance ... have assumed that the system's model is always correct. ... differed in two dimensions: First, experimenters ...