
Collaborating Authors

Khebour, Ibrahim


TRACE: Real-Time Multimodal Common Ground Tracking in Situated Collaborative Dialogues

arXiv.org Artificial Intelligence

In situations involving hybrid human-AI teams, although there is an increasing desire for AIs that act as collaborators with humans, modern AI systems struggle to account for such mental states in their human interlocutors (Sap et al., 2022; Ullman, 2023) that might expose shared or conflicting beliefs, and thus predict and explain in-context behavior (Premack and Woodruff, 1978). Additionally, in realistic scenarios such as collaborative problem solving (Nelson, 2013), beliefs are communicated not just through language, but through multimodal signals including gestures, tone of voice, and interaction with the physical environment (VanderHoeven et al., 2024b). Since one of the critical capabilities that makes human-human collaboration so successful is the human ability to interpret multiple coordinated signals, we present the following novel and unique contributions in a single system:

- Real-time tracking of participant speech, actions, gesture, and gaze when engaging in a shared task;
- On-the-fly interpretation and integration of multimodal signals to provide a complete scene representation for inference;
- Simultaneous detection of asserted propositional content and epistemic positioning to infer task-relevant information for which evidence has been raised, or which the group has agreed is factual;
- A modular, extensible architecture adaptable to new tasks and scenarios.
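Since the contributions above enumerate the state such a system must maintain, a minimal sketch of a fused common-ground representation may help make them concrete. All class and field names below (ParticipantSignals, CommonGroundState, raise_evidence, accept) are illustrative assumptions, not the actual TRACE interfaces.

```python
# Illustrative sketch only: names and fields are assumptions, not TRACE's API.
# It mirrors the contributions listed above: per-participant multimodal signals
# fused into one scene state, plus the propositions for which evidence has been
# raised or which the group has agreed are factual.
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ParticipantSignals:
    """Latest observations for one participant in the shared task."""
    transcript: List[str] = field(default_factory=list)   # incremental ASR output
    gestures: List[str] = field(default_factory=list)     # e.g. "point:block_red"
    gaze_target: str = "unknown"                           # object or partner in focus
    actions: List[str] = field(default_factory=list)      # physical manipulations


@dataclass
class CommonGroundState:
    """Fused scene representation updated on every new multimodal frame."""
    participants: Dict[str, ParticipantSignals] = field(default_factory=dict)
    evidenced: Set[str] = field(default_factory=set)   # propositions with raised evidence
    accepted: Set[str] = field(default_factory=set)     # propositions the group agrees are factual

    def raise_evidence(self, proposition: str) -> None:
        self.evidenced.add(proposition)

    def accept(self, proposition: str) -> None:
        # Acceptance promotes a proposition from "evidence raised" to "agreed factual".
        self.evidenced.discard(proposition)
        self.accepted.add(proposition)


# Example: acceptance after evidence has been raised.
state = CommonGroundState()
state.raise_evidence("block A is heavier than block B")
state.accept("block A is heavier than block B")
```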


Speech Is Not Enough: Interpreting Nonverbal Indicators of Common Knowledge and Engagement

arXiv.org Artificial Intelligence

Our goal is to develop an AI Partner that can provide support for group problem solving and social dynamics. In multi-party working group environments, multimodal analytics is crucial for identifying non-verbal interactions of group members. In conjunction with their verbal participation, this creates a holistic understanding of collaboration and engagement that provides necessary context for the AI Partner. In this demo, we illustrate our present capabilities for detecting and tracking nonverbal behavior in student task-oriented interactions in the classroom, and the implications for tracking common ground and engagement.
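As a rough illustration of how verbal participation and nonverbal cues might be folded into a single engagement signal, here is a hedged sketch; the event labels, weights, and the engagement_score function are hypothetical choices for exposition, not the demo system's method.

```python
# Hypothetical sketch of fusing verbal and nonverbal cues into a per-student
# engagement score; the weights and event names are illustrative assumptions.
from collections import Counter
from typing import Iterable


def engagement_score(events: Iterable[str]) -> float:
    """Weight simple event types observed for one participant in a time window."""
    weights = {"utterance": 1.0, "gesture": 0.5, "gaze_at_task": 0.3, "gaze_away": -0.3}
    counts = Counter(events)
    return sum(weights.get(event, 0.0) * n for event, n in counts.items())


# Example: a student who speaks twice, points once, and looks at the task.
print(engagement_score(["utterance", "utterance", "gesture", "gaze_at_task"]))  # 2.8
```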


Common Ground Tracking in Multimodal Dialogue

arXiv.org Artificial Intelligence

Within Dialogue Modeling research in AI and NLP, considerable attention has been devoted to "dialogue state tracking" (DST), which is the ability to update the representations of the speaker's needs at each turn in the dialogue by taking into account the past dialogue moves and history. Less studied but just as important to dialogue modeling, however, is "common ground tracking" (CGT), which identifies the shared belief space held by all of the participants in a task-oriented dialogue: the task-relevant propositions all participants accept as true. In this paper we present a method for automatically identifying the current set of shared beliefs and "questions under discussion" (QUDs) of a group with a shared goal. We annotate a dataset of multimodal interactions in a shared physical space with speech transcriptions, prosodic features, gestures, actions, and facets of collaboration, and operationalize these features for use in a deep neural model to predict moves toward construction of common ground. Model outputs cascade into a set of formal closure rules derived from situated evidence and belief axioms and update operations. We empirically assess the contribution of each feature type toward successful construction of common ground relative to ground truth, establishing a benchmark in this novel, challenging task.
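The cascade from predicted moves into closure rules can be pictured with a simplified stand-in like the following; the move labels (STATEMENT, ACCEPT, DOUBT) and the three sets standing in for QUDs, evidenced propositions, and accepted facts are assumptions chosen for illustration, not the paper's exact axioms or update operations.

```python
# Simplified stand-in for the cascade described above: a classifier emits a
# dialogue move for each utterance, and closure-style rules revise the shared
# belief space. Labels and set names are illustrative assumptions.
from typing import Set, Tuple


def update_common_ground(
    move: str,
    proposition: str,
    quds: Set[str],
    evidence: Set[str],
    facts: Set[str],
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Apply one predicted move to the (QUDs, evidence, facts) representation."""
    if move == "STATEMENT":
        # Asserting a proposition raises it as evidence and puts it under discussion.
        evidence.add(proposition)
        quds.add(proposition)
    elif move == "ACCEPT" and proposition in evidence:
        # Group acceptance promotes evidenced content to a shared fact.
        facts.add(proposition)
        quds.discard(proposition)
    elif move == "DOUBT":
        # Doubt keeps the proposition open for discussion without accepting it.
        quds.add(proposition)
        facts.discard(proposition)
    return quds, evidence, facts


# Example: a statement followed by group acceptance.
q, e, f = set(), set(), set()
for mv in ("STATEMENT", "ACCEPT"):
    q, e, f = update_common_ground(mv, "block A weighs 10", q, e, f)
print(f)  # {'block A weighs 10'}
```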


How Good is Automatic Segmentation as a Multimodal Discourse Annotation Aid?

arXiv.org Artificial Intelligence

Collaborative problem solving (CPS) in teams is tightly coupled with the creation of shared meaning between participants in a situated, collaborative task. In this work, we assess the quality of different utterance segmentation techniques as an aid in annotating CPS. We (1) manually transcribe utterances in a dataset of triads collaboratively solving a problem involving dialogue and physical object manipulation, (2) annotate collaborative moves according to these gold-standard transcripts, and then (3) apply these annotations to utterances that have been automatically segmented using toolkits from Google and OpenAI's Whisper. We show that the oracle utterances have minimal correspondence to automatically segmented speech, and that automatically segmented speech using different segmentation methods is also inconsistent. We also show that annotating automatically segmented speech has distinct implications compared with annotating oracle utterances: since most annotation schemes are designed for oracle cases, when annotating automatically segmented utterances, annotators must invoke other information to make arbitrary judgments which other annotators may not replicate. We conclude with a discussion of how future annotation specs can account for these needs.
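One plausible way to quantify the reported mismatch between oracle utterances and automatically segmented speech is temporal intersection-over-union matching. The sketch below uses an assumed 0.5 IoU threshold and a greedy match, which is not necessarily the evaluation used in the paper.

```python
# Illustrative correspondence metric between oracle utterance boundaries and
# automatic segments: greedy matching on temporal intersection-over-union.
# The threshold and matching scheme are assumptions.
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)


def iou(a: Segment, b: Segment) -> float:
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0


def match_rate(oracle: List[Segment], auto: List[Segment], thresh: float = 0.5) -> float:
    """Fraction of oracle utterances with some automatic segment above the IoU threshold."""
    matched = sum(1 for o in oracle if any(iou(o, a) >= thresh for a in auto))
    return matched / len(oracle) if oracle else 0.0


# Example: two oracle utterances, only one recovered by the automatic segmenter.
print(match_rate([(0.0, 2.0), (2.5, 4.0)], [(0.1, 1.9), (3.9, 5.0)]))  # 0.5
```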