Collaborating Authors

Gabriel Skantze


Visual Cues Enhance Predictive Turn-Taking for Two-Party Human Interaction

Russell, Sam O'Connor, Harte, Naomi

arXiv.org Artificial Intelligence

Turn-taking is richly multimodal. Predictive turn-taking models (PTTMs) facilitate naturalistic human-robot interaction, yet most rely solely on speech. We introduce MM-VAP, a multimodal PTTM which combines speech with visual cues including facial expression, head pose and gaze. We find that it outperforms the state-of-the-art audio-only model in videoconferencing interactions (84% vs. 79% hold/shift prediction accuracy). Unlike prior work, which aggregates all holds and shifts, we group them by the duration of silence between turns. This reveals that, through the inclusion of visual features, MM-VAP outperforms a state-of-the-art audio-only turn-taking model across all durations of speaker transition. We conduct a detailed ablation study, which reveals that facial expression features contribute the most to model performance. Our working hypothesis is therefore that when interlocutors can see one another, visual cues are vital for turn-taking and must be included for accurate turn-taking prediction. We additionally validate the suitability of automatic speech alignment for PTTM training using telephone speech. This work represents the first comprehensive analysis of multimodal PTTMs. We discuss implications for future work and make all code publicly available.
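
The hold/shift classification described above can be pictured as a fusion of per-turn audio and visual feature vectors. Below is a minimal, hypothetical late-fusion sketch in PyTorch; it is not the authors' MM-VAP code, and all feature dimensions and layer sizes are assumptions made for illustration.

```python
# Hypothetical late-fusion hold/shift classifier (not the authors' MM-VAP
# implementation). All feature dimensions below are assumed.
import torch
import torch.nn as nn

class MultimodalHoldShift(nn.Module):
    def __init__(self, audio_dim=256, visual_dim=64, hidden=128):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden)
        # visual stream: facial expression, head pose and gaze features,
        # assumed concatenated into one vector upstream
        self.visual_proj = nn.Linear(visual_dim, hidden)
        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Linear(2 * hidden, 2),  # logits: hold vs. shift
        )

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (batch, audio_dim); visual_feats: (batch, visual_dim)
        h = torch.cat([self.audio_proj(audio_feats),
                       self.visual_proj(visual_feats)], dim=-1)
        return self.classifier(h)

model = MultimodalHoldShift()
logits = model(torch.randn(4, 256), torch.randn(4, 64))
print(logits.shape)  # torch.Size([4, 2])
```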


Prompt-Guided Turn-Taking Prediction

Inoue, Koji, Elmers, Mikey, Fu, Yahui, Pang, Zi Haur, Lala, Divesh, Ochi, Keiko, Kawahara, Tatsuya

arXiv.org Artificial Intelligence

Turn-taking prediction models are essential components in spoken dialogue systems and conversational robots. Recent approaches leverage transformer-based architectures to predict speech activity continuously and in real time. In this study, we propose a novel model that enables turn-taking prediction to be dynamically controlled via textual prompts. This approach allows intuitive and explicit control through instructions such as "faster" or "calmer", adapting dynamically to conversational partners and contexts. The proposed model builds upon a transformer-based voice activity projection (VAP) model, incorporating textual prompt embeddings into both channel-wise transformers and a cross-channel transformer. We evaluated the feasibility of our approach using over 950 hours of human-human spoken dialogue data. Since textual prompt data for the proposed approach was not available in existing datasets, we utilized a large language model (LLM) to generate synthetic prompt sentences. Experimental results demonstrated that the proposed model improved prediction accuracy and effectively varied turn-taking timing behaviors according to the textual prompts.
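
As a rough illustration of how a textual prompt might condition a frame-level transformer, the sketch below prepends a pooled prompt embedding to the audio frame sequence. This is an assumption-laden toy, not the proposed architecture: the paper injects prompt embeddings into channel-wise and cross-channel transformers, and real VAP heads predict a discrete projection of future voice activity rather than two per-frame logits.

```python
# Toy prompt-conditioned frame encoder (illustrative only; dimensions,
# vocabulary size, and the two-logit head are assumptions).
import torch
import torch.nn as nn

class PromptConditionedVAP(nn.Module):
    def __init__(self, d_model=128, vocab=1000):
        super().__init__()
        self.prompt_embed = nn.EmbeddingBag(vocab, d_model)  # pooled prompt tokens
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 2)  # per-frame activity, one per speaker

    def forward(self, frames, prompt_ids):
        # frames: (batch, time, d_model); prompt_ids: (batch, prompt_len)
        p = self.prompt_embed(prompt_ids).unsqueeze(1)  # (batch, 1, d_model)
        x = torch.cat([p, frames], dim=1)               # prepend prompt token
        return self.head(self.encoder(x)[:, 1:])        # drop prompt position

model = PromptConditionedVAP()
out = model(torch.randn(2, 50, 128), torch.randint(0, 1000, (2, 6)))
print(out.shape)  # torch.Size([2, 50, 2])
```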


Robot Talk Episode 125 – Chatting with robots, with Gabriel Skantze

Robohub

Gabriel Skantze is a Professor of Speech Communication and Technology at KTH Royal Institute of Technology. He specializes in conversational systems and leads several research projects on conversational AI and human-robot interaction. His work focuses on computational models of spoken interaction, integrating both verbal and non-verbal aspects such as prosody, turn-taking, feedback, and joint attention. In 2014, he co-founded Furhat Robotics, where he continues to serve part-time as Chief Scientist.


Large Language Models Know What To Say But Not When To Speak

Umair, Muhammad, Sarathy, Vasanth, de Ruiter, JP

arXiv.org Artificial Intelligence

Turn-taking is a fundamental mechanism in human communication that ensures smooth and coherent verbal interactions. Recent advances in Large Language Models (LLMs) have motivated their use in improving the turn-taking capabilities of Spoken Dialogue Systems (SDS), such as their ability to respond at appropriate times. However, existing models often struggle to predict opportunities for speaking -- called Transition Relevance Places (TRPs) -- in natural, unscripted conversations, focusing only on turn-final TRPs and not within-turn TRPs. To address these limitations, we introduce a novel dataset of participant-labeled within-turn TRPs and use it to evaluate the performance of state-of-the-art LLMs in predicting opportunities for speaking. Our experiments reveal the current limitations of LLMs in modeling unscripted spoken interactions, highlighting areas for improvement and paving the way for more naturalistic dialogue systems.
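
One way to quantify the gap the abstract describes is to score model-predicted TRP times against participant labels within a tolerance window. The helper below is a hypothetical evaluation sketch, not the paper's protocol; the function name, tolerance, and example times are invented for illustration.

```python
# Hypothetical TRP evaluation: match predicted TRP times (seconds) against
# participant-labeled within-turn TRPs within a tolerance window (assumed).
def trp_scores(predicted, labeled, tol=0.25):
    """Return (precision, recall) of predicted TRP times vs. labels."""
    matched = {p for p in predicted if any(abs(p - l) <= tol for l in labeled)}
    hit_labels = {l for l in labeled if any(abs(p - l) <= tol for p in predicted)}
    precision = len(matched) / len(predicted) if predicted else 0.0
    recall = len(hit_labels) / len(labeled) if labeled else 0.0
    return precision, recall

print(trp_scores(predicted=[1.1, 2.4, 5.0], labeled=[1.0, 3.2, 4.9]))
# (0.6666666666666666, 0.6666666666666666)
```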


Multilingual Turn-taking Prediction Using Voice Activity Projection

Inoue, Koji, Jiang, Bing'er, Ekstedt, Erik, Kawahara, Tatsuya, Skantze, Gabriel

arXiv.org Artificial Intelligence

This paper investigates the application of voice activity projection (VAP), a predictive turn-taking model for spoken dialogue, to multilingual data encompassing English, Mandarin, and Japanese. The VAP model continuously predicts the upcoming voice activities of both participants in dyadic dialogue, leveraging a cross-attention Transformer to capture the dynamic interplay between them. The results show that a monolingual VAP model trained on one language does not make good predictions when applied to other languages. However, a multilingual model, trained on all three languages, demonstrates predictive performance on par with monolingual models across all languages. Further analyses show that the multilingual model has learned to discern the language of the input signal. We also analyze sensitivity to pitch, a prosodic cue that is thought to be important for turn-taking. Finally, we compare two different audio encoders: contrastive predictive coding (CPC) pre-trained on English, and a recent model based on multilingual wav2vec 2.0 (MMS).
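
The cross-attention idea at the heart of VAP, where each participant's frame sequence attends to the other's, can be sketched as follows. This is an illustrative toy rather than the released model; the model dimension and head count are assumed.

```python
# Rough sketch of cross-channel attention between two speakers' frame
# sequences (illustrative; not the released VAP implementation).
import torch
import torch.nn as nn

class CrossChannel(nn.Module):
    def __init__(self, d_model=128, nhead=4):
        super().__init__()
        self.attn_a = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.attn_b = nn.MultiheadAttention(d_model, nhead, batch_first=True)

    def forward(self, spk_a, spk_b):
        # spk_a, spk_b: (batch, time, d_model) per-channel encodings
        a2b, _ = self.attn_a(spk_a, spk_b, spk_b)  # A's frames query B's
        b2a, _ = self.attn_b(spk_b, spk_a, spk_a)  # B's frames query A's
        return a2b, b2a

xa, xb = torch.randn(2, 100, 128), torch.randn(2, 100, 128)
a, b = CrossChannel()(xa, xb)
print(a.shape, b.shape)  # torch.Size([2, 100, 128]) twice
```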


Real-time and Continuous Turn-taking Prediction Using Voice Activity Projection

Inoue, Koji, Jiang, Bing'er, Ekstedt, Erik, Kawahara, Tatsuya, Skantze, Gabriel

arXiv.org Artificial Intelligence

A demonstration of a real-time and continuous turn-taking prediction system is presented. The system is based on a voice activity projection (VAP) model, which directly maps stereo dialogue audio to future voice activities. The VAP model includes contrastive predictive coding (CPC) and self-attention transformers, followed by a cross-attention transformer. We examine the effect of the input audio context length and demonstrate that the proposed system can operate in real time on a CPU with minimal performance degradation.
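
The context-length trade-off mentioned above can be pictured as a sliding audio buffer that is re-fed to the model at each hop: a shorter context lowers CPU cost at some accuracy cost. The snippet below is a hypothetical sketch; the sample rate, context length, and hop size are assumed values, and `model` stands in for the VAP forward pass.

```python
# Hypothetical sliding-context buffer for real-time inference.
# SAMPLE_RATE, CONTEXT_SEC, and HOP_SEC are assumed values.
import collections

SAMPLE_RATE = 16_000
CONTEXT_SEC, HOP_SEC = 20.0, 0.5
buffer = collections.deque(maxlen=int(CONTEXT_SEC * SAMPLE_RATE))

def on_audio_hop(chunk, model):
    """Append the newest hop and re-run the model on the rolling context."""
    buffer.extend(chunk)
    return model(list(buffer))  # model: stereo audio -> future voice activity

# demo with a trivial stand-in model that just reports the context length
print(on_audio_hop([0.0] * int(HOP_SEC * SAMPLE_RATE), model=len))  # 8000
```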


When can I Speak? Predicting initiation points for spoken dialogue agents

Li, Siyan, Paranjape, Ashwin, Manning, Christopher D.

arXiv.org Artificial Intelligence

Current spoken dialogue systems initiate their turns after a long period of silence (700-1000 ms), which leads to little real-time feedback, sluggish responses, and an overall stilted conversational flow. Humans typically respond within 200 ms, and successfully predicting initiation points in advance would allow spoken dialogue agents to do the same. In this work, we predict the lead time to initiation using prosodic features from a pre-trained speech representation model (wav2vec 1.0) operating on user audio, and word features from a pre-trained language model (GPT-2) operating on incremental transcriptions. To evaluate errors, we propose two metrics with respect to predicted and true lead times. We train and evaluate the models on the Switchboard Corpus and find that our method outperforms features from prior work on both metrics, and vastly outperforms the common approach of waiting for 700 ms of silence.
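
A bare-bones version of the described setup is a regressor over concatenated prosodic and lexical feature vectors. The sketch below is illustrative rather than the paper's model; the 512- and 768-dimensional inputs are assumptions loosely matching wav2vec 1.0 and GPT-2 feature sizes, and the hidden size is invented.

```python
# Hypothetical lead-time regressor over concatenated prosodic and word
# features (not the paper's exact model; all sizes are assumptions).
import torch
import torch.nn as nn

class LeadTimeRegressor(nn.Module):
    def __init__(self, prosody_dim=512, word_dim=768, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(prosody_dim + word_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # predicted lead time to initiation (seconds)
        )

    def forward(self, prosody, words):
        return self.net(torch.cat([prosody, words], dim=-1)).squeeze(-1)

model = LeadTimeRegressor()
pred = model(torch.randn(8, 512), torch.randn(8, 768))
print(pred.shape)  # torch.Size([8])
```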


Would you be happy being interviewed by a robot?

BBC News

The world's first robot designed to carry out unbiased job interviews is being tested by Swedish recruiters. But can it really do a better job than humans? Measuring 41cm (16in) tall and weighing 3.5kg (7.7lbs), she's at eye level as she sits on top of a table directly across from the candidate she's about to interview. Her glowing yellow face tilts slightly to the side. Then she blinks and smiles lightly as she poses her first question: "Have you ever been interviewed by a robot before?"
