Seamless Interaction: Dyadic Audiovisual Motion Modeling and Large-Scale Dataset
Agrawal, Vasu, Akinyemi, Akinniyi, Alvero, Kathryn, Behrooz, Morteza, Buffalini, Julia, Carlucci, Fabio Maria, Chen, Joy, Chen, Junming, Chen, Zhang, Cheng, Shiyang, Chowdary, Praveen, Chuang, Joe, D'Avirro, Antony, Daly, Jon, Dong, Ning, Duppenthaler, Mark, Gao, Cynthia, Girard, Jeff, Gleize, Martin, Gomez, Sahir, Gong, Hongyu, Govindarajan, Srivathsan, Han, Brandon, He, Sen, Hernandez, Denise, Hristov, Yordan, Huang, Rongjie, Inaguma, Hirofumi, Jain, Somya, Janardhan, Raj, Jia, Qingyao, Klaiber, Christopher, Kovachev, Dejan, Kumar, Moneish, Li, Hang, Li, Yilei, Litvin, Pavel, Liu, Wei, Ma, Guangyao, Ma, Jing, Ma, Martin, Ma, Xutai, Mantovani, Lucas, Miglani, Sagar, Mohan, Sreyas, Morency, Louis-Philippe, Ng, Evonne, Ng, Kam-Woh, Nguyen, Tu Anh, Oberai, Amia, Peloquin, Benjamin, Pino, Juan, Popovic, Jovan, Poursaeed, Omid, Prada, Fabian, Rakotoarison, Alice, Ranjan, Rakesh, Richard, Alexander, Ropers, Christophe, Saleem, Safiyyah, Sharma, Vasu, Shcherbyna, Alex, Shen, Jia, Shen, Jie, Stathopoulos, Anastasis, Sun, Anna, Tomasello, Paden, Tran, Tuan, Turkatenko, Arina, Wan, Bo, Wang, Chao, Wang, Jeff, Williamson, Mary, Wood, Carleigh, Xiang, Tao, Yang, Yilin, Yao, Julien, Zhang, Chen, Zhang, Jiemin, Zhang, Xinyue, Zheng, Jason, Zhyzheria, Pavlo, Zikes, Jan, Zollhoefer, Michael
arXiv.org Artificial Intelligence
Human communication involves a complex interplay of verbal and nonverbal signals, essential for conveying meaning and achieving interpersonal goals. To develop socially intelligent AI technologies, it is crucial to build models that can both comprehend and generate dyadic behavioral dynamics. To this end, we introduce the Seamless Interaction Dataset, a large-scale collection of over 4,000 hours of face-to-face interaction footage from over 4,000 participants in diverse contexts. This dataset enables the development of AI technologies that understand dyadic embodied dynamics, unlocking breakthroughs in virtual agents, telepresence experiences, and multimodal content analysis tools. We also develop a suite of models that utilize the dataset to generate dyadic motion gestures and facial expressions aligned with human speech. These models can take as input both the speech and visual behavior of their interlocutors. We present a variant with speech from an LLM and integrations with 2D and 3D rendering methods, bringing us closer to interactive virtual agents. Additionally, we describe controllable variants of our motion models that can adapt emotional responses and expressivity levels, as well as generate more semantically relevant gestures. Finally, we discuss methods for assessing the quality of these dyadic motion models, demonstrating the potential for more intuitive and responsive human-AI interactions.
Jul-2-2025