Towards human-like spoken dialogue generation between AI agents from written dialogue

Mitsui, Kentaro, Hono, Yukiya, Sawada, Kei

Oct-2-2023–arXiv.org Artificial Intelligence

The advent of large language models (LLMs) has made it possible to generate natural written dialogues between two agents. However, generating human-like spoken dialogues from these written dialogues remains challenging. Spoken dialogues have several unique characteristics: they frequently include backchannels and laughter, and the smoothness of turn-taking significantly influences the fluidity of conversation. This study proposes CHATS -- CHatty Agents Text-to-Speech -- a discrete token-based system designed to generate spoken dialogues based on written dialogues. Our system can generate speech for both the speaker side and the listener side simultaneously, using only the transcription from the speaker side, which eliminates the need for transcriptions of backchannels or laughter. Moreover, CHATS facilitates natural turn-taking; it determines the appropriate duration of silence after each utterance in the absence of overlap, and it initiates the generation of overlapping speech based on the phoneme sequence of the next utterance in case of overlap. Experimental evaluations indicate that CHATS outperforms the text-to-speech baseline, producing spoken dialogues that are more interactive and fluid while retaining clarity and intelligibility. Large Language Models (LLMs) have profoundly influenced the field of natural language processing (NLP) and artificial intelligence (AI) (Zhao et al., 2023). LLMs, with their capacity to generate coherent and contextually relevant content, have enabled more natural text-based dialogues between humans and computers and paved the way for inter-computer communication. The recently proposed concept of Generative Agents (Park et al., 2023) underscores the potential of LLMs, where emulated agents within the model engage in autonomous dialogues, store information, and initiate actions. This emerging paradigm of agent-to-agent communication offers vast potential across various sectors, from entertainment to facilitating human-to-human information exchange. However, considering the dominance of spoken communication in human interactions, integrating voice into machine dialogues can provide a richer expression of individuality and emotion, offering a more genuine experience.

dialogue, laughter, utterance, (16 more...)

arXiv.org Artificial Intelligence

Oct-2-2023

arXiv.org PDF

Add feedback

Country:
- Oceania > New Zealand
  - South Island > Canterbury Region > Christchurch (0.04)
- North America
  - United States
    - Rhode Island (0.04)
    - Minnesota > Hennepin County
      - Minneapolis (0.14)
    - Louisiana > Orleans Parish
      - New Orleans (0.04)
    - Illinois > Cook County
      - Chicago (0.04)
    - Hawaii > Honolulu County
      - Honolulu (0.04)
    - California
      - San Francisco County > San Francisco (0.04)
      - San Diego County > San Diego (0.04)
      - Alameda County > Oakland (0.04)
  - Canada > Quebec
    - Montreal (0.04)
- Europe
  - Greece (0.04)
  - Ireland > Leinster
    - County Dublin > Dublin (0.04)
  - Germany > Saarland
    - Saarbrücken (0.04)
  - Austria > Styria
    - Graz (0.04)
- Asia
  - Singapore (0.04)
  - South Korea > Incheon
    - Incheon (0.04)
  - Middle East > Qatar
    - Ad-Dawhah > Doha (0.04)
  - Japan > Honshū
    - Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
  - India > Telangana
    - Hyderabad (0.04)

Genre:
- Research Report > New Finding (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Large Language Model (1.00)
    - Discourse & Dialogue (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.46)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found