SODA: Million-scale Dialogue Distillation with Social Commonsense Contextualization

Hyunwoo Kim, Jack Hessel, Liwei Jiang, Peter West, Ximing Lu, Youngjae Yu, Pei Zhou, Ronan Le Bras, Malihe Alikhani, Gunhee Kim, Maarten Sap, Yejin Choi

arXiv.org Artificial Intelligence 

Data scarcity has been a long-standing issue in the field of open-domain social dialogue. To address this scarcity, we present SODA: the first publicly available, million-scale, high-quality social dialogue dataset. By contextualizing social commonsense knowledge from a knowledge graph, we are able to distill an exceptionally broad spectrum of social interactions from a large language model. Human evaluation shows that conversations in SODA are more consistent, specific, and (surprisingly) natural than those in prior human-authored datasets. Using SODA, we train COSMO: a generalizable conversation model that is significantly more natural and consistent on unseen datasets than the best-performing conversation models (e.g., GODEL, BlenderBot-1, Koala, Vicuna). Experiments reveal that COSMO is sometimes even preferred to the original human-written gold responses. Additionally, our results shed light on the distinction between knowledge-enriched conversations and natural social chitchat. We plan to make our data, model, and code public.
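To make the contextualization idea in the abstract concrete, below is a minimal, hypothetical sketch of how a social-commonsense triple might be turned into a narrative prompt and then into a dialogue-generation prompt for a large language model. It is not the authors' released pipeline: the example triple, the relation template, and the `llm()` stub are placeholder assumptions for illustration only.

```python
# Illustrative sketch (not SODA's actual code): contextualizing a commonsense
# triple into a narrative, then prompting an LLM for a grounded conversation.

def llm(prompt: str) -> str:
    """Placeholder for a call to a large language model of your choice."""
    raise NotImplementedError("Plug in your own LLM client here.")

# A hypothetical (head, relation, tail) triple in the style of a social
# commonsense knowledge graph.
triple = ("PersonX moves to a new city", "xWant", "to make new friends")

# Step 1: verbalize the triple into a plain sentence.
relation_templates = {"xWant": "{head}. As a result, PersonX wants {tail}."}
sentence = relation_templates[triple[1]].format(head=triple[0], tail=triple[2])

# Step 2: ask the LLM to expand the sentence into a short narrative
# with concrete, named characters.
narrative_prompt = (
    "Rewrite the following as a two-sentence story with concrete names:\n"
    f"{sentence}\nStory:"
)

# Step 3: ask the LLM to write a conversation grounded in that narrative.
def dialogue_prompt(narrative: str) -> str:
    return (
        f"{narrative}\n\n"
        "Write a natural conversation between the two people in the story, "
        "one utterance per line, prefixed with the speaker's name."
    )

if __name__ == "__main__":
    print(narrative_prompt)  # inspect the contextualization prompt
    # narrative = llm(narrative_prompt)
    # dialogue = llm(dialogue_prompt(narrative))
```

Running the script only prints the constructed prompt; the commented-out calls show where an actual LLM would be invoked to produce the narrative and, from it, the distilled dialogue.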
