When Crowd Meets Persona: Creating a Large-Scale Open-Domain Persona Dialogue Corpus

Cho, Won Ik, Lee, Yoon Kyung, Bae, Seoyeon, Kim, Jihwan, Park, Sangah, Kim, Moosung, Hahn, Sowon, Kim, Nam Soo

Apr-1-2023–arXiv.org Artificial Intelligence

Building a natural language dataset requires caution, since word semantics are vulnerable to subtle changes in the text or in the definition of the annotated concept. This tendency can be seen in generative tasks such as question answering and dialogue generation, and also in tasks that create a categorization-based corpus, such as topic classification or sentiment analysis. Open-domain conversations involve two or more crowdworkers freely conversing about any topic, and collecting such data is particularly difficult for two reasons: 1) the dataset should be "crafted" rather than "obtained" due to privacy concerns, and 2) dialogues created for pay may differ from how crowdworkers converse in real-world settings. In this study, we tackle these issues when creating a large-scale open-domain persona dialogue corpus, where persona implies that the conversation is performed by several actors with a fixed persona and user-side workers from an unspecified crowd.

artificial intelligence, dialogue, natural language

Country:
- North America > United States
  - Texas (0.04)
- Europe
  - United Kingdom > England
    - Oxfordshire > Oxford (0.04)
  - Netherlands > South Holland
    - Leiden (0.04)
  - Czechia > South Moravian Region
    - Brno (0.04)
- Asia > South Korea
  - Seoul > Seoul (0.04)

Genre:
- Research Report > New Finding (1.00)
- Personal > Interview (0.93)

Industry:
- Health & Medicine > Therapeutic Area (0.67)
- Information Technology > Security & Privacy (0.66)

Technology:
- Information Technology
  - Communications (1.00)
  - Artificial Intelligence > Natural Language
    - Chatbot (0.46)
    - Discourse & Dialogue (0.34)
