Generalizing Conversational Dense Retrieval via LLM-Cognition Data Augmentation

Chen, Haonan, Dou, Zhicheng, Mao, Kelong, Liu, Jiongnan, Zhao, Ziliang

Feb-10-2024 – arXiv.org Artificial Intelligence

Conversational search utilizes multi-turn natural language contexts to retrieve relevant passages. Existing conversational dense retrieval models mostly view a conversation as a fixed sequence of questions and responses, overlooking the severe data sparsity problem -- that is, users can perform a conversation in various ways, and these alternative conversations are unrecorded. Consequently, these models often struggle to generalize to diverse conversations in real-world scenarios. In this work, we propose a framework for generalizing Conversational dense retrieval via LLM-cognition data Augmentation (ConvAug). ConvAug first generates multi-level augmented conversations to capture the diverse nature of conversational contexts. Inspired by human cognition, we devise a cognition-aware process to mitigate the generation of false positives, false negatives, and hallucinations. Moreover, we develop a difficulty-adaptive sample filter that selects challenging samples for complex conversations, thereby giving the model a larger learning space. A contrastive learning objective is then employed to train a better conversational context encoder. Extensive experiments conducted on four public datasets, under both normal and zero-shot settings, demonstrate the effectiveness, generalizability, and applicability of ConvAug.
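
To make the contrastive step concrete, the following is a minimal sketch of a generic InfoNCE-style loss, which pulls an original conversation embedding toward an augmented positive and pushes it away from negatives. It is an illustration under assumptions, not the objective defined in the paper: the function name `info_nce_loss`, the cosine similarity, and the `temperature` value are all hypothetical choices, and plain Python lists stand in for encoder outputs.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a non-zero vector to unit length so dot() gives cosine similarity."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss for one anchor (assumed form, not the paper's).

    anchor    -- embedding of the original conversation context
    positive  -- embedding of an augmented conversation with the same intent
    negatives -- embeddings of conversations with a different intent
    """
    a = normalize(anchor)
    candidates = [normalize(positive)] + [normalize(n) for n in negatives]
    logits = [dot(a, c) / temperature for c in candidates]

    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))

    # Negative log-probability of picking the positive among all candidates.
    return log_denominator - logits[0]


if __name__ == "__main__":
    anchor = [1.0, 0.0]
    aligned = info_nce_loss(anchor, [0.9, 0.1], [[0.0, 1.0]])
    misaligned = info_nce_loss(anchor, [0.0, 1.0], [[0.9, 0.1]])
    print(aligned < misaligned)
```

The loss shrinks as the positive moves closer to the anchor than the negatives do, which is the behavior a contrastive encoder objective relies on. How ConvAug actually builds positives and negatives, and how its sample filter weights them, is specified in the paper rather than here.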

Country:
- Pacific Ocean > North Pacific Ocean
  - San Francisco Bay > Golden Gate (0.04)
- Oceania > Australia
  - Queensland (0.04)
- North America
  - United States
    - Texas > Travis County
      - Austin (0.04)
    - New York > New York County
      - New York City (0.04)
    - Maryland > Montgomery County
      - Gaithersburg (0.04)
    - California > Los Angeles County
      - Long Beach (0.04)
  - Canada > Ontario
    - Toronto (0.04)
- Europe
  - Italy (0.04)
  - Denmark (0.04)
  - Belgium (0.04)
  - Austria (0.04)
  - Spain > Galicia
    - Madrid (0.04)
- Asia
  - Singapore (0.04)
  - China (0.04)
  - Taiwan > Taiwan Province
    - Taipei (0.04)
  - Middle East > UAE
    - Abu Dhabi Emirate > Abu Dhabi (0.14)

Genre:
- Research Report > New Finding (0.46)

Industry:
- Media > Film (1.00)
- Leisure & Entertainment (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning (1.00)