Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources
Lupidi, Alisia, Gemmell, Carlos, Cancedda, Nicola, Dwivedi-Yu, Jane, Weston, Jason, Foerster, Jakob, Raileanu, Roberta, Lomeli, Maria
arXiv.org Artificial Intelligence
Large Language Models still struggle in challenging scenarios that involve structured data, complex reasoning, or tool usage. In this paper, we propose Source2Synth, a new method for teaching LLMs new skills without relying on costly human annotations. Source2Synth takes a custom data source as input and produces synthetic data points with intermediate reasoning steps grounded in real-world sources. It improves dataset quality by discarding low-quality generations based on their answerability. We demonstrate the generality of this approach by applying it to two challenging domains: reasoning in multi-hop question answering (MHQA) and tool usage in tabular question answering (TQA). Our method improves performance by 25.51% for TQA on WikiSQL and 22.57% for MHQA on HotPotQA compared to the fine-tuned baselines.
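The curation step described above (discarding generations based on answerability) can be sketched as a simple filter: keep a synthetic example only if an answering model recovers its target answer. This is a minimal illustration, not the paper's implementation; the function names are hypothetical, and a crude extractive heuristic stands in for the LLM that the paper would use to answer candidate questions.

```python
def curate_by_answerability(candidates, answer_fn):
    """Keep only synthetic examples the answering model can solve,
    mirroring a Source2Synth-style discard-if-unanswerable step.
    `answer_fn(question, context)` stands in for an LLM call."""
    kept = []
    for example in candidates:
        predicted = answer_fn(example["question"], example["context"])
        if predicted == example["answer"]:
            kept.append(example)
    return kept


def toy_answer_fn(question, context):
    # Hypothetical stand-in for an LLM: returns the first context word
    # not already present in the question (a crude extractive "answer").
    question_words = set(question.lower().split())
    for word in context.split():
        if word.lower() not in question_words:
            return word
    return ""


# Two toy synthetic candidates: the first is answerable from its
# context, the second is not, so curation should discard it.
candidates = [
    {"question": "capital of France ?",
     "context": "capital of France Paris",
     "answer": "Paris"},
    {"question": "capital of France ?",
     "context": "capital of France",
     "answer": "Lyon"},
]
curated = curate_by_answerability(candidates, toy_answer_fn)
```

In the paper's setting, both generation and answering would be performed by an LLM over real data sources; the point of the sketch is only the filtering logic.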
Sep-12-2024