CLIPPER: Compression enables long-context synthetic data generation
Chau Minh Pham, Yapei Chang, Mohit Iyyer
arXiv.org Artificial Intelligence
LLM developers increasingly rely on synthetic data, but generating high-quality data for complex long-context reasoning tasks remains challenging. We introduce CLIPPER, a compression-based approach for generating synthetic data tailored to narrative claim verification, a task that requires reasoning over a book to verify a given claim. Instead of generating claims directly from the raw text of the book, which results in artifact-riddled claims, CLIPPER first compresses the book into chapter outlines and book summaries and then uses these intermediate representations to generate complex claims and corresponding chains of thought. Compared to naive approaches, CLIPPER produces claims that are more valid, grounded, and complex. Using CLIPPER, we construct a dataset of 19K synthetic book claims paired with their source texts and chain-of-thought reasoning, and use it to fine-tune three open-weight models. Our best model raises accuracy on narrative claim verification from 28% to 76% on our test set and sets a new state of the art for sub-10B models on the NoCha leaderboard. Further analysis shows that our models generate more detailed and grounded chain-of-thought reasoning while also improving performance on other narrative understanding tasks (e.g., NarrativeQA).
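The two-stage idea in the abstract (compress first, then generate claims from the compressed representations) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `LLM` callable, the function names, the `SyntheticClaim` class, and the prompt wording are all hypothetical placeholders, and any real text-generation function could be passed in as `llm`.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical type: any function mapping a prompt string to generated text.
LLM = Callable[[str], str]


@dataclass
class SyntheticClaim:
    claim: str
    chain_of_thought: str


def compress_book(chapters: List[str], llm: LLM) -> Tuple[List[str], str]:
    """Stage 1 (assumed shape): one outline per chapter, then a book summary."""
    outlines = [llm("Outline the key events of this chapter:\n" + ch) for ch in chapters]
    summary = llm("Summarize the book from these chapter outlines:\n" + "\n".join(outlines))
    return outlines, summary


def generate_claim(outlines: List[str], summary: str, llm: LLM) -> SyntheticClaim:
    """Stage 2 (assumed shape): prompt only with the compressed text, never the raw book."""
    context = summary + "\n" + "\n".join(outlines)
    claim = llm("Write a claim about the book that requires reasoning over it:\n" + context)
    reasoning = llm("Explain step by step how to verify this claim: " + claim + "\n" + context)
    return SyntheticClaim(claim=claim, chain_of_thought=reasoning)
```

The point of the sketch is the data flow: the raw chapters are read only during compression, so the claim-generation prompts see outlines and the summary rather than the full book text.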
Feb-20-2025
- Country:
  - Asia
    - China > Guangxi Province
      - Nanning (0.04)
    - Myanmar > Tanintharyi Region
      - Dawei (0.04)
    - Singapore (0.04)
  - Europe
    - Italy > Calabria
      - Catanzaro Province > Catanzaro (0.04)
    - Spain > Catalonia
      - Barcelona Province > Barcelona (0.04)
  - North America
    - Canada > Ontario
      - Toronto (0.04)
    - United States
      - Louisiana > Orleans Parish
        - New Orleans (0.04)
      - Maryland > Prince George's County
        - College Park (0.04)
      - Massachusetts
        - Hampshire County > Amherst (0.04)
        - Middlesex County > Cambridge (0.04)
      - Minnesota (0.04)
  - Oceania > Australia
  - South America > French Guiana
- Genre:
  - Research Report > New Finding (0.48)
- Technology: