chapter outline
CLIPPER: Compression enables long-context synthetic data generation
Chau Minh Pham, Yapei Chang, Mohit Iyyer
LLM developers are increasingly reliant on synthetic data, but generating high-quality data for complex long-context reasoning tasks remains challenging. We introduce CLIPPER, a compression-based approach for generating synthetic data tailored to narrative claim verification, a task that requires reasoning over a book to verify a given claim. Instead of generating claims directly from the raw text of the book, which results in artifact-riddled claims, CLIPPER first compresses the book into chapter outlines and book summaries and then uses these intermediate representations to generate complex claims and corresponding chains of thought. Compared to naive approaches, CLIPPER produces claims that are more valid, grounded, and complex. Using CLIPPER, we construct a dataset of 19K synthetic book claims paired with their source texts and chain-of-thought reasoning, and use it to fine-tune three open-weight models. Our best model improves accuracy on narrative claim verification from 28% to 76% on our test set and sets a new state of the art for sub-10B models on the NoCha leaderboard. Further analysis shows that our models generate more detailed and grounded chain-of-thought reasoning while also improving performance on other narrative understanding tasks (e.g., NarrativeQA).
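The two-stage idea in the abstract (compress first, then generate claims from the compressed forms) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `LLM` callable, the prompt wording, the `Compressed` container, and the `echo_llm` stand-in are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: any function mapping a prompt string to a completion.
LLM = Callable[[str], str]


@dataclass
class Compressed:
    outlines: List[str]  # one outline per chapter
    summary: str         # one summary for the whole book


def compress_book(chapters: List[str], llm: LLM) -> Compressed:
    """Stage 1: compress raw chapters into outlines, then into a book summary."""
    # Prompt wording is illustrative only, not taken from the paper.
    outlines = [llm(f"Outline the key events of this chapter:\n{ch}") for ch in chapters]
    summary = llm("Summarize the book from these chapter outlines:\n" + "\n".join(outlines))
    return Compressed(outlines, summary)


def generate_claim(compressed: Compressed, llm: LLM) -> str:
    """Stage 2: generate a claim and its reasoning from the compressed forms only."""
    prompt = (
        "Using only the summary and outlines below, write one claim about the book "
        "and a chain of thought that verifies it.\n"
        f"Summary:\n{compressed.summary}\nOutlines:\n" + "\n".join(compressed.outlines)
    )
    return llm(prompt)


def echo_llm(prompt: str) -> str:
    """Stand-in model for illustration: returns the first line of the prompt."""
    return prompt.splitlines()[0]
```

The point of the design is that `generate_claim` never sees the raw chapter text, only the outlines and summary; swapping `echo_llm` for a real model client would be the only change needed to try it.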
Generating a full-length work of fiction with GPT-4
The goal of this project was to have the GPT-4 version of ChatGPT, the latest instructional large language model, generate an entire novel from scratch, including the title, genre, story, characters, settings, and all the writing, with no human input. It is currently impossible to do this using a single prompt ("write me a book"), but it is possible to supply a series of prompts that give structure to the process and allow the model to complete this large task one step at a time. However, to ensure that all the creative work is done by GPT-4, prompts are not allowed to make specific references to the content of the book, only to the book's structure. The intention is that the process should be simple, mechanical, and possible (in principle) to fully automate. Each time the process is repeated from the beginning, it should create another entirely new book, based solely on GPT-4's independent creative choices.
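A mechanical prompt chain of the kind described might look roughly like the sketch below. The prompts in `STRUCTURE_PROMPTS`, the `Chat` callable, and the `run_chain` helper are hypothetical placeholders, not the prompts or tooling used in the project; they only illustrate the constraint that each prompt refers to the book's structure and never to its content.

```python
from typing import Callable, List

# Hypothetical interface: takes the conversation so far (alternating
# prompt/reply strings) and returns the model's next reply.
Chat = Callable[[List[str]], str]

# Illustrative structure-only prompts; none mentions plot, names, or setting.
STRUCTURE_PROMPTS = [
    "Choose a genre and a title for a new novel.",
    "Write a one-paragraph premise for the novel.",
    "List the main characters and settings.",
    "Write a chapter-by-chapter outline.",
]


def run_chain(chat: Chat, num_chapters: int) -> List[str]:
    """Feed the fixed prompts in order, keeping the full history as context."""
    chapter_prompts = [f"Write chapter {i}." for i in range(1, num_chapters + 1)]
    history: List[str] = []
    for prompt in STRUCTURE_PROMPTS + chapter_prompts:
        history.append(prompt)
        history.append(chat(history))  # the reply becomes context for later steps
    return history
```

Because the prompt list is fixed and content-free, rerunning `run_chain` from an empty history with a sampling model would start a new book each time, which is the property the project aims for.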