Modular Techniques for Synthetic Long-Context Data Generation in Language Model Training and Evaluation
Subramanian, Seganrasan; Verma, Abhigya
arXiv.org Artificial Intelligence
The ability of large language models (LLMs) to process and reason over long textual inputs is critical for a wide range of real-world applications. However, progress in this area is significantly constrained by the absence of high-quality, diverse, and verifiable long-context datasets suitable for both training and evaluation. This work introduces a modular, extensible framework for synthetic long-context data generation via prompt-based interaction with LLMs. The framework supports multiple training and alignment objectives, including Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO). It encompasses four core generation paradigms: multi-turn conversational dialogues, document-grounded input-output pairs, verifiable instruction-response tasks, and long-context reasoning examples. Through templated prompting, a model-agnostic architecture, and metadata-enriched outputs, the proposed approach facilitates scalable, controllable, and purpose-aligned dataset creation for advancing long-context capabilities in LLMs.
Sep-5-2025
- Country:
- North America > United States (0.68)
- Europe (0.46)
- Genre:
- Research Report (0.82)
- Industry:
- Law (1.00)
- Education (0.68)
- Information Technology > Security & Privacy (0.67)
- Technology: