Extract-0: A Specialized Language Model for Document Information Extraction
– arXiv.org Artificial Intelligence
This paper presents Extract-0, a 7-billion parameter language model specifically optimized for document information extraction that achieves performance exceeding models with parameter counts several orders of magnitude larger. Through a novel combination of synthetic data generation, supervised fine-tuning with Low-Rank Adaptation (LoRA), and reinforcement learning via Group Relative Policy Optimization (GRPO), Extract-0 achieves a mean reward of 0.573 on a benchmark of 1,000 diverse document extraction tasks, outperforming GPT-4.1 (0.457), o3 (0.464), and GPT-4.1-2025 (0.459). The training methodology employs a memory-preserving synthetic data generation pipeline that produces 280,128 training examples from diverse document sources, followed by parameter-efficient fine-tuning that modifies only 0.53% of model weights (40.4M out of 7.66B parameters). The reinforcement learning phase introduces a novel semantic similarity-based reward function that handles the inherent ambiguity in information extraction tasks. This research demonstrates that task-specific optimization can yield models that surpass general-purpose systems while requiring substantially fewer computational resources.
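The abstract's two quantitative claims can be illustrated with a short sketch: the LoRA trainable-parameter fraction follows directly from the stated counts, and the semantic similarity-based reward can be approximated with a toy string-similarity function. The paper's actual reward implementation is not given here; `difflib.SequenceMatcher` below is purely a stand-in assumption for illustration.

```python
# Illustrative sketch only; not the paper's implementation.
from difflib import SequenceMatcher

# LoRA trainable-parameter fraction, from the counts quoted in the abstract.
trainable = 40.4e6   # LoRA-adapted parameters
total = 7.66e9       # full model parameter count
fraction = trainable / total
print(f"trainable fraction: {fraction:.2%}")  # ~0.53%, matching the abstract

def toy_semantic_reward(predicted: str, reference: str) -> float:
    """Toy reward in [0, 1]: character-level similarity as a crude proxy
    for the semantic-similarity reward described in the paper, which
    tolerates surface-form variation in extracted values."""
    return SequenceMatcher(None, predicted, reference).ratio()

# An exact-match reward would score this extraction 0; a similarity-based
# reward gives partial credit for the semantically overlapping answer.
print(toy_semantic_reward("John Smith, CFO",
                          "John Smith, Chief Financial Officer"))
```

The design point this sketches is why exact-match scoring is a poor fit for extraction: two answers can differ in surface form yet convey the same fact, so a graded reward gives the RL phase a usable learning signal.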
Sep-30-2025
- Country:
- Asia > Middle East
- Jordan (0.04)
- North America > United States
- Michigan > Isabella County (0.04)
- South America > Brazil
- São Paulo (0.04)
- Genre:
- Research Report (1.00)
- Industry:
- Banking & Finance (0.46)
- Government (0.46)
- Law (0.47)