Extract-0: A Specialized Language Model for Document Information Extraction
– arXiv.org Artificial Intelligence
This paper presents Extract-0, a 7-billion parameter language model specifically optimized for document information extraction that achieves performance exceeding models with parameter counts several orders of magnitude larger. Through a novel combination of synthetic data generation, supervised fine-tuning with Low-Rank Adaptation (LoRA), and reinforcement learning via Group Relative Policy Optimization (GRPO), Extract-0 achieves a mean reward of 0.573 on a benchmark of 1,000 diverse document extraction tasks, outperforming GPT-4.1 (0.457), o3 (0.464), and GPT-4.1-2025 (0.459). The training methodology employs a memory-preserving synthetic data generation pipeline that produces 280,128 training examples from diverse document sources, followed by parameter-efficient fine-tuning that modifies only 0.53% of model weights (40.4M out of 7.66B parameters). The reinforcement learning phase introduces a novel semantic similarity-based reward function that handles the inherent ambiguity in information extraction tasks. This research demonstrates that task-specific optimization can yield models that surpass general-purpose systems while requiring substantially fewer computational resources.
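The abstract's two quantitative claims can be illustrated with a short sketch: the LoRA trainable-parameter fraction follows directly from the stated counts, and the semantic similarity-based reward can be approximated with a toy string-similarity function. The paper's actual reward implementation is not given here; `difflib.SequenceMatcher` below is purely a stand-in assumption for illustration.

```python
# Illustrative sketch only; not the paper's implementation.
from difflib import SequenceMatcher

# LoRA trainable-parameter fraction, from the counts quoted in the abstract.
trainable = 40.4e6   # LoRA-adapted parameters
total = 7.66e9       # full model parameter count
fraction = trainable / total
print(f"trainable fraction: {fraction:.2%}")  # ~0.53%, matching the abstract

def toy_semantic_reward(predicted: str, reference: str) -> float:
    """Toy reward in [0, 1]: character-level similarity as a crude proxy
    for the semantic-similarity reward described in the paper, which
    tolerates surface-form variation in extracted values."""
    return SequenceMatcher(None, predicted, reference).ratio()

# An exact-match reward would score this extraction 0; a similarity-based
# reward gives partial credit for the semantically overlapping answer.
print(toy_semantic_reward("John Smith, CFO",
                          "John Smith, Chief Financial Officer"))
```

The design point this sketches is why exact-match scoring is a poor fit for extraction: two answers can differ in surface form yet convey the same fact, so a graded reward gives the RL phase a usable learning signal.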
Sep-30-2025
- Country:
- Asia > Middle East
- Jordan (0.04)
- North America > United States
- Michigan > Isabella County (0.04)
- South America > Brazil
- São Paulo (0.04)
- Genre:
- Research Report (1.00)
- Industry:
- Banking & Finance (0.46)
- Government (0.46)
- Law (0.47)