ORPO-Distill: Mixed-Policy Preference Optimization for Cross-Architecture LLM Distillation
Aasheesh Singh, Vishal Vaddina, Dagnachew Birru
arXiv.org Artificial Intelligence
We introduce ORPO-Distill, a general-purpose method for cross-architecture LLM distillation that formulates the problem as a preference optimization task. Unlike standard chain-of-thought (CoT) distillation, the approach transfers knowledge through diverse reasoning traces. It employs an Odds-Ratio Preference Optimization (ORPO) objective that contrasts teacher and student traces for more effective learning, and adopts a mixed-policy strategy for incorporating student-generated outputs that outperforms both purely off-policy and purely on-policy alternatives. Experiments on five datasets and multiple student models show consistent improvements over conventional black-box knowledge-distillation baselines.
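To make the abstract's description concrete, below is a minimal sketch of an ORPO-style objective in which teacher reasoning traces serve as the preferred ("chosen") responses and student-generated traces as the dispreferred ("rejected") ones. This is not the authors' implementation: the function name `orpo_distill_loss`, the argument names, and the weighting term `lam` are illustrative assumptions based on the standard ORPO formulation (SFT loss plus a log-odds-ratio penalty); the paper's exact loss and mixed-policy sampling schedule may differ.

```python
# Hypothetical sketch of an ORPO-style distillation loss (not the paper's code).
# Assumes the student model has already scored both traces; teacher traces are
# treated as "chosen" and student traces as "rejected".
import torch
import torch.nn.functional as F


def orpo_distill_loss(chosen_logps: torch.Tensor,
                      rejected_logps: torch.Tensor,
                      sft_nll: torch.Tensor,
                      lam: float = 0.1) -> torch.Tensor:
    """ORPO-style loss: an SFT term on the teacher trace plus an odds-ratio
    term contrasting teacher (chosen) and student (rejected) traces.

    chosen_logps / rejected_logps: length-normalized log P(y|x) under the
        student model for the teacher and student traces, respectively.
    sft_nll: token-averaged negative log-likelihood of the teacher trace.
    lam: weight of the odds-ratio term (an assumed hyperparameter).
    """
    # log-odds: log(p / (1 - p)), computed stably from log p (log p < 0 here).
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    # Encourage higher odds for teacher traces than for student traces.
    or_loss = -F.logsigmoid(log_odds_chosen - log_odds_rejected)
    return (sft_nll + lam * or_loss).mean()
```

In a mixed-policy setup as described in the abstract, one would plausibly draw the "rejected" traces partly from the student's current policy (on-policy samples) and partly from a fixed pool of earlier student generations (off-policy samples); the mixing ratio is not specified in this abstract.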
Sep-30-2025