JudgeLRM: Large Reasoning Models as a Judge
Nuo Chen, Zhiyuan Hu, Qingyun Zou, Jiaying Wu, Qian Wang, Bryan Hooi, Bingsheng He
arXiv.org Artificial Intelligence
The rise of Large Language Models (LLMs) as evaluators offers a scalable alternative to human annotation, yet existing Supervised Fine-Tuning (SFT) approaches for judge models often fall short in domains requiring complex reasoning. In this work, we investigate whether LLM judges truly benefit from enhanced reasoning capabilities. Through a detailed analysis of reasoning requirements across evaluation tasks, we reveal a negative correlation between SFT performance gains and the proportion of reasoning-demanding samples, highlighting the limitations of SFT in such scenarios. To address this, we introduce JudgeLRM, a family of judgment-oriented LLMs trained using reinforcement learning (RL) with judge-wise, outcome-driven rewards. JudgeLRM models consistently outperform both SFT-tuned and state-of-the-art reasoning models. Notably, JudgeLRM-3B surpasses GPT-4, and JudgeLRM-7B outperforms DeepSeek-R1 by 2.79% in F1 score, particularly excelling in judge tasks requiring deep reasoning.
Mar-30-2025
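The "judge-wise, outcome-driven reward" from the abstract can be pictured with a minimal sketch: the judge model emits a score for each of two candidate answers, and the reward checks whether the implied ranking agrees with the human preference label. Everything below (the function name, the score-extraction regex, and the exact reward values) is an illustrative assumption, not the paper's actual reward design.

```python
import re

def judge_reward(completion: str, preferred: str) -> float:
    """Reward one judge completion against a human preference label.

    completion: the judge model's output, assumed (for this sketch) to end
                with two integer scores, e.g. "... Scores: A=8, B=3".
    preferred:  "A" or "B", the ground-truth better answer.
    """
    scores = re.findall(r"\d+", completion)
    if len(scores) < 2:
        # Format term: penalize outputs that fail to emit two scores.
        return -1.0
    score_a, score_b = int(scores[-2]), int(scores[-1])
    if score_a == score_b:
        # Tie: no outcome signal either way.
        return 0.0
    predicted = "A" if score_a > score_b else "B"
    # Outcome term: reward agreement with the human preference label,
    # penalize disagreement.
    return 1.0 if predicted == preferred else -0.5

# Example: the judge ranks answer A above answer B, matching the label.
print(judge_reward("Answer A is more complete. Scores: A=8, B=3", "A"))  # 1.0
```

Because the reward depends only on the final scores, not on the reasoning text itself, the RL objective leaves the model free to discover its own judging rationales, which is the contrast the abstract draws with SFT.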