The Alignment Paradox of Medical Large Language Models in Infertility Care: Decoupling Algorithmic Improvement from Clinical Decision-making Quality

Liu, Dou; Long, Ying; Zuoqiu, Sophia; Xie, Kaipeng; Yang, Runze; Liu, Di; Li, Kang; Lin, Yiting; Liu, Hanyi; Yin, Rong; Tang, Tian

Nov-25-2025, arXiv.org Artificial Intelligence

Large language models (LLMs) are increasingly adopted in clinical decision support, yet aligning them with the multifaceted reasoning pathways of real-world medicine remains a major challenge. Using more than 8,000 infertility treatment records, we systematically evaluate four alignment strategies, Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), Group Relative Policy Optimization (GRPO), and In-Context Learning (ICL), through a dual-layer framework that combines automatic benchmarks with blinded doctor-in-the-loop assessments. GRPO achieves the highest algorithmic accuracy across multiple decision layers, confirming the value of reinforcement-based optimization for structured prediction tasks. However, clinicians consistently prefer the SFT model, citing clearer reasoning processes (p = 0.035) and higher therapeutic feasibility (p = 0.019). In blinded pairwise comparisons, SFT attains the highest win rate (51.2%), outperforming both GRPO (26.2%) and even the physicians' original decisions (22.7%). These results reveal an alignment paradox: algorithmic improvements do not necessarily translate into higher clinical trust and may diverge from human-centered preferences. Our findings highlight the need for alignment strategies that prioritize clinically interpretable and practically feasible reasoning rather than solely optimizing decision-level accuracy.
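The win rates quoted in the abstract can be read as each option's share of blinded clinician preferences. The abstract does not say how the authors tallied them, so the following is only a minimal sketch under one assumption: each blinded comparison yields a single preferred option. The function name `win_rates` and the toy votes are hypothetical and are not the study's data.

```python
from collections import Counter


def win_rates(preferences):
    """Return each option's share of preference votes, as a percentage.

    `preferences` is a list with one label per blinded comparison,
    naming the option the reviewer preferred (an assumed data layout).
    """
    counts = Counter(preferences)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no preferences given")
    return {name: 100.0 * n / total for name, n in counts.items()}


# Made-up toy votes for illustration only; NOT results from the paper.
toy_votes = ["SFT", "GRPO", "SFT", "Physician", "SFT", "GRPO", "SFT", "Physician"]
rates = win_rates(toy_votes)

for name, rate in sorted(rates.items(), key=lambda item: -item[1]):
    print(f"{name}: {rate:.1f}%")
```

Under this reading the shares sum to 100% up to rounding, which is consistent with the three figures reported in the abstract.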

grpo, large language model, machine learning


Country:
- Asia
  - China > Sichuan Province (0.14)
  - Middle East > Jordan (0.04)
- North America > United States
  - Kansas > Sheridan County (0.04)
  - Michigan (0.04)

Genre:
- Research Report
  - Experimental Study (1.00)
  - New Finding (1.00)

Industry:
- Health & Medicine
  - Diagnostic Medicine (1.00)
  - Therapeutic Area > Obstetrics/Gynecology (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Cognitive Science > Problem Solving (1.00)
  - Machine Learning
    - Neural Networks > Deep Learning (0.93)
    - Performance Analysis > Accuracy (0.68)
  - Natural Language > Large Language Model (1.00)