Leveraging Group Relative Policy Optimization to Advance Large Language Models in Traditional Chinese Medicine
Xie, Jiacheng, Zeng, Shuai, Yu, Yang, Tang, Xiaoting, An, Guanghui, Xu, Dong
arXiv.org Artificial Intelligence
Traditional Chinese Medicine (TCM) presents a rich and structurally unique knowledge system that challenges conventional applications of large language models (LLMs). Although previous TCM-specific LLMs have shown progress through supervised fine-tuning, they often face limitations in alignment, data quality, and evaluation consistency. In this study, we introduce Ladder-base, the first TCM-focused LLM trained with Group Relative Policy Optimization (GRPO), a reinforcement learning method that improves reasoning and factual consistency by optimizing response selection based on intra-group comparisons. Ladder-base is built upon the Qwen2.5-7B-Instruct foundation model and trained exclusively on the textual subset of the TCM-Ladder benchmark, using 80 percent of the data for training and the remaining 20 percent split evenly between validation and test sets. Through standardized evaluation, Ladder-base demonstrates superior performance across multiple reasoning metrics when compared to both state-of-the-art general-purpose LLMs, such as GPT-4, Gemini 2.5, Claude 3, and Qwen3, and domain-specific TCM models, including BenTsao, HuatuoGPT2, and Zhongjing. These findings suggest that GRPO provides an effective and efficient strategy for aligning LLMs with expert-level reasoning in traditional medical domains and supports the development of trustworthy and clinically grounded TCM artificial intelligence systems.
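The two mechanical ideas in the abstract, intra-group comparison and the 80/10/10 data split, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The advantage formula (reward minus group mean, divided by group standard deviation) is the commonly described GRPO formulation and is assumed here, since the abstract does not spell it out. The function names, the `eps` constant, and the `seed` parameter are hypothetical.

```python
import random
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled response relative to its own group.

    Assumption: advantages are the group's rewards standardized by the
    group mean and (population) standard deviation; `eps` avoids a
    division by zero when all rewards in the group are equal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def split_80_10_10(items, seed=0):
    """Shuffle and split into 80% train, 10% validation, 10% test.

    The proportions follow the abstract; the shuffling strategy and
    the `seed` parameter are illustrative assumptions.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_val = (len(shuffled) - n_train) // 2
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Four hypothetical responses to one question, rewarded 1 if correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
train, val, test = split_80_10_10(range(100))
```

In this sketch, responses that beat their group's average receive a positive advantage and the rest a negative one, so no separate value model is needed to produce a baseline.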
Oct-21-2025
- Country:
- Asia > China
- North America > United States
- Missouri > Boone County > Columbia (0.15)
- Genre:
- Research Report > New Finding (0.69)
- Industry:
- Health & Medicine > Diagnostic Medicine (0.95)
- Technology: