CrowdVLM-R1: Expanding R1 Ability to Vision Language Model for Crowd Counting using Fuzzy Group Relative Policy Reward

Zhiqiang Wang, Pengbin Feng, Yanbin Lin, Shuzhang Cai, Zongao Bian, Jinghua Yan, Xingquan Zhu


1st Zhiqiang Wang, Florida Atlantic University, Boca Raton, USA (zwang2022@fau.edu)
2nd Pengbin Feng, University of Southern California, Los Angeles, USA (fengpengbin.apply@gmail.com)

Abstract--We propose CrowdVLM-R1, which extends the R1 base model to accurate crowd counting through a novel framework built around a fuzzy group relative policy reward (FGRPR) that improves learning efficiency. Unlike the conventional binary (0/1) accuracy reward, FGRPR combines a format reward with a precision reward, providing nuanced incentives that steer the model's policy toward precise outputs. Supervised fine-tuning (SFT) is also integrated so that CrowdVLM-R1 can learn from a handful of examples and count in both in-domain and out-of-domain settings. Experimental results demonstrate that GRPO with the standard binary accuracy reward underperforms SFT. In contrast, FGRPR applied to Qwen2.5-VL (3B/7B) surpasses all baseline models, including GPT-4o, LLaMA2-70B, and SFT, on five in-domain datasets. On out-of-domain datasets, FGRPR performs comparably to SFT but excels when target values are larger, because its fuzzy reward function assigns higher rewards to closer approximations. This approach is broadly applicable to tasks where the precision of the answer is critical.

I. INTRODUCTION

Recently, DeepSeek R1 [1] has drawn much attention among advances in large language models (LLMs), as it demonstrates how reinforcement learning (RL) can be the primary driver of reasoning.
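To make the binary-versus-fuzzy contrast from the abstract concrete, the following Python sketch compares a conventional 0/1 accuracy reward with a fuzzy precision reward of the kind FGRPR uses. The exact FGRPR formulation does not appear in this excerpt, so the decay-with-relative-error form, the helper names, and the weight w_format below are illustrative assumptions rather than the paper's definition.

    # Minimal sketch contrasting a binary accuracy reward with a fuzzy
    # precision reward in the spirit of FGRPR. The relative-error form and
    # the weight `w_format` are illustrative assumptions.

    def binary_accuracy_reward(pred: int, target: int) -> float:
        """Conventional 0/1 reward: full credit only for an exact match."""
        return 1.0 if pred == target else 0.0

    def fuzzy_precision_reward(pred: int, target: int) -> float:
        """Partial credit that grows as the prediction approaches the target.

        Uses relative error |pred - target| / max(target, 1), clipped to
        [0, 1], so larger targets tolerate larger absolute errors, which is
        consistent with the abstract's note that FGRPR helps most when
        target values are large.
        """
        relative_error = abs(pred - target) / max(target, 1)
        return max(0.0, 1.0 - relative_error)

    def total_reward(pred, target: int, format_ok: bool,
                     w_format: float = 0.5) -> float:
        """Combine a format reward (answer parsable / in the expected
        template) with the fuzzy precision reward. Weighting is a
        placeholder, not the paper's value."""
        format_reward = 1.0 if format_ok else 0.0
        precision = fuzzy_precision_reward(pred, target) if pred is not None else 0.0
        return w_format * format_reward + precision

    if __name__ == "__main__":
        # Target crowd count 100: a near miss of 98 earns 0 under the
        # binary reward but 0.98 under the fuzzy precision reward.
        print(binary_accuracy_reward(98, 100))        # 0.0
        print(fuzzy_precision_reward(98, 100))        # 0.98
        print(total_reward(98, 100, format_ok=True))  # 1.48

Under the binary reward, a prediction of 98 against a target of 100 receives no learning signal, whereas the fuzzy reward grades it nearly as well as an exact answer; this graded signal is what lets GRPO-style policy updates move the model toward closer approximations.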