Yang, Chao
Large-scale automatic carbon ion treatment planning for head and neck cancers via parallel multi-agent reinforcement learning
Zhang, Jueye, Yang, Chao, Lai, Youfang, Li, Kai-Wen, Yan, Wenting, Xia, Yunzhou, Zhang, Haimei, Zhou, Jingjing, Yang, Gen, Lin, Chen, Li, Tian, Zhang, Yibao
Head-and-neck cancer (HNC) planning is difficult because multiple critical organs-at-risk (OARs) are close to complex targets. Intensity-modulated carbon-ion therapy (IMCT) offers superior dose conformity and OAR sparing but remains slow due to relative biological effectiveness (RBE) modeling, leading to laborious, experience-based, and often suboptimal tuning of many treatment-planning parameters (TPPs). Recent deep learning (DL) methods are limited by data bias and plan feasibility, while reinforcement learning (RL) struggles to efficiently explore the exponentially large TPP search space. We propose a scalable multi-agent RL (MARL) framework for parallel tuning of 45 TPPs in IMCT. It uses a centralized-training decentralized-execution (CTDE) QMIX backbone with Double DQN, Dueling DQN, and recurrent encoding (DRQN) for stable learning in a high-dimensional, non-stationary environment. To enhance efficiency, we (1) use compact historical DVH vectors as state inputs, (2) apply a linear action-to-value transform mapping small discrete actions to uniform parameter adjustments, and (3) design an absolute, clinically informed piecewise reward aligned with plan scores. A synchronous multi-process worker system interfaces with the PHOENIX TPS for parallel optimization and accelerated data collection. On a head-and-neck dataset (10 training, 10 testing), the method tuned 45 parameters simultaneously and produced plans comparable to or better than expert manual ones (relative plan score: RL $85.93\pm7.85\%$ vs Manual $85.02\pm6.92\%$), with significant (p-value $<$ 0.05) improvements for five OARs. The framework efficiently explores high-dimensional TPP spaces and generates clinically competitive IMCT plans through direct TPS interaction, notably improving OAR sparing.
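The exact transform and reward constants are not given in the abstract; as a rough sketch of the idea (the step size, parameter bounds, and clinical goal values below are assumptions, not the paper's settings), each agent's small discrete action can be mapped to a uniform adjustment of its TPP, and the resulting DVH metric scored by an absolute piecewise reward:

```python
import numpy as np

# Hypothetical sketch: map each agent's small discrete action {0, 1, 2} to a
# uniform adjustment of its treatment-planning parameter (TPP), then score the
# resulting DVH metric with an absolute piecewise reward. The step size,
# bounds, and clinical goal values are illustrative assumptions.
ACTION_DELTAS = np.array([-1.0, 0.0, +1.0])   # decrease / keep / increase
STEP = 0.05                                    # assumed uniform adjustment step

def apply_actions(tpp_values, actions, lo=0.01, hi=100.0):
    """Linear action-to-value transform: uniform multiplicative step per TPP."""
    new_values = tpp_values * (1.0 + STEP * ACTION_DELTAS[actions])
    return np.clip(new_values, lo, hi)

def piecewise_reward(metric, goal, hard_limit):
    """Absolute, clinically informed piecewise reward for one OAR DVH metric."""
    if metric <= goal:        # clinical goal met: full credit
        return 1.0
    if metric >= hard_limit:  # hard constraint violated: no credit
        return 0.0
    return (hard_limit - metric) / (hard_limit - goal)  # linear falloff between

# 45 TPPs, one agent each; a joint action adjusts all of them in parallel.
tpp = np.full(45, 1.0)
actions = np.random.randint(0, 3, size=45)
tpp = apply_actions(tpp, actions)
print(piecewise_reward(metric=32.0, goal=30.0, hard_limit=40.0))  # 0.8
```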
- Asia > China > Beijing > Beijing (0.05)
- Asia > China > Hong Kong (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.46)
Online time series prediction using feature adjustment
Huang, Xiannan, Qiu, Shuhan, Du, Jiayuan, Yang, Chao
Time series forecasting is important across many domains, but it faces significant challenges due to distribution shift. This issue becomes particularly pronounced in online deployment scenarios where data arrives sequentially, requiring models to adapt continually to evolving patterns. Current time series online learning methods focus on two main aspects: selecting suitable parameters to update (e.g., final layer weights or adapter modules) and devising suitable update strategies (e.g., using recent batches, replay buffers, or averaged gradients). We challenge the conventional parameter selection approach, proposing that distribution shifts stem from changes in underlying latent factors influencing the data. Consequently, updating the feature representations of these latent factors may be more effective. To address the critical problem of delayed feedback in multi-step forecasting (where true values arrive much later than predictions), we introduce ADAPT-Z (Automatic Delta Adjustment via Persistent Tracking in Z-space). ADAPT-Z utilizes an adapter module that leverages current feature representations combined with historical gradient information to enable robust parameter updates despite the delay. Extensive experiments demonstrate that our method consistently outperforms standard base models without adaptation and surpasses state-of-the-art online learning approaches across multiple datasets. The code is available at https://github.com/xiannanhuang/ADAPT-Z.
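The abstract does not spell out the adapter architecture or the delayed-update rule; a minimal sketch, assuming a linear additive adapter in feature (Z) space, a fixed feedback delay, and an exponential moving average of past gradients as the persistent tracking, looks like:

```python
import numpy as np
from collections import deque

# Minimal sketch of delayed-feedback adaptation in feature (Z) space. The
# linear additive adapter, the EMA of past gradients, and the fixed delay are
# assumptions for illustration, not the exact ADAPT-Z update.
rng = np.random.default_rng(0)
d_feat, horizon, lr, beta = 16, 24, 1e-2, 0.9

W_head = rng.normal(size=d_feat)     # frozen forecasting head
delta = np.zeros(d_feat)             # adapter: additive shift applied to features
grad_ema = np.zeros(d_feat)          # persistent tracker of historical gradients
pending = deque()                    # predictions still waiting for ground truth

def forecast(z):
    return W_head @ (z + delta)      # adjusted features -> prediction

for t in range(200):
    z_t = rng.normal(size=d_feat)    # features from the frozen backbone
    pending.append(z_t)
    _ = forecast(z_t)                # prediction issued now, graded later

    # The true value for the prediction made `horizon` steps ago arrives now.
    if len(pending) > horizon:
        z_old = pending.popleft()
        y_true = rng.normal()        # stand-in for the delayed ground truth
        # Gradient of the squared error w.r.t. the adapter shift `delta`.
        grad = 2.0 * (W_head @ (z_old + delta) - y_true) * W_head
        grad_ema = beta * grad_ema + (1.0 - beta) * grad
        delta -= lr * grad_ema       # update using the tracked gradient history
```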
- North America > United States (0.14)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
- Asia > Singapore (0.04)
- Education > Educational Setting (0.54)
- Health & Medicine (0.46)
Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback
Zhang, Xiaoying, Sun, Hao, Zhang, Yipeng, Feng, Kaituo, Lu, Chaochao, Yang, Chao, Meng, Helen
Recent advances in reinforcement learning (RL) with numerical feedback, such as scalar rewards, have significantly enhanced the complex reasoning capabilities of large language models (LLMs). Despite this success, we identify three key challenges encountered by RL with solely numerical feedback: performance plateaus, limited effectiveness of spontaneous self-reflection, and persistent failures. We then demonstrate that RL-finetuned models, even after exhibiting performance plateaus, can generate correct refinements on persistently failed problems by leveraging natural language feedback in the form of critiques. Building on this insight, we propose Critique-GRPO, an online RL framework that integrates both natural language and numerical feedback for effective policy optimization. Critique-GRPO enables LLMs to learn from initial responses and critique-guided self-refinements simultaneously while maintaining exploration. Additionally, we employ a shaping function to amplify learning from correct, especially unfamiliar, refinements and penalize incorrect ones. Extensive experiments with Qwen2.5-7B-Base, Qwen2.5-Math-7B-Base, and Qwen3-8B demonstrate that Critique-GRPO consistently outperforms supervised learning and RL-based fine-tuning methods across eight challenging mathematical, STEM, and general reasoning tasks. Specifically, Critique-GRPO improves average pass@1 scores across all compared methods by approximately +4.4% on Qwen2.5-7B-Base and +3.8% on Qwen3-8B. Notably, Critique-GRPO enables effective self-improvement through self-critiquing, achieving significant gains over GRPO, e.g., +16.7% pass@1 improvement on AIME 2024.
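The shaping function itself is not specified in the abstract; one hedged illustration, assuming correctness-gated weights that grow with how unfamiliar (low-likelihood under the current policy) a refinement is:

```python
import math

# Illustrative shaping function over critique-guided refinements. The exact
# form used in Critique-GRPO is not stated in the abstract; here correct
# refinements are amplified more the less likely they are under the current
# policy ("unfamiliar"), and incorrect ones receive a fixed negative weight.
def shaped_weight(is_correct: bool, avg_token_logprob: float,
                  base: float = 1.0, unfamiliar_bonus: float = 1.0,
                  incorrect_penalty: float = -0.5) -> float:
    if not is_correct:
        return incorrect_penalty
    # Lower policy likelihood -> higher unfamiliarity -> larger amplification.
    unfamiliarity = 1.0 - math.exp(avg_token_logprob)  # in [0, 1)
    return base + unfamiliar_bonus * unfamiliarity

print(shaped_weight(True, avg_token_logprob=-0.05))  # familiar refinement ~1.05
print(shaped_weight(True, avg_token_logprob=-3.0))   # unfamiliar refinement ~1.95
print(shaped_weight(False, avg_token_logprob=-1.0))  # incorrect refinement -0.5
```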
- North America > Canada > British Columbia > Vancouver (0.04)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- Asia > Singapore (0.04)
SafeWork-R1: Coevolving Safety and Intelligence under the AI-45$^{\circ}$ Law
Shanghai AI Lab, Bao, Yicheng, Chen, Guanxu, Chen, Mingkang, Chen, Yunhao, Chen, Chiyu, Chen, Lingjie, Chen, Sirui, Chen, Xinquan, Cheng, Jie, Cheng, Yu, Deng, Dengke, Ding, Yizhuo, Ding, Dan, Ding, Xiaoshan, Ding, Yi, Dong, Zhichen, Du, Lingxiao, Fan, Yuyu, Feng, Xinshun, Fu, Yanwei, Gao, Yuxuan, Ge, Ruijun, Gu, Tianle, Gui, Lujun, Guo, Jiaxuan, He, Qianxi, Hou, Yuenan, Hu, Xuhao, Huang, Hong, Huang, Kaichen, Huang, Shiyang, Jiang, Yuxian, Lei, Shanzhe, Li, Jie, Li, Lijun, Li, Hao, Li, Juncheng, Li, Xiangtian, Li, Yafu, Li, Lingyu, Li, Xueyan, Liang, Haotian, Liu, Dongrui, Liu, Qihua, Liu, Zhixuan, Liu, Bangwei, Liu, Huacan, Liu, Yuexiao, Liu, Zongkai, Lu, Chaochao, Lu, Yudong, Lu, Xiaoya, Lu, Zhenghao, Lv, Qitan, Ma, Caoyuan, Ma, Jiachen, Ma, Xiaoya, Ma, Zhongtian, Meng, Lingyu, Miao, Ziqi, Niu, Yazhe, Peng, Yuezhang, Pu, Yuan, Qi, Han, Qian, Chen, Qiao, Xingge, Qu, Jingjing, Qu, Jiashu, Qu, Wanying, Qu, Wenwen, Qu, Xiaoye, Ren, Qihan, Ren, Qingnan, Ren, Qingyu, Shao, Jing, Shao, Wenqi, Shao, Shuai, Shi, Dongxing, Song, Xin, Song, Xinhao, Teng, Yan, Tong, Xuan, Wang, Yingchun, Wang, Xuhong, Wang, Shujie, Wang, Xin, Wang, Yige, Wang, Yixu, Wang, Yuanfu, Wang, Futing, Wang, Ruofan, Wang, Wenjie, Wang, Yajie, Wei, Muhao, Wen, Xiaoyu, Weng, Fenghua, Wu, Yuqi, Xiong, Yingtong, Xu, Xingcheng, Yang, Chao, Yang, Yue, Yao, Yang, Ye, Yulei, Yin, Zhenyun, Yu, Yi, Zhang, Bo, Zhang, Qiaosheng, Zhang, Jinxuan, Zhang, Yexin, Zheng, Yinqiang, Zhou, Hefeng, Zhou, Zhanhui, Zhu, Pengyu, Zhu, Qingzi, Zhu, Yubo, Zhou, Bowen
We introduce SafeWork-R1, a cutting-edge multimodal reasoning model that demonstrates the coevolution of capabilities and safety. It is developed by our proposed SafeLadder framework, which incorporates large-scale, progressive, safety-oriented reinforcement learning post-training, supported by a suite of multi-principled verifiers. Unlike previous alignment methods such as RLHF that simply learn human preferences, SafeLadder enables SafeWork-R1 to develop intrinsic safety reasoning and self-reflection abilities, giving rise to safety `aha' moments. Notably, SafeWork-R1 achieves an average improvement of $46.54\%$ over its base model Qwen2.5-VL-72B on safety-related benchmarks without compromising general capabilities, and delivers state-of-the-art safety performance compared to leading proprietary models such as GPT-4.1 and Claude Opus 4. To further bolster its reliability, we implement two distinct inference-time intervention methods and a deliberative search mechanism, enforcing step-level verification. Finally, we further develop SafeWork-R1-InternVL3-78B, SafeWork-R1-DeepSeek-70B, and SafeWork-R1-Qwen2.5VL-7B. All resulting models demonstrate that safety and capability can co-evolve synergistically, highlighting the generalizability of our framework in building robust, reliable, and trustworthy general-purpose AI.
- Asia > China > Shanghai > Shanghai (0.40)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Workflow (0.93)
- Research Report > New Finding (0.46)
- Education > Educational Setting (0.48)
- Law (0.46)
Listwise Preference Alignment Optimization for Tail Item Recommendation
Li, Zihao, Yang, Chao, Zhang, Tong, Chen, Yakun, Wang, Xianzhi, Xu, Guandong, Dong, Daoyi
Preference alignment has achieved notable success on Large Language Models (LLMs) and drawn broad interest in recommendation research. Existing preference alignment methods for recommendation either require explicit reward modeling or only support pairwise preference comparison. The former incurs substantial computational costs, while the latter hinders training efficiency on negative samples. Moreover, no existing effort has explored preference alignment solutions for tail-item recommendation. To bridge the above gaps, we propose LPO4Rec, which extends the Bradley-Terry model from pairwise comparison to listwise comparison, to improve the efficiency of model training. Specifically, we derive a closed-form optimal policy to enable more efficient and effective training without explicit reward modeling. We also present an adaptive negative sampling and reweighting strategy to prioritize tail items during optimization and enhance performance in tail-item recommendations. In addition, we theoretically prove that optimizing the listwise preference optimization (LPO) loss is equivalent to maximizing the upper bound of the optimal reward. Our experiments on three public datasets show that our method outperforms 10 baselines by a large margin, achieving up to 50% performance improvement while reducing GPU memory usage by 17.9% compared with direct preference optimization (DPO) in tail-item recommendation. Our code is available at https://github.com/Yuhanleeee/LPO4Rec.
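The exact LPO objective is not reproduced here; a generic listwise extension of the Bradley-Terry comparison, in which the preferred item competes against all sampled negatives in a single softmax with optional tail-aware reweighting, could be sketched as follows (the formulation and weighting scheme are illustrative assumptions):

```python
from typing import Optional

import torch
import torch.nn.functional as F

# Generic listwise preference loss in the Bradley-Terry / Plackett-Luce spirit:
# the preferred item competes against all sampled negatives in one softmax,
# with optional per-negative reweighting (e.g., to emphasize tail items).
# This is an illustrative formulation, not necessarily the exact LPO4Rec loss.
def listwise_preference_loss(pos_score: torch.Tensor,
                             neg_scores: torch.Tensor,
                             neg_weights: Optional[torch.Tensor] = None) -> torch.Tensor:
    """
    pos_score:   (batch,)        score of the preferred (positive) item
    neg_scores:  (batch, n_neg)  scores of the sampled negative items
    neg_weights: (batch, n_neg)  optional reweighting, uniform if None
    """
    if neg_weights is not None:
        neg_scores = neg_scores + torch.log(neg_weights)  # reweight in logit space
    logits = torch.cat([pos_score.unsqueeze(1), neg_scores], dim=1)
    target = torch.zeros(logits.size(0), dtype=torch.long)  # positive at index 0
    return F.cross_entropy(logits, target)

loss = listwise_preference_loss(torch.randn(32), torch.randn(32, 16))
print(loss.item())
```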
Using Vision Language Models to Detect Students' Academic Emotion through Facial Expressions
Wang, Deliang, Yang, Chao, Chen, Gaowei
Students' academic emotions significantly influence their social behavior and learning performance. Traditional approaches to automatically and accurately analyze these emotions have predominantly relied on supervised machine learning algorithms. However, these models often struggle to generalize across different contexts, necessitating repeated cycles of data collection, annotation, and training. The emergence of Vision-Language Models (VLMs) offers a promising alternative, enabling generalization across visual recognition tasks through zero-shot prompting without requiring fine-tuning. This study investigates the potential of VLMs to analyze students' academic emotions via facial expressions in an online learning environment. We employed two VLMs, Llama-3.2-11B-Vision-Instruct and Qwen2.5-VL-7B-Instruct, to analyze 5,000 images depicting confused, distracted, happy, neutral, and tired expressions using zero-shot prompting. Preliminary results indicate that both models demonstrate moderate performance in academic facial expression recognition, with Qwen2.5-VL-7B-Instruct outperforming Llama-3.2-11B-Vision-Instruct. Notably, both models excel in identifying students' happy emotions but fail to detect distracted behavior. Additionally, Qwen2.5-VL-7B-Instruct exhibits relatively high performance in recognizing students' confused expressions, highlighting its potential for practical applications in identifying content that causes student confusion.
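The prompt wording and inference stack are not given in the abstract; a minimal sketch of the zero-shot classification setup, with a placeholder `query_vlm` standing in for whatever client serves Llama-3.2-11B-Vision-Instruct or Qwen2.5-VL-7B-Instruct:

```python
# Sketch of the zero-shot prompting setup; the exact prompt and serving code
# are not given in the abstract. `query_vlm(image_path, prompt) -> str` is a
# placeholder for a call to Llama-3.2-11B-Vision-Instruct or
# Qwen2.5-VL-7B-Instruct through whatever inference stack is available.
LABELS = ["confused", "distracted", "happy", "neutral", "tired"]

PROMPT = (
    "You are shown a photo of a student attending an online class. "
    "Classify the student's academic emotion as exactly one of: "
    + ", ".join(LABELS)
    + ". Answer with the single label only."
)

def classify_emotion(image_path: str, query_vlm) -> str:
    answer = query_vlm(image_path, PROMPT).strip().lower()
    # Fall back to 'neutral' if the model answers with an unknown label.
    return answer if answer in LABELS else "neutral"
```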
- Research Report (1.00)
- Instructional Material (0.94)
- Education > Assessment & Standards > Student Performance (0.95)
- Education > Educational Setting (0.91)
SafeCoT: Improving VLM Safety with Minimal Reasoning
Ma, Jiachen, Zhou, Zhanhui, Yang, Chao, Lu, Chaochao
Ensuring safe and appropriate responses from vision-language models (VLMs) remains a critical challenge, particularly in high-risk or ambiguous scenarios. We introduce SafeCoT, a lightweight, interpretable framework that leverages rule-based chain-of-thought (CoT) supervision to improve refusal behavior in VLMs. Unlike prior methods that rely on large-scale safety annotations or complex modeling, SafeCoT uses minimal supervision to help models reason about safety risks and make context-aware refusals. Experiments across multiple benchmarks show that SafeCoT significantly reduces overrefusal and enhances generalization, even with limited training data. Our approach offers a scalable solution for aligning VLMs with safety-critical objectives.
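The rule set and CoT templates are not described in the abstract; one hedged sketch of how minimal rule-based CoT refusal supervision could be constructed:

```python
from typing import Optional

# Hedged sketch of constructing rule-based chain-of-thought (CoT) supervision
# for refusal training. The rule set, CoT wording, and refusal template are
# assumptions; the abstract only states that SafeCoT uses minimal rule-based
# CoT supervision to guide context-aware refusals.
RISK_RULES = {
    "weapon": "the request involves weapons or instructions that enable harm",
    "self-harm": "the request relates to self-harm",
    "privacy": "the request asks to identify or expose a private individual",
}

def build_cot_example(question: str, risk_tag: Optional[str]) -> dict:
    if risk_tag is None:
        cot = "No safety rule applies, so the request can be answered normally."
        response = "(helpful answer)"
    else:
        cot = f"Rule matched: {RISK_RULES[risk_tag]}. Therefore the model should refuse."
        response = "I can't help with that, but I can share general safety information."
    return {"question": question, "chain_of_thought": cot, "response": response}

print(build_cot_example("How do I assemble the device shown in this image?", "weapon"))
```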
- Asia > China > Shanghai > Shanghai (0.04)
- Asia > Middle East > Republic of Türkiye > Karaman Province > Karaman (0.04)
- Information Technology > Security & Privacy (0.93)
- Health & Medicine > Consumer Health (0.93)
- Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.46)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.48)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Rule-Based Reasoning (0.36)
RePO: Replay-Enhanced Policy Optimization
Li, Siheng, Zhou, Zhanhui, Lam, Wai, Yang, Chao, Lu, Chaochao
Reinforcement learning (RL) is vital for optimizing large language models (LLMs). Recent Group Relative Policy Optimization (GRPO) estimates advantages using multiple on-policy outputs per prompt, leading to high computational costs and low data efficiency. To address this, we introduce Replay-Enhanced Policy Optimization (RePO), which leverages diverse replay strategies to retrieve off-policy samples from a replay buffer, allowing policy optimization based on a broader and more diverse set of samples for each prompt. Experiments on five LLMs across seven mathematical reasoning benchmarks demonstrate that RePO achieves absolute average performance gains of $18.4$ and $4.1$ points for Qwen2.5-Math-1.5B and Qwen3-1.7B, respectively, compared to GRPO. Further analysis indicates that RePO increases computational cost by $15\%$ while raising the number of effective optimization steps by $48\%$ for Qwen3-1.7B, with both on-policy and off-policy sample numbers set to $8$. The repository can be accessed at https://github.com/SihengLi99/RePO.
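The retrieval strategies and buffer design are not detailed in the abstract; a minimal sketch, assuming a per-prompt buffer, a simple "most recent" retrieval rule, and GRPO-style group-normalized advantages over the combined on- and off-policy set:

```python
import random
from collections import defaultdict

# Sketch of the replay idea: for each prompt, combine fresh on-policy rollouts
# with off-policy responses retrieved from a per-prompt replay buffer, then
# compute GRPO-style group-normalized advantages over the combined group.
# Buffer capacity, the "most recent" retrieval rule, and the group sizes are
# illustrative assumptions.
class ReplayBuffer:
    def __init__(self, capacity_per_prompt: int = 64):
        self.capacity = capacity_per_prompt
        self.store = defaultdict(list)       # prompt -> [(response, reward), ...]

    def add(self, prompt, response, reward):
        bucket = self.store[prompt]
        bucket.append((response, reward))
        if len(bucket) > self.capacity:
            bucket.pop(0)

    def retrieve(self, prompt, k):
        return self.store[prompt][-k:]       # simple "most recent" replay strategy

def group_advantages(rewards):
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    return [(r - mean) / std for r in rewards]

buffer = ReplayBuffer()
prompt = "Prove that the sum of two even numbers is even."
for step in range(2):
    on_policy = [(f"step{step}_resp{i}", random.random()) for i in range(8)]  # rollouts
    off_policy = buffer.retrieve(prompt, k=8)      # empty on the first step
    group = on_policy + off_policy                 # 8 on-policy + up to 8 replayed
    advantages = group_advantages([reward for _, reward in group])
    for response, reward in on_policy:             # store rollouts for later reuse
        buffer.add(prompt, response, reward)
print(len(group), round(sum(advantages), 6))       # 16 samples, advantages sum to ~0
```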
Mitigating Object Hallucination via Robust Local Perception Search
Gao, Zixian, Yang, Chao, Zhou, Zhanhui, Xu, Xing, Lu, Chaochao
Recent advancements in Multimodal Large Language Models (MLLMs) have enabled them to effectively integrate vision and language, addressing a variety of downstream tasks. However, despite their significant success, these models still exhibit hallucination phenomena, where the outputs appear plausible but do not align with the content of the images. To mitigate this issue, we introduce Local Perception Search (LPS), a decoding method during inference that is both simple and training-free, yet effectively suppresses hallucinations. This method leverages local visual prior information as a value function to correct the decoding process. Additionally, we observe that the impact of the local visual prior on model performance is more pronounced in scenarios with high levels of image noise. Notably, LPS is a plug-and-play approach that is compatible with various models. Extensive experiments on widely used hallucination benchmarks and noisy data demonstrate that LPS significantly reduces the incidence of hallucinations compared to the baseline, showing exceptional performance, particularly in noisy settings.
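The abstract does not give the exact value function or search procedure; a hedged sketch of value-guided decoding, where candidate tokens are rescored by a hypothetical local perception value with an assumed mixing weight:

```python
import numpy as np

# Hedged sketch of value-guided decoding: top candidate tokens from the language
# model are rescored by a value derived from local visual evidence. Both the
# mixing weight `alpha` and `local_perception_value` are hypothetical stand-ins;
# the abstract does not give the exact formulation.
def rescore_candidates(lm_logprobs: np.ndarray,
                       candidate_token_ids: np.ndarray,
                       local_perception_value,
                       alpha: float = 0.5) -> int:
    """Pick the candidate whose combined LM + alpha * visual-value score is best."""
    visual = np.array([local_perception_value(tok) for tok in candidate_token_ids])
    combined = lm_logprobs + alpha * visual
    return int(candidate_token_ids[int(np.argmax(combined))])

# Toy usage: tokens supported by local visual evidence get a positive value.
supported = {101, 204}
value_fn = lambda tok: 1.0 if tok in supported else -1.0
print(rescore_candidates(np.log([0.5, 0.3, 0.2]), np.array([77, 101, 204]), value_fn))
```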
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.94)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.71)
VLMs Can Aggregate Scattered Training Patches
Zhou, Zhanhui, Chen, Lingjie, Yang, Chao, Lu, Chaochao
One way to mitigate risks in vision-language models (VLMs) is to remove dangerous samples in their training data. However, such data moderation can be easily bypassed when harmful images are split into small, benign-looking patches, scattered across many training samples. VLMs may then learn to piece these fragments together during training and generate harmful responses at inference, either from full images or text references. For instance, if trained on image patches from a bloody scene paired with the description "safe," VLMs may later describe the full image, or a text reference to the scene, as "safe." We define the core ability of VLMs enabling this attack as $\textit{visual stitching}$ -- the ability to integrate visual information spread across multiple training samples that share the same textual descriptions. In our work, we first demonstrate visual stitching abilities in common open-source VLMs on three datasets where each image is labeled with a unique synthetic ID: we split each $(\texttt{image}, \texttt{ID})$ pair into $\{(\texttt{patch}, \texttt{ID})\}$ pairs at different granularities for finetuning, and we find that tuned models can verbalize the correct IDs from full images or text references. Building on this, we simulate the adversarial data poisoning scenario mentioned above by using patches from dangerous images and replacing IDs with text descriptions like "safe" or "unsafe", demonstrating how harmful content can evade moderation in patches and later be reconstructed through visual stitching, posing serious VLM safety risks. Code is available at https://github.com/ZHZisZZ/visual-stitching.
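As a concrete illustration of the $(\texttt{image}, \texttt{ID}) \to \{(\texttt{patch}, \texttt{ID})\}$ construction (grid size and file handling are assumptions; the linked repository has the authors' exact setup):

```python
from PIL import Image

# Sketch of the (image, ID) -> {(patch, ID)} construction: each image is cut
# into an n x n grid and every patch inherits the image's ID as its caption.
# The grid size and file handling are assumptions; the linked repository has
# the authors' exact setup.
def split_into_patch_pairs(image_path: str, image_id: str, grid: int = 3):
    img = Image.open(image_path)
    w, h = img.size
    pw, ph = w // grid, h // grid
    pairs = []
    for row in range(grid):
        for col in range(grid):
            box = (col * pw, row * ph, (col + 1) * pw, (row + 1) * ph)
            # Every patch shares the same textual description (the synthetic ID).
            pairs.append((img.crop(box), image_id))
    return pairs

# e.g. 9 (patch, ID) finetuning pairs at 3 x 3 granularity:
# pairs = split_into_patch_pairs("example.jpg", "ID_0427")
```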