Law
Personalized Safety Alignment for Text-to-Image Diffusion Models
Lei, Yu, Bai, Jinbin, Shi, Qingyu, Feng, Aosong, Yu, Kaidong
Text-to-image diffusion models have revolutionized visual content generation, but current safety mechanisms apply uniform standards that often fail to account for individual user preferences. These models overlook the diverse safety boundaries shaped by factors like age, mental health, and personal beliefs. To address this, we propose Personalized Safety Alignment (PSA), a framework that allows user-specific control over safety behaviors in generative models. PSA integrates personalized user profiles into the diffusion process, adjusting the model's behavior to match individual safety preferences while preserving image quality. We introduce a new dataset, Sage, which captures user-specific safety preferences and incorporates these profiles through a cross-attention mechanism. Experiments show that PSA outperforms existing methods in harmful content suppression and aligns generated content better with user constraints, achieving higher Win Rate and Pass Rate scores.
Modular Delta Merging with Orthogonal Constraints: A Scalable Framework for Continual and Reversible Model Composition
Khan, Haris, Asif, Sadia, Asif, Shumaila
In real-world machine learning deployments, models must be continually updated, composed, and when required, selectively undone. However, existing approaches to model merging and continual learning often suffer from task interference, catastrophic forgetting, or lack of reversibility. We propose Modular Delta Merging with Orthogonal Constraints (MDM-OC), a novel framework that enables scalable, interference-free, and reversible composition of fine-tuned models. Each task-specific model is encoded as a delta from a shared base and projected into an orthogonal subspace to eliminate conflict. These projected deltas are then merged via gradient-based optimization to form a unified model that retains performance across tasks. Our approach supports continual integration of new models, structured unmerging for compliance such as GDPR requirements, and model stability via elastic weight consolidation and synthetic replay. Extensive experiments on vision and natural language processing benchmarks demonstrate that MDM-OC outperforms prior baselines in accuracy, backward transfer, and unmerge fidelity, while remaining memory-efficient and computationally tractable. This framework offers a principled solution for modular and compliant AI system design.
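The delta-and-projection idea in this abstract can be illustrated with a minimal sketch. It assumes models are flat weight vectors, uses Gram-Schmidt as the orthogonal projection, and replaces the gradient-based merge described in the abstract with plain summation. The function names (`orthogonalize`, `merge`, `unmerge`) are hypothetical and not taken from the paper.

```python
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthogonalize(delta: Vector, accepted: List[Vector]) -> Vector:
    """Remove from `delta` its components along already accepted deltas."""
    out = list(delta)
    for prev in accepted:
        norm_sq = dot(prev, prev)
        if norm_sq > 1e-12:  # skip (near-)zero directions
            coeff = dot(out, prev) / norm_sq
            out = [o - coeff * p for o, p in zip(out, prev)]
    return out


def merge(base: Vector, task_weights: List[Vector]) -> Tuple[Vector, List[Vector]]:
    """Return the merged weights and the projected deltas kept for unmerging.

    Assumption: summation stands in for the gradient-based merge in the abstract.
    """
    projected: List[Vector] = []
    merged = list(base)
    for weights in task_weights:
        delta = [w - b for w, b in zip(weights, base)]
        proj = orthogonalize(delta, projected)
        projected.append(proj)
        merged = [m + p for m, p in zip(merged, proj)]
    return merged, projected


def unmerge(merged: Vector, projected_delta: Vector) -> Vector:
    """Undo one task by subtracting its stored projected delta."""
    return [m - p for m, p in zip(merged, projected_delta)]


base = [1.0, 1.0, 1.0]
task_a = [2.0, 1.0, 1.0]   # delta from base: [1, 0, 0]
task_b = [2.0, 3.0, 1.0]   # delta from base: [1, 2, 0], overlaps with task_a
merged, deltas = merge(base, [task_a, task_b])
without_a = unmerge(merged, deltas[0])
```

Because each stored delta is orthogonal to the others, subtracting one leaves the remaining contributions unchanged, which is the property that makes structured unmerging possible in this simplified setting.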
SafeWork-R1: Coevolving Safety and Intelligence under the AI-45$^{\circ}$ Law
Lab, Shanghai AI, Bao, Yicheng, Chen, Guanxu, Chen, Mingkang, Chen, Yunhao, Chen, Chiyu, Chen, Lingjie, Chen, Sirui, Chen, Xinquan, Cheng, Jie, Cheng, Yu, Deng, Dengke, Ding, Yizhuo, Ding, Dan, Ding, Xiaoshan, Ding, Yi, Dong, Zhichen, Du, Lingxiao, Fan, Yuyu, Feng, Xinshun, Fu, Yanwei, Gao, Yuxuan, Ge, Ruijun, Gu, Tianle, Gui, Lujun, Guo, Jiaxuan, He, Qianxi, Hou, Yuenan, Hu, Xuhao, Huang, Hong, Huang, Kaichen, Huang, Shiyang, Jiang, Yuxian, Lei, Shanzhe, Li, Jie, Li, Lijun, Li, Hao, Li, Juncheng, Li, Xiangtian, Li, Yafu, Li, Lingyu, Li, Xueyan, Liang, Haotian, Liu, Dongrui, Liu, Qihua, Liu, Zhixuan, Liu, Bangwei, Liu, Huacan, Liu, Yuexiao, Liu, Zongkai, Lu, Chaochao, Lu, Yudong, Lu, Xiaoya, Lu, Zhenghao, Lv, Qitan, Ma, Caoyuan, Ma, Jiachen, Ma, Xiaoya, Ma, Zhongtian, Meng, Lingyu, Miao, Ziqi, Niu, Yazhe, Peng, Yuezhang, Pu, Yuan, Qi, Han, Qian, Chen, Qiao, Xingge, Qu, Jingjing, Qu, Jiashu, Qu, Wanying, Qu, Wenwen, Qu, Xiaoye, Ren, Qihan, Ren, Qingnan, Ren, Qingyu, Shao, Jing, Shao, Wenqi, Shao, Shuai, Shi, Dongxing, Song, Xin, Song, Xinhao, Teng, Yan, Tong, Xuan, Wang, Yingchun, Wang, Xuhong, Wang, Shujie, Wang, Xin, Wang, Yige, Wang, Yixu, Wang, Yuanfu, Wang, Futing, Wang, Ruofan, Wang, Wenjie, Wang, Yajie, Wei, Muhao, Wen, Xiaoyu, Weng, Fenghua, Wu, Yuqi, Xiong, Yingtong, Xu, Xingcheng, Yang, Chao, Yang, Yue, Yao, Yang, Ye, Yulei, Yin, Zhenyun, Yu, Yi, Zhang, Bo, Zhang, Qiaosheng, Zhang, Jinxuan, Zhang, Yexin, Zheng, Yinqiang, Zhou, Hefeng, Zhou, Zhanhui, Zhu, Pengyu, Zhu, Qingzi, Zhu, Yubo, Zhou, Bowen
We introduce SafeWork-R1, a cutting-edge multimodal reasoning model that demonstrates the coevolution of capabilities and safety. It is developed by our proposed SafeLadder framework, which incorporates large-scale, progressive, safety-oriented reinforcement learning post-training, supported by a suite of multi-principled verifiers. Unlike previous alignment methods such as RLHF that simply learn human preferences, SafeLadder enables SafeWork-R1 to develop intrinsic safety reasoning and self-reflection abilities, giving rise to safety "aha" moments. Notably, SafeWork-R1 achieves an average improvement of $46.54\%$ over its base model Qwen2.5-VL-72B on safety-related benchmarks without compromising general capabilities, and delivers state-of-the-art safety performance compared to leading proprietary models such as GPT-4.1 and Claude Opus 4. To further bolster its reliability, we implement two distinct inference-time intervention methods and a deliberative search mechanism, enforcing step-level verification. Finally, we further develop SafeWork-R1-InternVL3-78B, SafeWork-R1-DeepSeek-70B, and SafeWork-R1-Qwen2.5VL-7B. All resulting models demonstrate that safety and capability can co-evolve synergistically, highlighting the generalizability of our framework in building robust, reliable, and trustworthy general-purpose AI.
McBE: A Multi-task Chinese Bias Evaluation Benchmark for Large Language Models
Lan, Tian, Su, Xiangdong, Liu, Xu, Wang, Ruirui, Chang, Ke, Li, Jiang, Gao, Guanglai
As large language models (LLMs) are increasingly applied to various NLP tasks, their inherent biases are gradually disclosed. Therefore, measuring biases in LLMs is crucial to mitigating their ethical risks. However, most existing bias evaluation datasets focus on English and North American culture, and their bias categories are not fully applicable to other cultures. Datasets grounded in the Chinese language and culture are scarce. More importantly, these datasets usually support only a single evaluation task and cannot evaluate bias in LLMs from multiple aspects. To address these issues, we present a Multi-task Chinese Bias Evaluation Benchmark (McBE) that includes 4,077 bias evaluation instances, covering 12 single bias categories and 82 subcategories and introducing 5 evaluation tasks, providing extensive category coverage, content diversity, and comprehensive measurement. Additionally, we evaluate several popular LLMs from different series and with different parameter sizes. In general, all these LLMs demonstrated varying degrees of bias. We conduct an in-depth analysis of the results, offering novel insights into bias in LLMs.
Multi-level Value Alignment in Agentic AI Systems: Survey and Perspectives
Zeng, Wei, Zhu, Hengshu, Qin, Chuan, Wu, Han, Cheng, Yihang, Zhang, Sirui, Jin, Xiaowei, Shen, Yinuo, Wang, Zhenxing, Zhong, Feimin, Xiong, Hui
The ongoing evolution of AI paradigms has propelled AI research into the agentic AI stage. Consequently, the focus of research has shifted from single agents and simple applications towards multi-agent autonomous decision-making and task collaboration in complex environments. As Large Language Models (LLMs) advance, their applications become more diverse and complex, leading to increasing situational and systemic risks. This has brought significant attention to value alignment for agentic AI systems, which aims to ensure that an agent's goals, preferences, and behaviors align with human values and societal norms. Addressing socio-governance demands through a Multi-level Value framework, this study comprehensively reviews value alignment in LLM-based multi-agent systems as the representative archetype of agentic AI systems. Our survey systematically examines three interconnected dimensions: First, value principles are structured via a top-down hierarchy across macro, meso, and micro levels. Second, application scenarios are categorized along a general-to-specific continuum explicitly mirroring these value tiers. Third, value alignment methods and evaluation are mapped to this tiered framework through systematic examination of benchmarking datasets and relevant methodologies. Additionally, we delve into value coordination among multiple agents within agentic AI systems. Finally, we propose several potential research directions in this field.
JULI: Jailbreak Large Language Models by Self-Introspection
Wang, Jesson, Hu, Zhanhao, Wagner, David
Large Language Models (LLMs) are trained with safety alignment to prevent generating malicious content. Although some attacks have highlighted vulnerabilities in these safety-aligned LLMs, they typically have limitations, such as requiring access to the model weights or the generation process. Since proprietary models accessed through API calls do not grant users such permissions, these attacks struggle to compromise them. In this paper, we propose Jailbreaking Using LLM Introspection (JULI), which jailbreaks LLMs by manipulating the token log probabilities, using a tiny plug-in block, BiasNet. JULI relies solely on knowledge of the target LLM's predicted token log probabilities. It can effectively jailbreak API-calling LLMs in a black-box setting while knowing only the top-$5$ token log probabilities. Our approach demonstrates superior effectiveness, outperforming existing state-of-the-art (SOTA) approaches across multiple metrics.
Can LLMs Be Trusted for Evaluating RAG Systems? A Survey of Methods and Datasets
Brehme, Lorenz, Ströhle, Thomas, Breu, Ruth
Retrieval-Augmented Generation (RAG) has advanced significantly in recent years. The complexity of RAG systems, which involve multiple components (such as indexing, retrieval, and generation) along with numerous other parameters, poses substantial challenges for systematic evaluation and quality enhancement. Previous research highlights that evaluating RAG systems is essential for documenting advancements, comparing configurations, and identifying effective approaches for domain-specific applications. This study systematically reviews 63 academic articles to provide a comprehensive overview of state-of-the-art RAG evaluation methodologies, focusing on four key areas: datasets, retrievers, indexing and databases, and the generator component. We observe the feasibility of an automated evaluation approach for each component of a RAG system, leveraging an LLM capable of both generating evaluation datasets and conducting evaluations. In addition, we found that further practical research is essential to provide companies with clear guidance on the do's and don'ts of implementing and evaluating RAG systems. By synthesizing evaluation approaches for key RAG components and emphasizing the creation and adaptation of domain-specific datasets for benchmarking, we contribute to the advancement of systematic evaluation methods and the improvement of evaluation rigor for RAG systems. Furthermore, by examining the interplay between automated approaches leveraging LLMs and human judgment, we contribute to the ongoing discourse on balancing automation and human input, clarifying their respective contributions, limitations, and challenges in achieving robust and reliable evaluations.
In recent years, Large Language Models (LLMs) have made significant progress in research and have grown increasingly popular [1]. However, LLMs face several challenges, including issues with hallucinations caused by insufficient context [2], as well as limitations in their learned content, which prevent them from addressing questions requiring specific or proprietary information [1].
Leveraging Deep Learning for Physical Model Bias of Global Air Quality Estimates
Doerksen, Kelsey, Marchetti, Yuliya, Bowman, Kevin, Lu, Steven, Montgomery, James, Gal, Yarin, Kalaitzis, Freddie, Miyazaki, Kazuyuki
Air pollution is the world's largest environmental risk factor for human disease and premature death, resulting in more than 6 million premature deaths in 2019. Modeling one of the most important air pollutants, surface ozone, remains a challenge, particularly at scales relevant for human health impacts. The drivers of global ozone trends at these scales are largely unknown, which limits the practical use of physics-based models. We employ a 2D Convolutional Neural Network-based architecture that estimates surface ozone MOMO-Chem model residuals, referred to as model bias. We demonstrate the potential of this technique in North America and Europe, highlighting its ability to better capture physical model residuals compared to a traditional machine learning method. We assess the impact of incorporating land use information from high-resolution satellite imagery to improve model estimates. Importantly, we discuss how our results can improve our scientific understanding of the factors impacting ozone bias at urban scales, which can be used to improve environmental policy.
Uncertainty Quantification for Surface Ozone Emulators using Deep Learning
Doerksen, Kelsey, Marchetti, Yuliya, Lu, Steven, Bowman, Kevin, Montgomery, James, Miyazaki, Kazuyuki, Gal, Yarin, Kalaitzis, Freddie
Air pollution is a global hazard, and as of 2023, 94\% of the world's population is exposed to unsafe pollution levels. Surface Ozone (O3), an important pollutant, and the drivers of its trends are difficult to model, and traditional physics-based models fall short in their practical use for scales relevant to human-health impacts. Deep Learning-based emulators have shown promise in capturing complex climate patterns, but overall lack the interpretability necessary to support critical decision making for policy changes and public health measures. We implement an uncertainty-aware U-Net architecture to predict the Multi-mOdel Multi-cOnstituent Chemical data assimilation (MOMO-Chem) model's surface ozone residuals (bias) using Bayesian and quantile regression methods. We demonstrate the capability of our techniques in regional estimation of bias in North America and Europe for June 2019. We highlight the uncertainty quantification (UQ) scores between our two UQ methodologies and discern which ground stations are optimal and sub-optimal candidates for MOMO-Chem bias correction, and evaluate the impact of land-use information in surface ozone residual modeling.
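For readers unfamiliar with quantile regression, the sketch below shows the pinball (quantile) loss, which is the standard training objective for that family of methods. The abstract does not state which loss the authors use, so this is an assumption for illustration only, and the function name `pinball_loss` is hypothetical.

```python
from typing import Sequence


def pinball_loss(y_true: Sequence[float], y_pred: Sequence[float], q: float) -> float:
    """Mean pinball loss for quantile level q in (0, 1).

    Under-predictions are weighted by q and over-predictions by (1 - q),
    so minimizing this loss pushes y_pred toward the q-th quantile.
    """
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        total += max(q * diff, (q - 1.0) * diff)
    return total / len(y_true)


# Illustrative residual values only; these are not data from the paper.
under = pinball_loss([10.0], [8.0], q=0.9)   # prediction too low
over = pinball_loss([10.0], [12.0], q=0.9)   # prediction too high
```

Training one output at a low quantile and one at a high quantile (for example 0.05 and 0.95) yields a prediction interval whose width serves as an uncertainty estimate for each predicted residual.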
Learning to Reason for Factuality
Chen, Xilun, Kulikov, Ilia, Berges, Vincent-Pierre, Oğuz, Barlas, Shao, Rulin, Ghosh, Gargi, Weston, Jason, Yih, Wen-tau
Reasoning Large Language Models (R-LLMs) have significantly advanced complex reasoning tasks but often struggle with factuality, generating substantially more hallucinations than their non-reasoning counterparts on long-form factuality benchmarks. However, extending online Reinforcement Learning (RL), a key component in recent R-LLM advancements, to the long-form factuality setting poses several unique challenges due to the lack of reliable verification methods. Previous work has utilized automatic factuality evaluation frameworks such as FActScore to curate preference data in the offline RL setting, yet we find that directly leveraging such methods as the reward in online RL leads to reward hacking in multiple ways, such as producing less detailed or relevant responses. We propose a novel reward function that simultaneously considers factual precision, response detail level, and answer relevance, and apply online RL to learn high-quality factual reasoning. Evaluated on six long-form factuality benchmarks, our factual reasoning model achieves an average reduction of 23.1 percentage points in hallucination rate, a 23% increase in answer detail level, and no degradation in the overall response helpfulness.
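To make the reward-hacking concern concrete, here is a purely hypothetical sketch of a reward that combines the three signals named in the abstract. The abstract does not give the actual formula, so the functional form, the `detail_weight` parameter, and the fixed penalty are all assumptions chosen for illustration.

```python
import math


def factuality_reward(num_supported: int, num_claims: int,
                      is_relevant: bool, detail_weight: float = 0.1) -> float:
    """Hypothetical composite reward: precision plus a detail bonus, gated by relevance.

    num_supported: claims judged factually supported by some verifier (assumed given)
    num_claims:    total claims extracted from the response
    """
    if num_supported < 0 or num_supported > num_claims:
        raise ValueError("num_supported must be between 0 and num_claims")
    if not is_relevant or num_claims == 0:
        return -1.0  # off-topic or empty answers get a fixed penalty
    precision = num_supported / num_claims
    # Log-scaled bonus rewards more supported detail with diminishing returns.
    detail_bonus = detail_weight * math.log1p(num_supported)
    return precision + detail_bonus


terse = factuality_reward(num_supported=2, num_claims=2, is_relevant=True)
detailed = factuality_reward(num_supported=18, num_claims=20, is_relevant=True)
off_topic = factuality_reward(num_supported=5, num_claims=5, is_relevant=False)
```

A precision-only reward would prefer the terse answer (1.0 versus 0.9), which is the "less detailed" failure mode the abstract describes. Under this sketch the detailed answer scores higher, and an accurate but irrelevant answer is penalized.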