safety score
Balancing Safety and Helpfulness in Healthcare AI Assistants through Iterative Preference Alignment
Nghiem, Huy, Panda, Swetasudha, Khatwani, Devashish, Nguyen, Huy V., Kenthapadi, Krishnaram, Daumé, Hal III
Large Language Models (LLMs) are increasingly used in healthcare, yet ensuring their safety and trustworthiness remains a barrier to deployment. Conversational medical assistants must avoid unsafe compliance without over-refusing benign queries. We present an iterative post-deployment alignment framework that applies Kahneman-Tversky Optimization (KTO) and Direct Preference Optimization (DPO) to refine models against domain-specific safety signals. Using the CARES-18K benchmark for adversarial robustness, we evaluate four LLMs (Llama-3B/8B, Meditron-8B, Mistral-7B) across multiple cycles. Our results show up to 42% improvement in safety-related metrics for harmful query detection, alongside interesting trade-offs against erroneous refusals, thereby exposing architecture-dependent calibration biases. We also perform ablation studies to identify when self-evaluation is reliable and when external or finetuned judges are necessary to maximize performance gains. Our findings underscore the importance of adopting best practices that balance patient safety, user trust, and clinical utility in the design of conversational medical assistants.
- Europe > Austria > Vienna (0.14)
- North America > United States > Maryland (0.04)
AssurAI: Experience with Constructing Korean Socio-cultural Datasets to Discover Potential Risks of Generative AI
Lim, Chae-Gyun, Han, Seung-Ho, Byun, EunYoung, Han, Jeongyun, Cho, Soohyun, Joo, Eojin, Kim, Heehyeon, Kim, Sieun, Lee, Juhoon, Lee, Hyunsoo, Lee, Dongkun, Hyeon, Jonghwan, Hwang, Yechan, Lee, Young-Jun, Lee, Kyeongryul, An, Minhyeong, Ahn, Hyunjun, Son, Jeongwoo, Park, Junho, Yoon, Donggyu, Kim, Taehyung, Kim, Jeemin, Choi, Dasom, Lee, Kwangyoung, Lim, Hyunseung, Jung, Yeohyun, Hong, Jongok, Nam, Sooyohn, Park, Joonyoung, Na, Sungmin, Choi, Yubin, Choi, Jeanne, Hong, Yoojin, Jang, Sueun, Seo, Youngseok, Park, Somin, Jo, Seoungung, Chae, Wonhye, Jo, Yeeun, Kim, Eunyoung, Whang, Joyce Jiyoung, Hong, HwaJung, Seering, Joseph, Lee, Uichin, Kim, Juho, Choi, Sunna, Ko, Seokyeon, Kim, Taeho, Kim, Kyunghoon, Ha, Myungsik, Lee, So Jung, Hwang, Jemin, Kwak, JoonHo, Choi, Ho-Jin
The rapid evolution of generative AI necessitates robust safety evaluations. However, current safety datasets are predominantly English-centric, failing to capture specific risks in non-English, socio-cultural contexts such as Korean, and are often limited to the text modality. To address this gap, we introduce AssurAI, a new quality-controlled Korean multimodal dataset for evaluating the safety of generative AI. First, we define a taxonomy of 35 distinct AI risk factors, adapted from established frameworks by a multidisciplinary expert group to cover both universal harms and harms relevant to the Korean socio-cultural context. Second, leveraging this taxonomy, we construct and release AssurAI, a large-scale Korean multimodal dataset comprising 11,480 instances across text, image, video, and audio. Third, we apply a rigorous quality control process to ensure data integrity, featuring a two-phase construction (i.e., expert-led seeding and crowdsourced scaling), triple independent annotation, and an iterative expert red-teaming loop. Our pilot study validates AssurAI's effectiveness in assessing the safety of recent LLMs. We release AssurAI to the public to facilitate the development of safer and more reliable generative AI systems for the Korean community.
- Research Report > Experimental Study (0.69)
- Research Report > New Finding (0.68)
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
- Information Technology > Security & Privacy (1.00)
- Law > Criminal Law (0.68)
- Government (0.68)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Generation (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (1.00)
EcoAlign: An Economically Rational Framework for Efficient LVLM Alignment
Cheng, Ruoxi, Ma, Haoxuan, Ma, Teng, Zhang, Hongyi
Large Vision-Language Models (LVLMs) exhibit powerful reasoning capabilities but suffer from sophisticated jailbreak vulnerabilities. Fundamentally, aligning LVLMs is not just a safety challenge but a problem of economic efficiency. Current alignment methods struggle with the trade-off between safety, utility, and operational costs. Critically, a focus solely on final outputs (process-blindness) wastes significant computational budget on unsafe deliberation. This flaw allows harmful reasoning to be disguised with benign justifications, thereby circumventing simple additive safety scores. To address this, we propose EcoAlign, an inference-time framework that reframes alignment as an economically rational search by treating the LVLM as a boundedly rational agent. EcoAlign incrementally expands a thought graph and scores actions using a forward-looking function (analogous to net present value) that dynamically weighs expected safety, utility, and cost against the remaining budget. To prevent deception, path safety is enforced via the weakest-link principle. Extensive experiments across 3 closed-source and 2 open-source models on 6 datasets show that EcoAlign matches or surpasses state-of-the-art safety and utility at a lower computational cost, thereby offering a principled, economical pathway to robust LVLM alignment.
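To illustrate why a weakest-link rule resists the disguise described in the EcoAlign abstract, the sketch below contrasts an additive (mean) path score with a minimum over steps. The step scores, the function names, and the 0.5 threshold are illustrative assumptions, not values or APIs from EcoAlign.

```python
from statistics import mean

def additive_safety(step_scores):
    """Average per-step safety; one unsafe step can be diluted by benign ones."""
    return mean(step_scores)

def weakest_link_safety(step_scores):
    """Path safety equals the safety of its least safe step."""
    return min(step_scores)

# Hypothetical reasoning path: one harmful step wrapped in benign justification.
path = [0.9, 0.9, 0.1, 0.9]
THRESHOLD = 0.5  # assumed acceptance threshold

additive_ok = additive_safety(path) >= THRESHOLD
weakest_ok = weakest_link_safety(path) >= THRESHOLD
print(additive_ok, weakest_ok)
```

Under these assumed numbers the averaged score clears the threshold while the minimum does not, which is the gap a weakest-link rule is meant to close.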
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
- Information Technology > Security & Privacy (1.00)
- Law (0.93)
- Government > Military (0.68)
Optimizing AI Agent Attacks With Synthetic Data
Loughridge, Chloe, Colognese, Paul, Griffin, Avery, Tracy, Tyler, Kutasov, Jon, Benton, Joe
As AI deployments become more complex and high-stakes, it becomes increasingly important to be able to estimate their risk. AI control is one framework for doing so. However, good control evaluations require eliciting strong attack policies. This can be challenging in complex agentic environments where compute constraints leave us data-poor. In this work, we show how to optimize attack policies in SHADE-Arena, a dataset of diverse realistic control environments. We do this by decomposing attack capability into five constituent skills -- suspicion modeling, attack selection, plan synthesis, execution and subtlety -- and optimizing each component individually. To get around the constraint of limited data, we develop a probabilistic model of attack dynamics, optimize our attack hyperparameters using this simulation, and then show that the results transfer to SHADE-Arena. This results in a substantial improvement in attack strength, reducing safety score from a baseline of 0.87 to 0.41 using our scaffold.
Check Yourself Before You Wreck Yourself: Selectively Quitting Improves LLM Agent Safety
Bonagiri, Vamshi Krishna, Kumaraguru, Ponnurangam, Nguyen, Khanh, Plaut, Benjamin
As Large Language Model (LLM) agents increasingly operate in complex environments with real-world consequences, their safety becomes critical. While uncertainty quantification is well-studied for single-turn tasks, multi-turn agentic scenarios with real-world tool access present unique challenges where uncertainties and ambiguities compound, leading to severe or catastrophic risks beyond traditional text generation failures. We propose using "quitting" as a simple yet effective behavioral mechanism for LLM agents to recognize and withdraw from situations where they lack confidence. Leveraging the ToolEmu framework, we conduct a systematic evaluation of quitting behavior across 12 state-of-the-art LLMs. Our results demonstrate a highly favorable safety-helpfulness trade-off: agents prompted to quit with explicit instructions improve safety by an average of +0.39 on a 0-3 scale across all models (+0.64 for proprietary models), while maintaining a negligible average decrease of -0.03 in helpfulness. Our analysis demonstrates that simply adding explicit quit instructions proves to be a highly effective safety mechanism that can immediately be deployed in existing agent systems, and establishes quitting as an effective first-line defense mechanism for autonomous agents in high-stakes applications.
- Information Technology > Security & Privacy (1.00)
- Banking & Finance > Trading (0.69)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)
LAD-VF: LLM-Automatic Differentiation Enables Fine-Tuning-Free Robot Planning from Formal Methods Feedback
Yang, Yunhao, Hong, Junyuan, Perin, Gabriel Jacob, Fan, Zhiwen, Yin, Li, Wang, Zhangyang, Topcu, Ufuk
Large language models (LLMs) can translate natural language instructions into executable action plans for robotics, autonomous driving, and other domains. Yet, deploying LLM-driven planning in the physical world demands strict adherence to safety and regulatory constraints, which current models often violate due to hallucination or weak alignment. Traditional data-driven alignment methods, such as Direct Preference Optimization (DPO), require costly human labeling, while recent formal-feedback approaches still depend on resource-intensive fine-tuning. In this paper, we propose LAD-VF, a fine-tuning-free framework that leverages formal verification feedback for automated prompt engineering. By introducing a formal-verification-informed text loss integrated with LLM-AutoDiff, LAD-VF iteratively refines prompts rather than model parameters. This yields three key benefits: (i) scalable adaptation without fine-tuning; (ii) compatibility with modular LLM architectures; and (iii) interpretable refinement via auditable prompts. Experiments in robot navigation and manipulation tasks demonstrate that LAD-VF substantially enhances specification compliance, improving success rates from 60% to over 90%. Our method thus presents a scalable and interpretable pathway toward trustworthy, formally-verified LLM-driven control systems.
- North America > United States > Texas > Travis County > Austin (0.14)
- South America > Brazil > São Paulo (0.04)
- Oceania > New Zealand (0.04)
- Research Report (0.64)
- Workflow (0.47)
Safety filtering of robotic manipulation under environment uncertainty: a computational approach
Johansson, Anna, Lindmark, Daniel, Wiberg, Viktor, Servin, Martin
Robotic manipulation in dynamic and unstructured environments requires safety mechanisms that exploit what is known and what is uncertain about the world. Existing safety filters often assume full observability, limiting their applicability in real-world tasks. We propose a physics-based safety filtering scheme that leverages high-fidelity simulation to assess control policies under uncertainty in world parameters. The method combines dense rollout with nominal parameters and parallelizable sparse re-evaluation at critical state-transitions, quantified through generalized factors of safety for stable grasping and actuator limits, and targeted uncertainty reduction through probing actions. We demonstrate the approach in a simulated bimanual manipulation task with uncertain object mass and friction, showing that unsafe trajectories can be identified and filtered efficiently. Our results highlight physics-based sparse safety evaluation as a scalable strategy for safe robotic manipulation under uncertainty. The growing deployment of autonomous robots beyond traditional assembly-line settings presents an increasing need for control schemes capable of handling complex and dynamic environments while guaranteeing both task success and safety.
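As a rough illustration of a factor of safety evaluated under parameter uncertainty, the sketch below checks a friction grasp against sampled values of object mass and friction. The grasp model (two opposing contacts holding the object against gravity), the parameter ranges, and all names are assumptions made for illustration; the paper's filter relies on high-fidelity simulation rather than this closed-form check.

```python
import itertools

G = 9.81  # gravitational acceleration, m/s^2

def grasp_factor_of_safety(normal_force, mu, mass):
    """Ratio of available friction force (two contacts) to object weight."""
    return (2.0 * mu * normal_force) / (mass * G)

def is_safe(normal_force, mu_values, mass_values, min_fos=1.0):
    """Accept the grasp only if every sampled (mu, mass) pair stays above min_fos.

    Returns a (verdict, worst_case_factor_of_safety) tuple.
    """
    worst = min(
        grasp_factor_of_safety(normal_force, mu, m)
        for mu, m in itertools.product(mu_values, mass_values)
    )
    return worst >= min_fos, worst

# Hypothetical uncertainty ranges for friction coefficient and mass (kg).
mu_values = [0.3, 0.5, 0.7]
mass_values = [0.5, 1.0, 1.5]

safe_weak, fos_weak = is_safe(20.0, mu_values, mass_values)  # 20 N grip
safe_firm, fos_firm = is_safe(40.0, mu_values, mass_values)  # 40 N grip
print(safe_weak, round(fos_weak, 2))
print(safe_firm, round(fos_firm, 2))
```

Taking the minimum over all sampled parameter pairs means a grasp is rejected as soon as any plausible mass and friction combination drops the factor of safety below the required margin.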
- Europe > Sweden (0.05)
- North America > United States > Texas > Travis County > Austin (0.04)
Safety Pretraining: Toward the Next Generation of Safe AI
Maini, Pratyush, Goyal, Sachin, Sam, Dylan, Robey, Alex, Savani, Yash, Jiang, Yiding, Zou, Andy, Fredrikson, Matt, Lipton, Zachary C., Kolter, J. Zico
As large language models (LLMs) are increasingly deployed in high-stakes settings, the risk of generating harmful or toxic content remains a central challenge. Post-hoc alignment methods are brittle: once unsafe patterns are learned during pretraining, they are hard to remove. In this work, we present a data-centric pretraining framework that builds safety into the model from the start. Our framework consists of four key steps: (i) Safety Filtering: we build a safety classifier to classify web data into safe and unsafe categories; (ii) Safety Rephrasing: we recontextualize unsafe web data into safer narratives; (iii) Native Refusal: we develop the RefuseWeb and Moral Education pretraining datasets, which actively teach the model to refuse unsafe content and explain the moral reasoning behind the refusal; and (iv) Harmfulness-Tag annotated pretraining: we flag unsafe content during pretraining using a special token, and use it to steer the model away from unsafe generations at inference. Our safety-pretrained models reduce attack success rates from 38.8% to 8.4% on standard LLM safety benchmarks with no performance degradation on general tasks.
- North America > United States > Minnesota > St. Louis County > Duluth (0.04)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- Instructional Material (0.92)
- Research Report > New Finding (0.46)
Can You Trust an LLM with Your Life-Changing Decision? An Investigation into AI High-Stakes Responses
Cahyono, Joshua Adrian, Subramanian, Saran
Large Language Models (LLMs) are increasingly consulted for high-stakes life advice, yet they lack standard safeguards against providing confident but misguided responses. This creates risks of sycophancy and over-confidence. This paper investigates these failure modes through three experiments: (1) a multiple-choice evaluation to measure model stability against user pressure; (2) a free-response analysis using a novel safety typology and an LLM Judge; and (3) a mechanistic interpretability experiment to steer model behavior by manipulating a "high-stakes" activation vector. Our results show that while some models exhibit sycophancy, others like o4-mini remain robust. Top-performing models achieve high safety scores by frequently asking clarifying questions--a key feature of a safe, inquisitive approach--rather than issuing prescriptive advice. Furthermore, we demonstrate that a model's cautiousness can be directly controlled via activation steering, suggesting a new path for safety alignment. These findings underscore the need for nuanced, multi-faceted benchmarks to ensure LLMs can be trusted with life-changing decisions.
The Blessing and Curse of Dimensionality in Safety Alignment
Teo, Rachel S. Y., Abdullaev, Laziz U., Nguyen, Tan M.
The focus on safety alignment in large language models (LLMs) has increased significantly due to their widespread adoption across different domains. The scale of LLMs plays a contributing role in their success, and the growth in parameter count follows larger hidden dimensions. In this paper, we hypothesize that while the increase in dimensions has been a key advantage, it may lead to emergent problems as well. These problems emerge because the linear structures in the activation space can be exploited, in the form of activation engineering, to circumvent a model's safety alignment. Through detailed visualizations of linear subspaces associated with different concepts, such as safety, across various model scales, we show that the curse of high-dimensional representations uniquely impacts LLMs. Further substantiating our claim, we demonstrate that projecting the representations of the model onto a lower dimensional subspace can preserve sufficient information for alignment while avoiding those linear structures. Empirical results confirm that such dimensional reduction significantly reduces susceptibility to jailbreaking through representation engineering. Building on our empirical validations, we provide theoretical insights into these linear jailbreaking methods relative to a model's hidden dimensions. Broadly speaking, our work posits that the high dimensions of a model's internal representations can be both a blessing and a curse in safety alignment.
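The projection idea can be sketched with plain vectors: projecting a hidden state onto a lower-dimensional subspace discards any component along directions outside that subspace, including a direction an attacker might steer along. The 4-dimensional toy vectors, the choice of retained subspace, and the function names are assumptions for illustration and do not reproduce the paper's method.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def project(x, basis):
    """Project x onto the span of an orthonormal basis (a list of vectors)."""
    out = [0.0] * len(x)
    for b in basis:
        coeff = dot(x, b)
        out = [o + coeff * bi for o, bi in zip(out, b)]
    return out

# Toy 4-D hidden state; the assumed subspace keeps only the first two axes.
basis = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
hidden = [0.5, -1.0, 2.0, 0.3]

# Hypothetical steering direction lying outside the retained subspace.
steer = [0.0, 0.0, 1.0, 0.0]
steered = [h + 3.0 * s for h, s in zip(hidden, steer)]

# Both states are printed after projection for comparison.
print(project(hidden, basis))
print(project(steered, basis))
```

In this toy setting the steering vector is orthogonal to the retained subspace, so the perturbation vanishes entirely after projection; in a real model the overlap would generally be partial.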
- Asia > Singapore (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- North America > Canada (0.04)