safety model
Towards Safe Reinforcement Learning with a Safety Editor Policy
We consider the safe reinforcement learning (RL) problem of maximizing utility with extremely low constraint violation rates. Assuming no prior knowledge or pre-training of the environment safety model given a task, an agent has to learn, via exploration, which states and actions are safe. A popular approach in this line of research is to combine a model-free RL algorithm with the Lagrangian method to adjust the weight of the constraint reward relative to the utility reward dynamically. It relies on a single policy to handle the conflict between utility and constraint rewards, which is often challenging. We present SEditor, a two-policy approach that learns a safety editor policy transforming potentially unsafe actions proposed by a utility maximizer policy into safe ones.
- North America > United States > Pennsylvania (0.04)
- Asia > China > Zhejiang Province (0.04)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.67)
- North America > United States > Pennsylvania (0.04)
- Europe > Latvia > Lubāna Municipality > Lubāna (0.04)
- Asia > Middle East > Jordan (0.04)
- Research Report > Experimental Study (0.46)
- Research Report > New Finding (0.46)
- North America > United States > California > Santa Clara County > Cupertino (0.04)
- Asia > Middle East > Jordan (0.04)
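The Lagrangian baseline described in the SEditor abstract above can be illustrated with a minimal sketch. It is a generic dual-ascent update, not the SEditor method and not the authors' code. The names `lagrangian_reward`, `update_multiplier`, `cost_limit`, and `lr`, and all numeric values, are assumptions made for illustration; the constraint signal is modeled here as a non-negative cost.

```python
# Generic Lagrangian weighting for constrained RL (illustrative sketch only).
# All names and numbers are hypothetical; this is not the SEditor algorithm.

def lagrangian_reward(utility, cost, lam):
    """Combine utility and constraint cost into a single scalar reward."""
    return utility - lam * cost


def update_multiplier(lam, avg_cost, cost_limit, lr=0.05):
    """Dual ascent step: raise lam when the average cost exceeds the limit,
    lower it otherwise, and keep it non-negative."""
    return max(0.0, lam + lr * (avg_cost - cost_limit))


# The multiplier grows while the observed cost stays above the limit.
lam = 0.0
for avg_cost in [0.5, 0.4, 0.3]:
    lam = update_multiplier(lam, avg_cost, cost_limit=0.1)
```

Because a single policy is trained on this combined reward, the weight `lam` is the only mechanism balancing utility against safety, which is the tension the abstract points to as motivation for a separate safety editor policy.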
Integrating Perceptions: A Human-Centered Physical Safety Model for Human-Robot Interaction
Pandey, Pranav, Parasuraman, Ramviyas, Doshi, Prashant
Ensuring safety in human-robot interaction (HRI) is essential to foster user trust and enable the broader adoption of robotic systems. Traditional safety models primarily rely on sensor-based measures, such as relative distance and velocity, to assess physical safety. However, these models often fail to capture subjective safety perceptions, which are shaped by individual traits and contextual factors. In this paper, we introduce and analyze a parameterized general safety model that bridges the gap between physical and perceived safety by incorporating a personalization parameter, $ρ$, into the safety measurement framework to account for individual differences in safety perception. Through a series of hypothesis-driven human-subject studies in a simulated rescue scenario, we investigate how emotional state, trust, and robot behavior influence perceived safety. Our results show that $ρ$ effectively captures meaningful individual differences, driven by affective responses, trust in task consistency, and clustering into distinct user types. Specifically, our findings confirm that predictable and consistent robot behavior, as well as the elicitation of positive emotional states, significantly enhance perceived safety. Moreover, responses cluster into a small number of user types, supporting adaptive personalization based on shared safety models. Notably, participant role significantly shapes safety perception, and repeated exposure reduces perceived safety for participants in the casualty role, emphasizing the impact of physical interaction and experiential change. These findings highlight the importance of adaptive, human-centered safety models that integrate both psychological and behavioral dimensions, offering a pathway toward more trustworthy and effective HRI in safety-critical domains.
- North America > United States > Georgia > Clarke County > Athens (0.14)
- Asia > Middle East > Jordan (0.04)
- Health & Medicine > Therapeutic Area (0.68)
- Government (0.68)
Adversarial Tokenization
Geh, Renato Lui, Shao, Zilei, Broeck, Guy Van den
Current LLM pipelines account for only one possible tokenization for a given string, ignoring exponentially many alternative tokenizations during training and inference. For example, the standard Llama3 tokenization of penguin is [p,enguin], yet [peng,uin] is another perfectly valid alternative. In this paper, we show that despite LLMs being trained solely on one tokenization, they still retain semantic understanding of other tokenizations, raising questions about their implications in LLM safety. Put succinctly, we answer the following question: can we adversarially tokenize an obviously malicious string to evade safety and alignment restrictions? We show that not only is adversarial tokenization an effective yet previously neglected axis of attack, but it is also competitive against existing state-of-the-art adversarial approaches without changing the text of the harmful request. We empirically validate this exploit across three state-of-the-art LLMs and adversarial datasets, revealing a previously unknown vulnerability in subword models.
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- Africa > Cameroon (0.14)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- Information Technology > Security & Privacy (1.00)
- Government (0.68)
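The Adversarial Tokenization abstract above notes that a string such as penguin admits many valid tokenizations beyond the standard one. A minimal sketch of that idea follows: it enumerates every segmentation of a string over a vocabulary. `TOY_VOCAB` is a hypothetical six-entry vocabulary chosen for illustration, not the Llama3 vocabulary, and `segmentations` is an illustrative helper, not the authors' attack procedure.

```python
# Enumerate all tokenizations of a string over a toy vocabulary.
# TOY_VOCAB is a hypothetical example, not a real model vocabulary.

def segmentations(text, vocab):
    """Return every way to split `text` into pieces that all belong to `vocab`."""
    if not text:
        return [[]]  # one way to tokenize the empty string: no tokens
    results = []
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in vocab:
            # Extend each tokenization of the remainder with this prefix piece.
            for rest in segmentations(text[end:], vocab):
                results.append([piece] + rest)
    return results


TOY_VOCAB = {"p", "enguin", "peng", "uin", "pen", "guin"}
all_splits = segmentations("penguin", TOY_VOCAB)
```

With a real subword vocabulary, which typically contains every single character plus many multi-character pieces, the number of segmentations grows exponentially with string length, so exhaustive enumeration like this is only practical for short strings.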
A Method for Enhancing the Safety of Large Model Generation Based on Multi-dimensional Attack and Defense
Currently, large models are prone to generating harmful content when faced with complex attack instructions, significantly reducing their defensive capabilities. To address this issue, this paper proposes a method based on constructing data aligned with multi-dimensional attack defense to enhance the generative security of large models. The core of our method lies in improving the effectiveness of safe alignment learning for large models by innovatively increasing the diversity of attack instruction dimensions and the accuracy of generating safe responses. To validate the effectiveness of our method, beyond existing security evaluation benchmarks, we additionally designed new security evaluation benchmarks and conducted comparative experiments using Llama3.2 as the baseline model. The final experimental results demonstrate that our method can significantly improve the generative security of large models under complex instructional attacks, while also maintaining and enhancing the models' general capabilities.