safety
Empirical Validation of the Classification-Verification Dichotomy for AI Safety Gates
Can classifier-based safety gates maintain reliable oversight as AI systems improve over hundreds of iterations? We provide comprehensive empirical evidence that they cannot. On a self-improving neural controller (d=240), eighteen classifier configurations (spanning MLPs, SVMs, random forests, k-NN, Bayesian classifiers, and deep networks) all fail the dual conditions for safe self-improvement. Three safe RL baselines (CPO, Lyapunov, safety shielding) also fail. Results extend to MuJoCo benchmarks (Reacher-v4 d=496, Swimmer-v4 d=1408, HalfCheetah-v4 d=1824). At controlled distribution separations up to delta_s=2.0, all classifiers still fail, including the NP-optimal test and MLPs with 100% training accuracy, demonstrating structural impossibility. We then show the impossibility is specific to classification, not to safe self-improvement itself. A Lipschitz ball verifier achieves zero false accepts across dimensions d in {84, 240, 768, 2688, 5760, 9984, 17408} using provable analytical bounds (unconditional delta=0). Ball chaining enables unbounded parameter-space traversal: on MuJoCo Reacher-v4, 10 chains yield a +4.31 reward improvement with delta=0; on Qwen2.5-7B-Instruct during LoRA fine-tuning, 42 chain transitions traverse 234x the single-ball radius with zero safety violations across 200 steps. A 50-prompt oracle confirms that the approach is oracle-agnostic. Compositional per-group verification enables radii up to 37x larger than full-network balls. At d<=17408, delta=0 is unconditional; at LLM scale, it is conditional on estimated Lipschitz constants.
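As a sketch of the mechanism the abstract describes: if the safety score s(theta) is Lipschitz in the parameters with constant L, then any candidate update within margin/L of a certified center provably cannot flip the safety property, and an accepted point can be re-certified as the next center (ball chaining). The function names, the margin re-certification step, and the numbers below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def ball_radius(margin: float, L: float) -> float:
    # Lipschitz bound: |s(theta) - s(theta0)| <= L * ||theta - theta0||,
    # so every point within margin / L of a certified center theta0
    # provably keeps the safety score above threshold (zero false accepts).
    return margin / L

def verify(theta0: np.ndarray, theta: np.ndarray, margin: float, L: float) -> bool:
    """Accept a candidate update only if it lies inside the certified ball."""
    return float(np.linalg.norm(theta - theta0)) <= ball_radius(margin, L)

def chain_step(theta0, theta, margin_at, L):
    """Ball chaining: an accepted point becomes the next certified center,
    so successive balls can traverse far beyond a single-ball radius."""
    if verify(theta0, theta, margin_at(theta0), L):
        return theta   # new certified center for the next ball
    return theta0      # reject: stay at the last certified center

# Illustrative usage with made-up numbers.
rng = np.random.default_rng(0)
theta = np.zeros(240)
L = 5.0
margin_at = lambda th: 1.0          # hypothetical re-certified safety margin
for _ in range(10):
    candidate = theta + rng.normal(scale=0.01, size=theta.shape)
    theta = chain_step(theta, candidate, margin_at, L)
```

With an analytically derived L (as at d<=17408 in the abstract) acceptance is unconditional; with an estimated L (as at LLM scale) the same check holds only conditionally on the estimate.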
A Lyapunov-based Approach to Safe Reinforcement Learning
In many real-world reinforcement learning (RL) problems, besides optimizing the main objective function, an agent must concurrently avoid violating a number of constraints. In particular, besides optimizing performance, it is crucial to guarantee the safety of an agent during training as well as deployment (e.g., a robot should avoid taking actions - exploratory or not - which irrevocably harm its hardware). To incorporate safety in RL, we derive algorithms under the framework of constrained Markov decision processes (CMDPs), an extension of the standard Markov decision processes (MDPs) augmented with constraints on expected cumulative costs.
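For reference, the standard CMDP formulation the abstract refers to augments the usual discounted-return objective with a budget on expected cumulative cost; this is the textbook statement of the framework, not the paper's specific algorithm:

```latex
\max_{\pi}\;\mathbb{E}_{\tau\sim\pi}\!\Big[\sum_{t=0}^{\infty}\gamma^{t}\,r(s_t,a_t)\Big]
\quad\text{subject to}\quad
\mathbb{E}_{\tau\sim\pi}\!\Big[\sum_{t=0}^{\infty}\gamma^{t}\,c(s_t,a_t)\Big]\;\le\;d
```

Here r is the task reward, c is the constraint cost (e.g., an indicator of hardware-damaging actions), gamma is the discount factor, and d is the allowed cost budget. The Lyapunov-based approach constructs functions certifying that this constraint is respected during training as well as at deployment.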
Extending the reward structure in reinforcement learning: an interview with Tanmay Ambadkar
In this interview series, we're meeting some of the AAAI/SIGAI Doctoral Consortium participants to find out more about their research. Tanmay Ambadkar is researching the reward structure in reinforcement learning, with the goal of developing generalizable solutions that offer robust guarantees and are easily deployable. We caught up with Tanmay to find out more about his research, and in particular the constrained reinforcement learning framework he has been working on.
Tell us a bit about your PhD - where are you studying, and what is the topic of your research?
I am a 4th year PhD candidate at The Pennsylvania State University, PA, USA.
- North America > United States > Pennsylvania (0.25)
- Asia > Singapore (0.05)
Starmer 'appeasing' big tech firms, says online safety campaigner
A leading campaigner has accused the prime minister of appeasing big tech companies and being late to the party in regulating social media and artificial intelligence. Crossbench peer Baroness Kidron told the BBC Sir Keir Starmer needed to get on with it rather than launching more consultations. She also criticised the PM for citing his own experience as a father of two teenage children on social media, arguing that this did not make him an expert on the subject and that his family were sheltered compared to others. The government rejected the claims, with a spokesperson saying it had already introduced some of the strongest online safety protections in the world. Sir Keir has launched a consultation on banning under-16s from social media and promised to crack down on the addictive elements of the apps.
- North America > Central America (0.15)
- Oceania > Australia (0.06)
- Europe > United Kingdom > Wales (0.05)
- (12 more...)
- Leisure & Entertainment (1.00)
- Government > Regional Government > Europe Government > United Kingdom Government (0.72)
- Media > Film (0.71)
- North America > United States > Missouri > St. Louis County > St. Louis (0.04)
- Europe > Hungary > Budapest > Budapest (0.04)
- Europe > Germany > Baden-Württemberg > Karlsruhe Region > Heidelberg (0.04)
- North America > United States > Illinois > Champaign County > Urbana (0.04)
- North America > United States > California > Monterey County > Monterey (0.04)
- Asia > Japan > Honshū > Kantō > Ibaraki Prefecture > Tsukuba (0.40)
- North America > United States > California (0.14)
- North America > United States > Alaska (0.04)
- (10 more...)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.68)
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- North America > United States > California > San Diego County > San Diego (0.04)
- Europe > Germany > Baden-Württemberg > Karlsruhe Region > Heidelberg (0.04)
- North America > United States > Texas > Travis County > Austin (0.04)
- (7 more...)
- Transportation (1.00)
- Leisure & Entertainment (0.92)
- Information Technology (0.68)
- Automobiles & Trucks (0.68)