 Law


Oyster-I: Beyond Refusal -- Constructive Safety Alignment for Responsible Language Models

arXiv.org Artificial Intelligence

Large language models (LLMs) typically deploy safety mechanisms to prevent harmful content generation. Most current approaches focus narrowly on risks posed by malicious actors, often framing risks as adversarial events and relying on defensive refusals. However, in real-world settings, risks also come from non-malicious users seeking help while under psychological distress (e.g., self-harm intentions). In such cases, the model's response can strongly influence the user's next actions. Simple refusals may lead them to repeat, escalate, or move to unsafe platforms, creating worse outcomes. We introduce Constructive Safety Alignment (CSA), a human-centric paradigm that protects against malicious misuse while actively guiding vulnerable users toward safe and helpful results. Implemented in Oyster-I (Oy1), CSA combines game-theoretic anticipation of user reactions, fine-grained risk boundary discovery, and interpretable reasoning control, turning safety into a trust-building process. Oy1 achieves state-of-the-art safety among open models while retaining high general capabilities. On our Constructive Benchmark, it shows strong constructive engagement, close to GPT-5, and unmatched robustness on the Strata-Sword jailbreak dataset, nearing GPT-o1 levels. By shifting from refusal-first to guidance-first safety, CSA redefines the model-user relationship, aiming for systems that are not just safe, but meaningfully helpful. We release Oy1, code, and the benchmark to support responsible, user-centered AI.


ChatGPT will soon allow erotica for verified adults, says OpenAI boss

BBC News

OpenAI plans to allow a wider range of content, including erotica, on its popular chatbot ChatGPT as part of its push to treat adult users like adults, says its boss Sam Altman. In a post on X on Tuesday, Mr Altman said upcoming versions of the popular chatbot would enable it to behave in a more human-like way - "but only if you want it, not because we are usage maxxing". The move, reminiscent of the recent introduction by Elon Musk's xAI of two sexually explicit chatbots to Grok, could help OpenAI attract more paying subscribers. It is also likely to intensify pressure on lawmakers to introduce tighter restrictions on chatbot companions. OpenAI did not respond to the BBC's requests for comment following Mr Altman's post.



Spot the difference: Apple has rebranded its TV service as part of a 'vibrant new identity' - so, can you see what has changed?

Daily Mail - Science & tech

Apple TV+ is no more, as Apple has quietly rebranded its streaming service.
'Apple TV+ is now simply Apple TV, with a vibrant new identity,' the tech giant explained in the bottom of a press release on the streaming debut of its film, 'F1 The Movie'.


Can we repair the internet?

MIT Technology Review

Three new books propose remedies that run the gamut from government regulation to user responsibility. From addictive algorithms to exploitative apps, data mining to misinformation, the internet today can be a hazardous place. Books by three influential figures--the intellect behind "net neutrality," a former Meta executive, and the web's own inventor--propose radical approaches to fixing it. But are these luminaries the right people for the job? Though each shows conviction, and even sometimes inventiveness, the solutions they present reveal blind spots.


When Thinking Backfires: Mechanistic Insights Into Reasoning-Induced Misalignment

arXiv.org Artificial Intelligence

With the growing accessibility and wide adoption of large language models, concerns about their safety and alignment with human values have become paramount. In this paper, we identify a concerning phenomenon: Reasoning-Induced Misalignment (RIM), in which misalignment emerges when reasoning capabilities are strengthened, particularly when specific types of reasoning patterns are introduced during inference or training. Beyond reporting this vulnerability, we provide the first mechanistic account of its origins. Through representation analysis, we discover that specific attention heads facilitate refusal by reducing their attention to CoT tokens, a mechanism that modulates the model's rationalization process during inference. During training, we find significantly higher activation entanglement between reasoning and safety in safety-critical neurons than in control neurons, particularly after fine-tuning with those identified reasoning patterns. This entanglement strongly correlates with catastrophic forgetting, providing a neuron-level explanation for RIM.


Calibrating Generative Models

arXiv.org Machine Learning

Generative models frequently suffer miscalibration, wherein class probabilities and other statistics of the sampling distribution deviate from desired values. We frame calibration as a constrained optimization problem and seek the closest model in Kullback-Leibler divergence satisfying calibration constraints. To address the intractability of imposing these constraints exactly, we introduce two surrogate objectives for fine-tuning: (1) the relax loss, which replaces the constraint with a miscalibration penalty, and (2) the reward loss, which converts calibration into a reward fine-tuning problem. We demonstrate that these approaches substantially reduce calibration error across hundreds of simultaneous constraints and models with up to one billion parameters, spanning applications in protein design, image generation, and language modeling.
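
The constrained problem and the first surrogate described above can be written out roughly as follows. This is only a sketch: the symbols (base model p_0, fine-tuned model p_\theta, statistic h, target h^*, penalty weight \lambda), the direction of the KL divergence, and the squared-norm form of the penalty are assumptions made for illustration and are not taken from the abstract.

```latex
% Assumed notation: p_0 is the base model, p_\theta the fine-tuned model,
% h(x) a vector of statistics, h^* their desired values, \lambda > 0 a penalty weight.
\begin{align}
  \min_{\theta}\; & \mathrm{KL}\!\left(p_\theta \,\|\, p_0\right)
  \quad \text{subject to} \quad \mathbb{E}_{x \sim p_\theta}[h(x)] = h^{*}, \\
  \mathcal{L}_{\text{relax}}(\theta) \;=\; & \mathrm{KL}\!\left(p_\theta \,\|\, p_0\right)
  + \lambda \,\bigl\| \mathbb{E}_{x \sim p_\theta}[h(x)] - h^{*} \bigr\|^{2}.
\end{align}
```

In this reading, the second line drops the hard constraint and instead pays a cost that grows with the miscalibration, which is what makes the objective usable for ordinary gradient-based fine-tuning.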


Revisiting Trust in the Era of Generative AI: Factorial Structure and Latent Profiles

arXiv.org Artificial Intelligence

Trust is one of the most important factors shaping whether and how people adopt and rely on artificial intelligence (AI). Yet most existing studies measure trust in terms of functionality, focusing on whether a system is reliable, accurate, or easy to use, while giving less attention to the social and emotional dimensions that are increasingly relevant for today's generative AI (GenAI) systems. These systems do not just process information; they converse, respond, and collaborate with users, blurring the line between tool and partner. In this study, we introduce and validate the Human-AI Trust Scale (HAITS), a new measure designed to capture both the rational and relational aspects of trust in GenAI. Drawing on prior trust theories, qualitative interviews, and two waves of large-scale surveys in China and the United States, we used exploratory (n = 1,546) and confirmatory (n = 1,426) factor analyses to identify four key dimensions of trust: Affective Trust, Competence Trust, Benevolence & Integrity, and Perceived Risk. We then applied latent profile analysis to classify users into six distinct trust profiles, revealing meaningful differences in how affective-competence trust and trust-distrust frameworks coexist across individuals and cultures. Our findings offer a validated, culturally sensitive tool for measuring trust in GenAI and provide new insight into how trust evolves in human-AI interaction. By integrating instrumental and relational perspectives of trust, this work lays the foundation for more nuanced research and design of trustworthy AI systems.


ACADREASON: Exploring the Limits of Reasoning Models with Academic Research Problems

arXiv.org Artificial Intelligence

In recent years, the research focus of large language models (LLMs) and agents has shifted increasingly from demonstrating novel capabilities to complex reasoning and tackling challenging tasks. However, existing evaluations focus mainly on math/code contests or general tasks, while existing multi-domain academic benchmarks lack sufficient reasoning depth, leaving the field without a rigorous benchmark for high-level reasoning. To fill this gap, we introduce the Acadreason benchmark, designed to evaluate the ability of LLMs and agents to acquire and reason over academic knowledge. It consists of 50 expert-annotated academic problems across five high-reasoning domains, including computer science, economics, law, mathematics, and philosophy. All questions are sourced from top-tier publications in recent years and undergo rigorous annotation and quality control to ensure they are both challenging and answerable. We conduct systematic evaluations of over 10 mainstream LLMs and agents. The results show that most LLMs scored below 20 points, with even the cutting-edge GPT-5 achieving only 16 points. While agents achieved higher scores, none exceeded 40 points. This demonstrates the current capability gap between LLMs and agents in super-intelligent academic research tasks and highlights the challenges of Acadreason.


High-Power Training Data Identification with Provable Statistical Guarantees

arXiv.org Artificial Intelligence

Conventional approaches treat training data identification as a simple binary classification task without statistical guarantees. A recent approach is designed to control the false discovery rate (FDR), but its guarantees rely on strong, easily violated assumptions. In this paper, we introduce Provable Training Data Identification (PTDI), a rigorous method that identifies a set of training data with strict FDR control. Specifically, our method computes p-values for each data point using a set of known unseen data, and then constructs a conservative estimator for the data usage proportion of the test set, which allows us to scale these p-values. Our approach then selects the final set of training data by identifying all points whose scaled p-values fall below a data-dependent threshold. This entire procedure enables the discovery of training data with provable, strict FDR control and significantly boosted power. Extensive experiments across a wide range of models (LLMs and VLMs) and datasets demonstrate that PTDI strictly controls the FDR and achieves higher power. These concerns raise the importance of identifying a specific, well-defined set of data allegedly used in training. To resolve such high-stakes disputes, claims must be supported by credible evidence that strictly controls the risk of false positives. This underscores the need for methods that provide rigorous statistical guarantees for identifying training data.
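
The pipeline described above (p-values from known unseen data, a conservative proportion estimate used to scale them, then a data-dependent threshold) resembles a familiar pattern from multiple testing. The sketch below illustrates that pattern with conformal-style p-values, a Storey-type estimate of the unseen fraction, and a Benjamini-Hochberg step-up rule. It is not the PTDI procedure itself: the function names, the membership score (higher means more likely used in training), and the parameters `alpha` and `lam` are all assumptions made for illustration.

```python
from bisect import bisect_left


def conformal_p_values(test_scores, unseen_scores):
    """p-value for H0: 'this point was NOT used in training'.

    Assumes a hypothetical membership score where larger values indicate
    stronger evidence of training use. Each test score is ranked against
    scores of data known to be unseen.
    """
    ref = sorted(unseen_scores)
    n = len(ref)
    pvals = []
    for s in test_scores:
        at_least = n - bisect_left(ref, s)  # reference scores >= s
        pvals.append((1 + at_least) / (n + 1))
    return pvals


def estimate_unseen_fraction(pvals, lam=0.5):
    """Storey-type estimate of the fraction of unseen (null) points.

    Biased upward, i.e. it errs toward a smaller data usage proportion.
    """
    m = len(pvals)
    above = sum(p > lam for p in pvals)
    return min(1.0, (1 + above) / (m * (1 - lam)))


def select_training_points(pvals, alpha=0.1, lam=0.5):
    """Scale the p-values, then apply a Benjamini-Hochberg step-up rule."""
    m = len(pvals)
    pi0 = estimate_unseen_fraction(pvals, lam)
    scaled = [min(1.0, pi0 * p) for p in pvals]

    # Largest rank k whose k-th smallest scaled p-value is <= alpha * k / m.
    order = sorted(range(m), key=lambda i: scaled[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if scaled[i] <= alpha * rank / m:
            k = rank
    if k == 0:
        return []
    threshold = alpha * k / m  # the data-dependent threshold
    return sorted(i for i in range(m) if scaled[i] <= threshold)


if __name__ == "__main__":
    unseen = list(range(100))                   # scores of known unseen data
    test = [1000] * 5 + [10, 20, 30, 40, 50]    # first five look like members
    pvals = conformal_p_values(test, unseen)
    print(select_training_points(pvals, alpha=0.1))
```

Scaling by the estimated unseen fraction is what can raise power over the plain step-up rule: when many test points were in fact used in training, the estimate drops below 1 and more scaled p-values clear the threshold.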