 Generative AI


Meta Poaches Key Google AI Researcher

TIME - Tech

Upon its release earlier this month, OpenAI's Sora 2 model took the Internet by storm, thanks to its ability to generate realistic videos from just a text prompt. But Sora is about more than just capturing eyeballs with viral content. "On the surface, Sora, for example, does not look like it is AGI-relevant," OpenAI CEO Sam Altman said on a podcast earlier this month. "But I would bet that if we can build really great world models, that will be much more important to AGI than people think." Altman was speaking to a growing belief inside the AI industry at large: that if you can simulate the world with enough accuracy, you could drop AI agents into those simulations. There, they could learn more skills than they currently can from just text, photos, and videos--because they could interact with a simulated world. That form of training could be highly efficient, in part because simulated time can be accelerated, and because many simulations can be run in parallel.


Salesforce's CEO backtracks after saying Trump should send troops into San Francisco

The Guardian

In tech this week: The CEO of the city's largest private employer apologizes, Amazon Web Services' outage, and OpenAI's Sora makes waves. What I'm watching this week: South Park's caricature of Peter Thiel and his obsession with the antichrist. Read our reporting on the show's inspiration: Thiel's bizarre off-the-record lectures on the subject. And now, let's get into things. Marc Benioff, the co-founder and CEO of Salesforce, said last week that Donald Trump should make good on his threats to send the US national guard into San Francisco, despite resistance from local leaders. Even Benioff's own public relations manager was aghast at his remarks, according to the New York Times.


Bryan Cranston thanks OpenAI for cracking down on Sora 2 deepfakes

The Guardian

Bryan Cranston pictured speaking at a Sag-Aftra strike rally in 2023 in New York. The Breaking Bad actor went to the union with concerns after users of OpenAI's generative video platform Sora 2 were able to generate his likeness without his consent. Users of the generative AI video app were able to recreate the Breaking Bad actor's likeness without his consent, which OpenAI called 'unintentional'. Bryan Cranston has said he is "grateful" to OpenAI for cracking down on deepfakes of himself on the company's generative AI video platform Sora 2, after users were able to generate his voice and likeness without his consent.


VERINA: Benchmarking Verifiable Code Generation

arXiv.org Artificial Intelligence

Large language models (LLMs) are increasingly integrated in software development, but ensuring correctness in LLM-generated code remains challenging and often requires costly manual review. Verifiable code generation -- jointly generating code, specifications, and proofs of code-specification alignment -- offers a promising path to address this limitation and further unleash LLMs' benefits in coding. Yet, there exists a significant gap in evaluation: current benchmarks often focus on only individual components rather than providing a holistic evaluation framework of all tasks. In this paper, we introduce Verina (Verifiable Code Generation Arena), a high-quality benchmark enabling a comprehensive and modular evaluation of code, specification, and proof generation as well as their compositions. Verina consists of 189 manually curated coding tasks in Lean, with detailed problem descriptions, reference implementations, formal specifications, and extensive test suites. Our extensive evaluation of state-of-the-art LLMs reveals significant challenges in verifiable code generation, especially in proof generation, underscoring the need for improving LLM-based theorem provers in verification domains. The best model, OpenAI o4-mini, achieves a 61.4% code correctness rate, 51.0% for specification soundness and completeness, and a mere 3.6% proof success rate (based on one trial per task). We hope Verina will catalyze progress in verifiable code generation by providing a rigorous and comprehensive benchmark. We release our dataset on https://huggingface.co/datasets/sunblaze-ucb/verina and our evaluation code on https://github.com/sunblaze-ucb/verina.
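
To make the three artifacts named in the abstract concrete (code, specification, and a proof that the code meets the specification), here is a minimal Lean 4 sketch. It is not taken from the Verina dataset: the task and the names `myMax`, `myMaxSpec`, and `myMax_satisfies_spec` are hypothetical and chosen only for illustration, and the proof assumes a recent Lean 4 toolchain in which the `omega` tactic is available.

```lean
-- Code: a hypothetical task, return the larger of two natural numbers.
def myMax (a b : Nat) : Nat :=
  if a ≤ b then b else a

-- Specification: the result bounds both inputs and is one of them.
def myMaxSpec (a b r : Nat) : Prop :=
  a ≤ r ∧ b ≤ r ∧ (r = a ∨ r = b)

-- Proof: the implementation satisfies the specification.
theorem myMax_satisfies_spec (a b : Nat) : myMaxSpec a b (myMax a b) := by
  unfold myMaxSpec myMax
  -- Case-split on the `if` condition, then close each case by linear arithmetic.
  split <;> omega
```

In a setting like the one the abstract describes, each of the three pieces could be generated and checked separately, which is roughly what a modular evaluation of code, specification, and proof generation would measure.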


Agentic Reinforcement Learning for Search is Unsafe

arXiv.org Artificial Intelligence

Agentic reinforcement learning (RL) trains large language models to autonomously call tools during reasoning, with search as the most common application. These models excel at multi-step reasoning tasks, but their safety properties are not well understood. In this study, we show that RL-trained search models inherit refusal from instruction tuning and often deflect harmful requests by turning them into safe queries. However, this safety is fragile. Two simple attacks, one that forces the model to begin its response with a search (Search attack), another that encourages models to repeatedly search (Multi-search attack), trigger cascades of harmful searches and answers. The attacks succeed by triggering models to generate harmful, request-mirroring search queries before they can generate the inherited refusal tokens. This exposes a core weakness of current RL training: it rewards continued generation of effective queries without accounting for their harmfulness. As a result, RL search models have vulnerabilities that users can easily exploit, making it urgent to develop safety-aware agentic RL pipelines optimising for safe search. Instruction tuning (IT) is the standard method to align large language models (LLMs) with human preferences and teach them to refuse harmful requests (Schulman et al., 2017; Shao et al., 2024). However, IT only shapes static responses and is insufficient in agentic settings, where models must also decide when and how to call external tools, capabilities not explicitly learned during pre-training (Zhang et al., 2025). Agentic reinforcement learning (RL) for tool-use (Zhang et al., 2025) tackles this by fine-tuning models to interleave reasoning with tool use (Dong et al., 2025). In practice, search is the most common tool: agentic RL rewards effective, well-timed search queries and achieves strong gains on multi-hop reasoning tasks (Song et al., 2025a;b; Jin et al., 2025).
Despite this progress, the effect of agentic RL on the safety of IT models remains unclear. While prior work reported safety degradation of retrieval-augmented agents (Yu et al., 2025), little is known about whether agentic RL for search preserves refusal of harmful requests. As agentic RL is now deployed in closed-source systems such as OpenAI's DeepSearch (OpenAI, 2025), this evaluation gap can create real deployment risks.


BenCao: An Instruction-Tuned Large Language Model for Traditional Chinese Medicine

arXiv.org Artificial Intelligence

Traditional Chinese Medicine (TCM), with a history spanning over two millennia, plays a role in global healthcare. However, applying large language models (LLMs) to TCM remains challenging due to its reliance on holistic reasoning, implicit logic, and multimodal diagnostic cues. Existing TCM-domain LLMs have made progress in text-based understanding but lack multimodal integration, interpretability, and clinical applicability. To address these limitations, we developed BenCao, a ChatGPT-based multimodal assistant for TCM, integrating structured knowledge bases, diagnostic data, and expert feedback refinement. BenCao was trained through natural language instruction tuning rather than parameter retraining, aligning with expert-level reasoning and ethical norms specific to TCM. The system incorporates a comprehensive knowledge base of over 1,000 classical and modern texts, a scenario-based instruction framework for diverse interactions, a chain-of-thought simulation mechanism for interpretable reasoning, and a feedback refinement process involving licensed TCM practitioners. BenCao connects to external APIs for tongue-image classification and multimodal database retrieval, enabling dynamic access to diagnostic resources. In evaluations across single-choice question benchmarks and multimodal classification tasks, BenCao achieved superior accuracy to general-domain and TCM-domain models, particularly in diagnostics, herb recognition, and constitution classification. The model was deployed as an interactive application on the OpenAI GPTs Store, accessed by nearly 1,000 users globally as of October 2025. This study demonstrates the feasibility of developing a TCM-domain LLM through natural language-based instruction tuning and multimodal integration, offering a practical framework for aligning generative AI with traditional medical reasoning and a scalable pathway for real-world deployment.


Schrödinger Bridge Mamba for One-Step Speech Enhancement

arXiv.org Artificial Intelligence

Abstract: We propose Schrödinger Bridge Mamba (SBM), a new concept of training-inference framework motivated by the inherent compatibility between the Schrödinger Bridge (SB) training paradigm and the selective state-space model Mamba. Experiments on a joint denoising and dereverberation task using four benchmark datasets demonstrate that SBM, with only 1-step inference, outperforms strong baselines with 1-step or iterative inference and achieves the best real-time factor (RTF). Beyond speech enhancement, we discuss the integration of the SB paradigm and the selective state-space model architecture based on their underlying alignment, which indicates a promising direction for exploring new deep generative models potentially applicable to a broad range of generative tasks. Index Terms: Schrödinger Bridge, Mamba, deep generative model, speech enhancement. 1. Introduction: Deep generative models have been increasingly employed for speech enhancement (SE) tasks. By learning the underlying distribution of clean audio given its degraded counterpart, generative models are capable of generating high-quality speech from low-quality inputs that include noise, reverberation, clipping, bandwidth limitation or a mixture of these artifacts.


In Generative AI We (Dis)Trust? Computational Analysis of Trust and Distrust in Reddit Discussions

arXiv.org Artificial Intelligence

The rise of generative AI (GenAI) has impacted many aspects of human life. As these systems become embedded in everyday practices, understanding public trust in them also becomes essential for responsible adoption and governance. Prior work on trust in AI has largely drawn from psychology and human-computer interaction, but there is a lack of computational, large-scale, and longitudinal approaches to measuring trust and distrust in GenAI and large language models (LLMs). This paper presents the first computational study of Trust and Distrust in GenAI, using a multi-year Reddit dataset (2022--2025) spanning 39 subreddits and 197,618 posts. Crowd-sourced annotations of a representative sample were combined with classification models to scale analysis. We find that Trust and Distrust are nearly balanced over time, with shifts around major model releases. Technical performance and usability dominate as dimensions, while personal experience is the most frequent reason shaping attitudes. Distinct patterns also emerge across trustors (e.g., experts, ethicists, general users). Our results provide a methodological framework for large-scale Trust analysis and insights into evolving public perceptions of GenAI.


OpenAI's Sora Underscores the Growing Threat of Deepfakes

TIME - Tech

When OpenAI released its AI video-generation app, Sora, in September, it promised that "you are in control of your likeness end-to-end." The app allows users to include themselves and their friends in videos through a feature called "cameos"--the app scans a user's face and performs a liveness check, providing data to generate a video of the user and to authenticate their consent for friends to use their likeness on the app. But Reality Defender, a company specializing in identifying deepfakes, says it was able to bypass Sora's anti-impersonation safeguards within 24 hours. Platforms such as Sora give a "plausible sense of security," says Reality Defender CEO Ben Colman, despite the fact that "anybody can use completely off-the-shelf tools" to pass authentication as someone else. Reality Defender's researchers used publicly available footage of notable individuals, including CEOs and entertainers, from earnings calls and media interviews.


NVIDIA RTX 5090 outperforms AMD and Apple running local OpenAI language models

PCWorld

Developers and creatives looking for greater control and privacy with their AI are increasingly turning to locally run models like OpenAI's new gpt-oss family of models, which are both lightweight and incredibly functional on end-user hardware. Indeed, you can have it run on consumer GPUs with just 16GB of memory. That makes it possible to use a wide range of hardware - with NVIDIA GPUs emerging as the best way to run these sorts of open-weight models. While nations and companies rush to develop their own bespoke AI solutions to a range of tasks, open source and open-weight models like OpenAI's new gpt-oss-20b are finding much more adoption.