Large Language Model
OpenAI floats idea of global AI governance body with U.S. and China
The U.S. has an opportunity to use its lead in artificial intelligence technology to create a global governance mechanism to ensure safer, more resilient systems, OpenAI's vice president of global affairs, Chris Lehane, said. OpenAI would support the creation of a global governance body for artificial intelligence led by the U.S. and including China as a member, a top company executive said, hours before the start of U.S. President Donald Trump's high-stakes meeting with Chinese President Xi Jinping. When asked about the China summit, OpenAI's vice president of global affairs, Chris Lehane, said Wednesday that the U.S. has an opportunity to use its lead in AI technology to create a global governance mechanism resulting in safer, more resilient systems. "AI, in some level, transcends a lot of the prevailing or traditional trade type of issues," Lehane told reporters during a briefing at the company's offices in Washington. "There is an opportunity to really start to build something up globally, and have countries around the world, including China, potentially participate."
Microsoft is retiring Copilot Mode on Edge, because everything is Copilot Mode now
Microsoft is retiring Copilot Mode on Edge, because its features are now built directly into the browser for both desktop and mobile. If you'll recall, Microsoft started testing Copilot Mode on Edge in July last year, allowing you to use it to search for information across multiple open browser tabs and to analyze the details on each page. Now, the feature is available not just on desktop, but also on Edge for mobile. Just ask Copilot a question or give it a command, such as "Compare the smart TVs across all my open tabs," and it will pull info from your tabs to give you a structured, side-by-side comparison. After the initial testing of Copilot Mode, Microsoft rolled out Journeys, which you can use to save projects you can revisit in the future. It's now also available for free on mobile, so you can pick up planning trips or making purchases from where you left off days or weeks ago.
Uncovering Symmetry Transfer in Large Language Models via Layer-Peeled Optimization
Du, Zhehang, He, Hangfeng, Su, Weijie
Large language models (LLMs) are pretrained by minimizing the cross-entropy loss for next-token prediction. In this paper, we study whether this optimization strategy can induce geometric structure in the learned model weights and context embeddings. We approach this problem by analyzing a constrained layer-peeled optimization program, which serves as a mathematically tractable surrogate for LLMs by treating the output projection matrix and last-layer context embeddings as optimization variables. Our analysis of this nonconvex optimization program demonstrates that symmetries in the target next-token distributions are transferred to the global minimizers of the layer-peeled model in a precise group-theoretic sense. Specifically, we prove that when the target tokens exhibit a cyclic-shift symmetry (such as the seven days of the week or the twelve months of the year), the optimal logit matrix is exactly circulant, and the Gram matrices of both the output projections and the context embeddings form circulant geometries as well. Next, for exchangeable target distributions invariant under the symmetric group and, more generally, under two-transitive group actions, we show that the global optimal output projection matrix forms a simplex equiangular tight frame, while the optimal logit matrix and context embeddings inherit the permutation symmetries present in the input data. A key technical step is to reduce the constrained nonconvex factorized problem to an explicit logit-level convex characterization for cyclic symmetry and to a symmetry-based lower bound for permutation symmetry, together with a sharp characterization of the optimal factorization. Finally, we empirically demonstrate that open-source LLMs naturally exhibit symmetries consistent with our theoretical predictions, despite being trained without any explicit regularization promoting such geometric structure.
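To make the two geometries named in the abstract concrete, here is a minimal sketch in plain Python. It only illustrates what "circulant" and "simplex equiangular tight frame" mean at the level of a matrix; it is not the authors' analysis, and the function names and the toy seven-entry row are hypothetical.

```python
def circulant(first_row):
    """Build the circulant matrix whose row i is first_row shifted right by i."""
    n = len(first_row)
    return [[first_row[(j - i) % n] for j in range(n)] for i in range(n)]


def is_circulant(matrix, tol=1e-9):
    """Check that entry (i, j) depends only on (j - i) mod n."""
    n = len(matrix)
    return all(
        abs(matrix[i][j] - matrix[0][(j - i) % n]) <= tol
        for i in range(n)
        for j in range(n)
    )


def simplex_etf_gram(k):
    """Gram matrix of k unit vectors forming a simplex ETF:
    1 on the diagonal and -1/(k-1) everywhere else."""
    off_diagonal = -1.0 / (k - 1)
    return [[1.0 if i == j else off_diagonal for j in range(k)] for i in range(k)]


# Hypothetical toy "logit row" for a seven-element cycle (e.g. days of the week):
# the largest value sits on the token itself, smaller values on its neighbours.
toy_logits = circulant([3.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0])
print(is_circulant(toy_logits))
print(is_circulant(simplex_etf_gram(5)))
```

The simplex ETF Gram matrix is itself circulant, which is one way to see the exchangeable case as a more symmetric relative of the cyclic one.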
When Should an AI Workflow Release? Always-Valid Inference for Black-Box Generate-Verify Systems
Cho, Young Hyun, Sun, Will Wei
LLM-enabled AI workflows increasingly produce outputs through iterative generate-evaluate-revise loops. Each iteration can improve the candidate, but it also creates a release decision: when to stop and output the current result? This raises a statistical challenge because deployment-time evaluator scores are adaptively generated and repeatedly monitored, yet the likelihood models or exchangeability assumptions typically used for calibration are unavailable. We propose an always-valid release wrapper for existing generator-evaluator pipelines. The wrapper builds a hard-negative reference pool of high-scoring failures, calibrates deployment-time evaluator scores against this pool, and accumulates the resulting evidence with an e-process. This separates two roles: the reference pool turns black-box scores into conservative evidence, while the e-process provides validity under optional stopping. In theory, we show that a conservative reference pool yields finite-sample control of the probability of releasing on infeasible tasks, that is, tasks for which the given workflow is not capable of producing a reliable solution. We also characterize conditions under which the same conservative rule still achieves nontrivial release on feasible tasks. In an MBPP+ coding-agent case study, the wrapper reduces premature incorrect release relative to baseline stopping rules while still releasing on tasks for which the workflow repeatedly accumulates moderate supporting evidence.
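The abstract does not spell out the wrapper's construction, so the following is only a generic sketch of the two roles it describes: a reference pool that turns a black-box score into conservative evidence, and an e-process that tolerates repeated monitoring. The rank-based p-value, the power calibrator `kappa * p**(kappa - 1)`, the running product, and all names and thresholds are assumptions chosen for illustration, not the authors' method.

```python
def reference_p_value(score, reference_pool):
    """Conservative rank-based p-value: how often do known failures
    in the reference pool score at least as high as this candidate?"""
    n_at_least = sum(1 for r in reference_pool if r >= score)
    return (1 + n_at_least) / (len(reference_pool) + 1)


def p_to_e(p, kappa=0.5):
    """Power calibrator mapping a p-value to an e-value (assumed choice).
    For kappa in (0, 1), kappa * p**(kappa - 1) integrates to 1 over [0, 1]."""
    return kappa * p ** (kappa - 1.0)


def release_time(scores, reference_pool, alpha=0.05, kappa=0.5):
    """Multiply per-iteration e-values and release at the first iteration
    where accumulated evidence reaches 1/alpha. Returns None if it never does."""
    evidence = 1.0
    for step, score in enumerate(scores, start=1):
        evidence *= p_to_e(reference_p_value(score, reference_pool), kappa)
        if evidence >= 1.0 / alpha:
            return step
    return None


# Hypothetical hard-negative pool: evaluator scores of known failures.
pool = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
print(release_time([0.95] * 10, pool))  # scores consistently above every failure
print(release_time([0.05] * 10, pool))  # scores below every failure
```

The stopping threshold `1/alpha` is the usual one for e-processes; whether the product of these particular e-values is valid depends on conditions the sketch does not check.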
LLMs as Implicit Imputers: Uncertainty Should Scale with Missing Information
Large language models (LLMs) are increasingly deployed in settings where the available context is incomplete or degraded. We argue that an LLM generating answers under incomplete context can be viewed as an implicit imputer, and evaluated against a criterion from the multiple imputation (MI) literature: uncertainty should scale with the amount of missing information. We assess this criterion on SQuAD, using a controlled framework in which context availability is varied across five levels. We evaluate two answer-level uncertainty measures that can be estimated from repeated sampling: sampling-based confidence (empirical mode frequency) and response entropy. Confidence fails to reflect increasing missingness: it remains high even as accuracy collapses. Entropy, by contrast, increases with context removal, consistent with the MI analogy, and explains substantially more variance in accuracy than confidence across all evidence levels (quadratic $R^2$ gap up to 0.057). We further introduce a black-box diagnostic $ρ_R(α)$ that estimates the proportion of baseline uncertainty resolved by context level $α$, requiring only repeated sampling with and without context. These results suggest that entropy is a more responsive black-box uncertainty measure than confidence under incomplete context.
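The two answer-level measures named above can be estimated from nothing more than a list of sampled answers. The sketch below assumes answers are compared as exact strings (any normalization is left out) and uses natural-log Shannon entropy; the function names and the toy answer lists are hypothetical.

```python
import math
from collections import Counter


def mode_confidence(answers):
    """Sampling-based confidence: share of samples equal to the most common answer."""
    counts = Counter(answers)
    return max(counts.values()) / len(answers)


def response_entropy(answers):
    """Shannon entropy (in nats) of the empirical answer distribution."""
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Two hypothetical sample sets with the same mode frequency but different spread.
concentrated = ["Paris"] * 6 + ["Lyon"] * 4
scattered = ["Paris"] * 6 + ["Lyon", "Nice", "Lille", "Metz"]

print(mode_confidence(concentrated), response_entropy(concentrated))
print(mode_confidence(scattered), response_entropy(scattered))
```

Both sets have the same mode frequency, yet the scattered one has higher entropy, which is one reason entropy can react to lost context when mode frequency does not.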
Learning Perturbations to Extrapolate Your LLM
Cen, Zetai, Gu, Chenfei, Zhu, Jin, Li, Ting, Chen, Yunxiao, Shi, Chengchun
Training large language models (LLMs) such as GPT-5 and Qwen-3 (Singh et al., 2025; Yang et al., 2025) on massive text corpora aims at capturing the underlying distribution of natural language. Yet, it remains challenging for the trained model to extrapolate to out-of-distribution or out-of-domain settings beyond the support of its training data. The literature has seen the development of various data perturbation techniques, such as synonym replacement, random insertion, deletion, and swap, that modify training instances into semantically similar variants to effectively expose LLMs to a broader range of inputs and improve their ability to generalize beyond the training data (Feng et al., 2019, 2020; Li et al., 2024; Cen et al., 2026). However, these approaches remain grounded in the discrete, word-level augmentation procedures mentioned previously, which may restrict their adaptivity across diverse domains. While discrete perturbations are simple to use, they can be too coarse and hard to refine due to the complexity of natural language (Park et al., 2022; Li et al., 2023). Meanwhile, fixed perturbations apply the same transformations to the data regardless of context, thus failing to generalize appropriately (Ismailov and Asanova, 2025).
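For readers unfamiliar with the discrete, word-level perturbations the passage criticizes, here is a minimal sketch of the four operations it lists. This shows the fixed baseline techniques, not the learned perturbations the paper proposes; the tiny synonym table, function names, and probabilities are hypothetical, and token lists are assumed to be non-empty.

```python
import random

# Hypothetical toy synonym table; a real pipeline would use a thesaurus resource.
SYNONYMS = {"quick": ["fast", "rapid"], "happy": ["glad", "cheerful"]}


def synonym_replace(tokens, rng):
    """Replace one randomly chosen token that has an entry in SYNONYMS."""
    out = list(tokens)
    candidates = [i for i, tok in enumerate(out) if tok in SYNONYMS]
    if candidates:
        i = rng.choice(candidates)
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out


def random_insert(tokens, rng):
    """Insert a synonym of some token at a random position."""
    out = list(tokens)
    candidates = [tok for tok in out if tok in SYNONYMS]
    if candidates:
        word = rng.choice(SYNONYMS[rng.choice(candidates)])
        out.insert(rng.randrange(len(out) + 1), word)
    return out


def random_delete(tokens, rng, p=0.1):
    """Drop each token independently with probability p, keeping at least one."""
    kept = [tok for tok in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


def random_swap(tokens, rng):
    """Swap the tokens at two distinct random positions."""
    out = list(tokens)
    if len(out) >= 2:
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


rng = random.Random(0)
sentence = "the quick dog is happy".split()
for perturb in (synonym_replace, random_insert, random_delete, random_swap):
    print(perturb.__name__, perturb(sentence, rng))
```

Each operation applies the same rule whatever the sentence says, which is the "fixed" and "coarse" behaviour the passage points to.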
A Hierarchical Language Model with Predictable Scaling Laws and Provable Benefits of Reasoning
Gaitonde, Jason, Koehler, Frederic, Mossel, Elchanan, Shin, Joonhyung, Sly, Allan
We introduce a family of synthetic languages with hierarchical structure -- generated by a broadcast process on trees -- for which the role of context length and reasoning in autoregressive generation can be analyzed precisely. At the heart of our analytic approach is an \emph{exact $k$-gram ansatz} in place of transformers with context length $k$, a substitution we then validate empirically. Using this ansatz we derive explicit asymptotic predictions for distributional statistics of the sequences produced by a trained model, instantiated in two settings. For the \emph{Ising broadcast process} (a soft-constrained language), we prove that the variance of the generated sum scales log-linearly in the context depth and its kurtosis converges to that of a Gaussian -- both deviating from the true language for any sublinear context. For the \emph{coloring broadcast process} (a hard-constrained language) in the freezing regime, bounded-context autoregression produces sequences that, with high probability, are inconsistent with \emph{any} valid coloring of the underlying tree. Together these results imply an $Ω(n)$ lower bound on the context length required to faithfully sample length-$n$ sequences. In contrast, we prove that an autoregressive \emph{reasoning} model with only $Θ(\log n)$ working memory can sample exactly from the true language -- an exponential improvement. We confirm both the lower-bound predictions and the reasoning-based upper bound empirically with transformers trained on the synthetic language; the trained models track our asymptotic predictions quantitatively across a wide range of context sizes.
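As a rough picture of how a broadcast process on a tree can generate a sequence, the sketch below simulates an Ising-style broadcast and reads off the leaves. The binary branching, the single flip probability, the function name, and the use of the leaf sum as the statistic are all assumptions for illustration; the abstract does not fix these details.

```python
import random


def ising_broadcast_leaves(depth, flip_prob, rng):
    """Broadcast a +/-1 spin from the root of a complete binary tree.
    Each child copies its parent's spin, flipping it with probability
    flip_prob. The leaf spins, read left to right, form the sequence."""
    level = [rng.choice([-1, 1])]
    for _ in range(depth):
        next_level = []
        for spin in level:
            for _child in range(2):  # assumed binary branching
                flipped = rng.random() < flip_prob
                next_level.append(-spin if flipped else spin)
        level = next_level
    return level


rng = random.Random(0)
leaves = ising_broadcast_leaves(depth=4, flip_prob=0.1, rng=rng)
print(len(leaves), leaves)
print("sum of generated sequence:", sum(leaves))
```

Nearby leaves share recent ancestors and so tend to agree, while distant leaves are correlated only through ancestors far up the tree; this is the kind of long-range hierarchical dependence a bounded context has trouble reproducing.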
OpenAI endorses the Kids Online Safety Act
OpenAI, which is currently facing a raft of lawsuits over alleged safety lapses in ChatGPT, has endorsed the Kids Online Safety Act (KOSA). The company said that its endorsement was part of a broader commitment to create AI-specific rules for kids' safety. OpenAI's endorsement comes as KOSA, which passed the Senate in 2024, appears to be gaining some momentum. KOSA, which was first introduced in 2022, is one of several online safety bills that would require social media companies and other online platforms to implement stronger protections for children. The bill has been revised a number of times, but the current version includes a requirement for social media apps to allow minors to opt out of addictive features and algorithmic recommendations.
OpenAI Brings Its Ass to Court
The company sought to show the jury a remarkable trophy as physical proof of Elon Musk's concerning behavior. Wednesday's episode of the trial kicked off with a unique proposition: OpenAI wanted to bring its ass into the courtroom, and lay it bare before the jury. It's a good thing Lady Justice wears that blindfold. A lawyer for Sam Altman's AI behemoth, Bradley Wilson, approached US district judge Yvonne Gonzalez Rogers and handed her a small gold statue with a white stone base. It depicted the rear end of a donkey--with two legs, a butt, and a tail--and was inscribed with the message, "Never stop being a jackass for safety."
Reports of the Workshops Held at the 2026 AAAI Conference on Artificial Intelligence
The 10th International Workshop on Health Intelligence (W3PHIAI-26) celebrated a decade of bringing AI and health research together, building on a lineage that began with the AAAI-W3PHI workshops focused on population health (2014-2016), the AAAI-HIAI workshops focused on personalized health (2013-2016), and the subsequent joint W3PHIAI workshops held annually from 2017 through 2025. Over this decade, the series has produced hundreds of talks and high-impact publications that have collectively received thousands of citations, shaping the research agenda in both population health intelligence and personalized healthcare AI. This year's special theme, "Foundation Models and AI Agents," reflected the field's rapidly evolving frontier: the emergence of autonomous and semi-autonomous AI systems reshaping clinical workflows, patient management, health system operations, and public health surveillance. Day 1 of the workshop focused on medical imaging and the translation of AI for clinical ...