 Large Language Model


AI Bots Are Now a Significant Source of Web Traffic

WIRED

New data shows AI bots pushing deeper into the web, prompting publishers to roll out more aggressive defenses. The viral virtual assistant OpenClaw--formerly known as Moltbot, and before that Clawdbot--is a symbol of a broader revolution underway that could fundamentally alter how the internet functions. Instead of a place primarily inhabited by humans, the web may very soon be dominated by autonomous AI bots. A new report measuring bot activity on the web, as well as related data shared with WIRED by the internet infrastructure company Akamai, shows that AI bots already account for a meaningful share of web traffic. The findings also shed light on an increasingly sophisticated arms race unfolding as bots deploy clever tactics to bypass website defenses meant to keep them out.


HHS Is Making an AI Tool to Create Hypotheses About Vaccine Injury Claims

WIRED

Experts worry Robert F. Kennedy Jr.'s Health Department will use an internal AI tool to analyze vaccine injury claims in a way that furthers his anti-vaccine agenda. The US Department of Health and Human Services is developing a generative artificial intelligence tool to find patterns across data reported to a national vaccine monitoring database and to generate hypotheses on the negative effects of vaccines, according to an inventory released last week of all use cases the agency had for AI in 2025. The tool has not yet been deployed, according to the HHS document, and an AI inventory report from the previous year shows that it has been in development since late 2023. But experts worry that the predictions it generates could be used by Health and Human Services secretary Robert F. Kennedy Jr. to further his anti-vaccine agenda. A long-standing vaccine critic, Kennedy has upended the childhood vaccination schedule in his year in office, removing several shots from a list of recommended immunizations for all children, including those for Covid-19, influenza, hepatitis A and B, meningococcal disease, rotavirus, and respiratory syncytial virus, or RSV.


Evaluating LLMs When They Do Not Know the Answer: Statistical Evaluation of Mathematical Reasoning via Comparative Signals

arXiv.org Machine Learning

Evaluating mathematical reasoning in LLMs is constrained by limited benchmark sizes and inherent model stochasticity, yielding high-variance accuracy estimates and unstable rankings across platforms. On difficult problems, an LLM may fail to produce a correct final answer, yet still provide reliable pairwise comparison signals indicating which of two candidate solutions is better. We leverage this observation to design a statistically efficient evaluation framework that combines standard labeled outcomes with pairwise comparison signals obtained by having models judge auxiliary reasoning chains. Treating these comparison signals as control variates, we develop a semiparametric estimator based on the efficient influence function (EIF) for the setting where auxiliary reasoning chains are observed. This yields a one-step estimator that achieves the semiparametric efficiency bound, guarantees strict variance reduction over naive sample averaging, and admits asymptotic normality for principled uncertainty quantification. Across simulations, our one-step estimator substantially improves ranking accuracy, with gains increasing as model output noise grows. Experiments on GPQA Diamond, AIME 2025, and GSM8K further demonstrate more precise performance estimation and more reliable model rankings, especially in small-sample regimes where conventional evaluation is unstable.
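
The control-variate idea mentioned in this abstract can be illustrated with a textbook estimator. The sketch below is not the paper's EIF-based one-step estimator; it is the classical control-variate adjustment, under the assumption that the mean of the auxiliary signal is known exactly. The names `control_variate_mean` and `simulate`, the toy data model (a binary "comparison signal" that agrees with correctness 80% of the time), and all constants are hypothetical.

```python
import random
import statistics


def control_variate_mean(y, x, x_mean):
    """Estimate E[Y] using X as a control variate.

    Assumes x_mean = E[X] is known; the coefficient beta is
    estimated from the same sample (classical control variates).
    """
    n = len(y)
    y_bar = statistics.fmean(y)
    x_bar = statistics.fmean(x)
    var_x = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
    if var_x == 0:
        # The control carries no information in this sample.
        return y_bar
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    beta = cov_xy / var_x
    # Shift the naive mean by how far the control's sample mean
    # drifted from its known expectation.
    return y_bar - beta * (x_bar - x_mean)


def simulate(num_trials=2000, n=50, seed=0):
    """Compare the variance of the naive and adjusted estimators."""
    rng = random.Random(seed)
    naive, adjusted = [], []
    for _ in range(num_trials):
        # x: toy auxiliary signal; y: toy correctness label that
        # matches x with probability 0.8 (hypothetical data model).
        x = [1.0 if rng.random() < 0.5 else 0.0 for _ in range(n)]
        y = [xi if rng.random() < 0.8 else 1.0 - xi for xi in x]
        naive.append(statistics.fmean(y))
        adjusted.append(control_variate_mean(y, x, x_mean=0.5))
    return statistics.pvariance(naive), statistics.pvariance(adjusted)


if __name__ == "__main__":
    naive_var, adjusted_var = simulate()
    print(f"naive variance:    {naive_var:.5f}")
    print(f"adjusted variance: {adjusted_var:.5f}")
```

In this toy setting the adjusted estimator should show visibly lower variance across repeated small samples, which is the regime the abstract highlights. How the paper handles an unknown control mean or model-judged comparisons is not covered here.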


Efficient Variance-reduced Estimation from Generative EHR Models: The SCOPE and REACH Estimators

arXiv.org Machine Learning

Generative models trained using self-supervision of tokenized electronic health record (EHR) timelines show promise for clinical outcome prediction. This is typically done using Monte Carlo simulation for future patient trajectories. However, existing approaches suffer from three key limitations: sparse estimate distributions that poorly differentiate patient risk levels, extreme computational costs, and high sampling variance. We propose two new estimators: the Sum of Conditional Outcome Probability Estimator (SCOPE) and Risk Estimation from Anticipated Conditional Hazards (REACH), that leverage next-token probability distributions discarded by standard Monte Carlo. We prove both estimators are unbiased and that REACH guarantees variance reduction over Monte Carlo sampling for any model and outcome. Empirically, on hospital mortality prediction in MIMIC-IV using the ETHOS-ARES framework, SCOPE and REACH match 100-sample Monte Carlo performance using only 10-11 samples (95% CI: [9,11]), representing a ~10x reduction in inference cost without degrading calibration. For ICU admission prediction, efficiency gains are more modest (~1.2x), which we attribute to the outcome's lower "spontaneity," a property we characterize theoretically and empirically. These methods substantially improve the feasibility of deploying generative EHR models in resource-constrained clinical settings.


Universal One-third Time Scaling in Learning Peaked Distributions

arXiv.org Machine Learning

Training large language models (LLMs) is computationally expensive, partly because the loss exhibits slow power-law convergence whose origin remains debatable. Through systematic analysis of toy models and empirical evaluation of LLMs, we show that this behavior can arise intrinsically from the use of softmax and cross-entropy. When learning peaked probability distributions, e.g., next-token distributions, these components yield power-law vanishing losses and gradients, creating a fundamental optimization bottleneck. This ultimately leads to power-law time scaling of the loss with a universal exponent of $1/3$. Our results provide a mechanistic explanation for observed neural scaling and suggest new directions for improving LLM training efficiency.


Anytime Pretraining: Horizon-Free Learning-Rate Schedules with Weight Averaging

arXiv.org Machine Learning

Large language models are increasingly trained in continual or open-ended settings, where the total training horizon is not known in advance. Despite this, most existing pretraining recipes are not anytime: they rely on horizon-dependent learning rate schedules and extensive tuning under a fixed compute budget. In this work, we provide a theoretical analysis demonstrating the existence of anytime learning schedules for overparameterized linear regression, and we highlight the central role of weight averaging - also known as model merging - in achieving the minimax convergence rates of stochastic gradient descent. We show that these anytime schedules polynomially decay with time, with the decay rate determined by the source and capacity conditions of the problem. Empirically, we evaluate 150M and 300M parameter language models trained at 1-32x Chinchilla scale, comparing constant learning rates with weight averaging and $1/\sqrt{t}$ schedules with weight averaging against a well-tuned cosine schedule. Across the full training range, the anytime schedules achieve comparable final loss to cosine decay. Taken together, our results suggest that weight averaging combined with simple, horizon-free step sizes offers a practical and effective anytime alternative to cosine learning rate schedules for large language model pretraining.
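
As a rough illustration of what a horizon-free schedule with weight averaging looks like, the sketch below runs SGD on a one-dimensional noisy linear regression with a $1/\sqrt{t}$ step size and a uniform running average of the iterates. It is a toy under stated assumptions, not the paper's training recipe: the function names, the base learning rate, the data model, and the choice of uniform averaging are all hypothetical. Note that the step size depends only on the current step, never on the total number of steps.

```python
import math
import random


def inv_sqrt_lr(step, base_lr=0.1):
    """Horizon-free step size: depends only on the current step (1-indexed)."""
    return base_lr / math.sqrt(step)


def sgd_with_averaging(num_steps, base_lr=0.1, seed=0):
    """SGD on y = true_w * x + noise, returning the last and averaged iterate."""
    rng = random.Random(seed)
    true_w = 2.0  # hypothetical ground-truth weight
    w = 0.0       # current iterate
    w_avg = 0.0   # uniform running average of iterates
    for t in range(1, num_steps + 1):
        x = rng.gauss(0.0, 1.0)
        y = true_w * x + rng.gauss(0.0, 0.5)
        grad = (w * x - y) * x  # gradient of 0.5 * (w * x - y) ** 2
        w -= inv_sqrt_lr(t, base_lr) * grad
        w_avg += (w - w_avg) / t  # incremental mean, valid at any stopping time
    return w, w_avg


if __name__ == "__main__":
    for horizon in (100, 1000, 5000):
        last, avg = sgd_with_averaging(horizon)
        print(horizon, round(last, 3), round(avg, 3))
```

Because neither the schedule nor the average needs the horizon, training can be stopped at any step and the averaged weight read off directly, which is the "anytime" property the abstract describes.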


Self-Hinting Language Models Enhance Reinforcement Learning

arXiv.org Machine Learning

Group Relative Policy Optimization (GRPO) has recently emerged as a practical recipe for aligning large language models with verifiable objectives. However, under sparse terminal rewards, GRPO often stalls because rollouts within a group frequently receive identical rewards, causing relative advantages to collapse and updates to vanish. We propose self-hint aligned GRPO with privileged supervision (SAGE), an on-policy reinforcement learning framework that injects privileged hints during training to reshape the rollout distribution under the same terminal verifier reward. For each prompt $x$, the model samples a compact hint $h$ (e.g., a plan or decomposition) and then generates a solution $τ$ conditioned on $(x,h)$. Crucially, the task reward $R(x,τ)$ is unchanged; hints only increase within-group outcome diversity under finite sampling, preventing GRPO advantages from collapsing under sparse rewards. At test time, we set $h=\varnothing$ and deploy the no-hint policy without any privileged information. Moreover, sampling diverse self-hints serves as an adaptive curriculum that tracks the learner's bottlenecks more effectively than fixed hints from an initial policy or a stronger external model. Experiments over 6 benchmarks with 3 LLMs show that SAGE consistently outperforms GRPO, on average +2.0 on Llama-3.2-3B-Instruct, +1.2 on Qwen2.5-7B-Instruct and +1.3 on Qwen3-4B-Instruct. The code is available at https://github.com/BaohaoLiao/SAGE.
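
The advantage collapse described in this abstract can be seen in a few lines. The sketch below uses the commonly described GRPO normalization (reward minus group mean, divided by group standard deviation); the function name and the `eps` stabilizer are assumptions, and the hint-sampling mechanism of SAGE is not implemented here.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt.

    eps is a hypothetical stabilizer that avoids division by zero.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Sparse terminal reward: every rollout fails, so all advantages
    # are zero and the policy-gradient update vanishes.
    print(group_relative_advantages([0, 0, 0, 0]))
    # One rollout succeeds: advantages become informative again.
    print(group_relative_advantages([1, 0, 0, 0]))
```

Any intervention that makes outcomes within a group differ, such as the sampled hints the abstract proposes, restores a non-zero learning signal under this normalization.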


ChatGPT is back up after an outage disrupted use this afternoon

Engadget

Claude also went down for many users earlier today. If you had trouble using ChatGPT today, you aren't alone. The AI chatbot experienced a partial outage for many users this afternoon, with Down Detector reports reaching more than 12,000 at the peak of the issue. OpenAI issued a status update shortly after, noting that elevated error rates were occurring for ChatGPT and Platform users. That problem was marked as resolved at 5:14PM ET. While the initial outage may be repaired, OpenAI does still have an active status alert up. But the end may also be in sight for that final issue, because the current statement from the company is "We have applied the mitigation and are monitoring the recovery."


ChatGPT is down for many users this afternoon

Engadget

"We are working on implementing a mitigation," OpenAI said in a status update. If you've had trouble using ChatGPT today, you aren't alone. The AI chatbot is experiencing a partial outage for many users this afternoon. Down Detector reports of issues with the service leapt from almost nothing to more than 12,000 around 3PM ET. OpenAI issued a status update noting that elevated error rates are occurring for ChatGPT and Platform users.


Apple just made Xcode better for vibe coding

Engadget

Xcode 26.3 brings more robust support for programming agents like Claude. Apple has just released Xcode 26.3, and it's a big step forward in terms of the company's support of coding agents. The new release expands on the AI features the company introduced with Xcode 26 at WWDC 2025 to give systems like Claude and ChatGPT more robust access to its in-house IDE. With the update, Apple says Claude and OpenAI's Codex can search documentation, explore file structures, update project settings, and verify their work visually by capturing Xcode Previews and iterating through builds and fixes. This is in contrast to earlier releases of Xcode 26 where those same agents were limited in what they could see of a developer's Xcode environment, restricting their utility.