Large Language Model
OpenAI makes breakthrough on 80-year-old maths problem
If you take a sheet of paper and add some dots, how many pairs can be the same distance apart? If you take a sheet of paper and add some dots, how many pairs can be the same distance apart? OpenAI has claimed a further advance in AI reasoning after its technology successfully tackled an 80-year-old maths problem. The company behind ChatGPT said it had made a breakthrough with a challenge first posed by Hungarian mathematician Paul Erdős in 1946: the planar unit distance problem. The question posed by Erdős is simple to explain.
Anthropic's Code with Claude showed off coding's future--whether you like it or not
Anthropic's Code with Claude showed off coding's future--whether you like it or not As tools like Claude Code get better, more and more developers are happy to hand off coding tasks to them. The way software gets built has changed for good. The vibes were strong at Code with Claude, Anthropic's two-day event for software developers in London that kicked off on May 19, the same day as Google's I/O in Palo Alto. "Who here has shipped a pull request in the last week that was completely written by Claude?" Jeremy Hadfield, an engineer at Anthropic, asked from the main stage. Almost half the people in the packed room--many sitting with laptops on their knees, coding or prompting as they watched the talks--raised their hands. Pull requests are fixes or updates to existing software that are submitted for review before they go live.
The Download: online safety's future and climate tech's big pivot
The Download: online safety's future and climate tech's big pivot Plus: SpaceX has filed for an IPO expected to be the largest ever. For months, the Trump administration has been going after researchers who study and try to counter hate speech, harassment, propaganda, and disinformation online. Now, some of those researchers are fighting back. In a new lawsuit, they're seeking to strike down a visa restriction policy against "foreign officials and other persons" announced last year by US Secretary of State Marco Rubio. They say the policy violates the speech and due process rights of foreign-born workers whose "work supports greater moderation of content on the [tech] platforms. Find out how the case could impact online safety and free speech .
Conformal Selective Acting: Anytime-Valid Risk Control for RLVR-Trained LLMs
Khosravi, Hamed, Huo, Xiaoming
A local specialist LLM, fine-tuned with reinforcement learning from verifiable rewards (RLVR) on operator-local data, is installed in a regulated organization with per-deployment error budget $α$. The operator needs a safety certificate for this deployment's stream at every round: no pooling across deployments, no waiting for a long-run average. Existing wrappers cannot deliver this on adaptive, online-updated streams: offline conformal-risk methods require exchangeability; online-conformal methods bound only long-run averages; non-exchangeable extensions are marginally valid; and the closest anytime wrapper, A-RCPS, controls marginal rather than selective risk. Using a (test statistic, validity guarantee, deployment rule) framework, we identify one empty cell forced by deployment requirements: e-process per threshold, selective risk, anytime-pathwise validity, max-certified-threshold rule. Conformal Selective Acting (CSA) fills it as a per-round wrapper maintaining a Ville-type e-process per threshold on a Bonferroni grid, evaluated against the RLVR filtration. Under predictable updates and isotonic-calibrated monotone risk we prove (i) an anytime-pathwise selective-risk bound $R_T^{\mathrm{act}}\leα+O(N_T^{-1/2})$, (ii) rate-optimal certification matching $Θ(\barη^{-2}\log(1/δ))$, and (iii) a horizon-independent release-rate gap. Across eight specialist benchmarks ($480$ streams), sixteen adversarial distribution-shift cells ($160$ streams), and five live Expert-Iteration RLVR cells with online LoRA over four base models in three architecture families ($10{,}300$ rounds), CSA is the only method among ten compared that satisfies pathwise validity and non-refusing deployment on every cell. We do not propose a new LLM, training algorithm, or policy class; CSA is the deployment-side complement, orthogonal to the model, for operators who cannot use a frontier API.
Interpretable Discriminative Text Representations via Agreement and Label Disentanglement
Wang, Tong, Xu, Yiqing, Yang, Leo Yang
Interpretable text representations should expose coordinates that are not only predictive, but also meaningful enough for independent auditors to apply. Existing discriminative representations often use anonymous embedding directions, while concept-bottleneck and LLM-assisted methods attach natural-language names to features without ensuring that those definitions are reproducible or distinct from the target label. We propose an operational criterion for interpretable discriminative text representations: each coordinate should satisfy conceptual clarity, measured by chance-adjusted agreement between independent annotators applying the feature definition, and label disentanglement, meaning the feature should not merely paraphrase the prediction target. We instantiate this criterion in LLM-assisted Feature Discovery (LFD), an iterative method that proposes lexical and semantic features from contrastive outcome-opposed text pairs, screens candidates using cross-LLM Cohen's $κ$, and selects features by residual held-out predictive gain. A stylized analysis connects the $κ$ screen to a per-feature annotation-noise bound, formalizing agreement as a reliability check. Across ten text-classification tasks spanning seven corpora, LFD matches the predictive performance of a strong text bottleneck baseline while producing substantially clearer and less label-entangled features. Human audits with 232 raters show that LFD features achieve higher human--human and human--LLM agreement than baseline concepts, and raters consistently judge them as less label-leaking. These results suggest that agreement-tested, label-disentangled coordinates provide a practical auditability standard for interpretable text classification.
Federated LoRA Fine-Tuning for LLMs via Collaborative Alignment
He, Shuaida, Chen, Liwen, Feng, Long
Low-rank adaptation (LoRA) has emerged as a powerful tool for parameter-efficient fine-tuning of large language models (LLMs). This paper studies LoRA under a federated learning setting, enabling collaborative fine-tuning across clients while preserving parameter efficiency. We focus on a highly heterogeneous regime in which clients share only partial structure and a substantial subset may be contaminated. We propose Collaborative Low-rank Alignment and Identifiable Recovery (CLAIR), a contamination-aware framework that relies only on preliminary local estimators. Its formulation applies broadly, from linear regression to neural network and LLM modules, whenever local adaptation can be represented by matrix-valued updates. CLAIR recovers the shared LoRA subspace and detects contaminated clients via a structured low-rank plus block-sparse decomposition. We prove exact recovery of the shared LoRA subspace in the noiseless case, stable recovery under preliminary estimation error, and consistent collaborative-set recovery under mild separation conditions. We further quantify the gain from CLAIR refinement: it reduces off-subspace estimation error through cross-client averaging while preserving client-specific variation within the shared LoRA subspace, thus improves over local fine-tuning whenever this oracle gain outweighs the costs of subspace estimation and benign-client heterogeneity. Empirically, we demonstrate the benefits of CLAIR by fine-tuning a Transformer architecture on a text-copying task. The results show accurate contamination detection and improved benign-client performance compared with local fine-tuning and non-robust federated averaging.
Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate
Kalra, Dayal Singh, Barkeshli, Maissam
Hyperparameter transfer allows extrapolating optimal optimization hyperparameters from small to large scales, making it critical for training large language models (LLMs). This is done either by fitting a scaling law to the hyperparameters or by a judicious choice of parameterization, such as Maximal Update ($μ$P), that renders optimal hyperparameters approximately scale invariant. In this paper, we first develop a framework to quantify hyperparameter transfer through three metrics: (1) the quality of the scaling law fit, (2) the robustness to extrapolation errors, and (3) the asymptotic loss penalty due to choice of parameterization. Next, we investigate through a comprehensive series of ablations why $μ$P appears to offer high-quality learning rate transfer relative to standard parameterization (SP), as existing theory is inadequate. We find that the overwhelming benefit of $μ$P relative to SP when training with AdamW arises simply from maximizing the learning rate of the embedding layer. In SP, the embedding layer learning rate acts as a bottleneck that induces training instabilities; increasing it by a factor of width to match $μ$P dramatically smooths out training while improving hyperparameter transfer. We also find that weight decay improves the scaling law fits, while, in the fixed token-per-parameter setting, it hurts the robustness of the extrapolation.
ScheduleFree+: Scaling Learning-Rate-Free & Schedule-Free Learning to Large Language Models
Schedule-Free Learning has shown promise as a practical anytime training method for machine learning, showing success across dozens of standard benchmark problems. However, strong performance for LLM training has only been demonstrated at small scales. We identify a number of fixes necessary to scale up Schedule-Free Learning to larger batch sizes and model sizes, and present a learning-rate-free and schedule-free method (ScheduleFree+) for training large language models which greatly outperforms Warmup-Stable-Decay (WSD) schedules. We also demonstrate that Schedule-Free Learning is most effective for long duration training, and at 1000 tokens per parameter, it outperforms SOTA schedules by 31%. Schedule-Free Learning provides a theoretical foundation for the use of model averaging and checkpoint merging during pretraining.
HalluWorld: A Controlled Benchmark for Hallucination via Reference World Models
Liu, Emmy, Gangal, Varun, Yu, Michael, Tao, Zhuofu, Singh, Karan, Kumar, Sachin, Feng, Steven Y.
Hallucination remains a central failure mode of large language models, but existing benchmarks operationalize it inconsistently across tasks such as summarization, question answering, retrieval-augmented generation, and agentic interaction. This fragmentation makes it unclear whether a mitigation that works in one setting actually reduces hallucinations across contexts. Current hallucination benchmarks either require human annotation and fixed references that may eventually be memorized, or rely on naturalistic observations often recorded in settings that are difficult to reproduce or test systematically. To enable further research on the root causes of hallucination, we introduce HALLUWORLD, an extensible benchmark framework grounded in an explicit reference-world formulation: a model hallucinates when it produces an observable claim that is false with respect to this reference world. Building on this view, we construct a family of synthetic and semi-synthetic benchmark environments in which the reference world is fully specified, the model's observable view is controlled, and hallucination labels can be generated automatically by construction. HALLUWORLD spans multiple settings that are classically representative for AI, i.e., gridworlds, chess, and realistic terminal tasks. This enables controlled variation of key factors such as world complexity, observability, temporal change, and source-conflict policy, allowing us to disentangle hallucinations into more fine-grained error categories. We evaluate frontier and open-weight language models across these settings and find consistent patterns across domains: perceptual hallucination on directly observed information is near-solved for frontier models, while multi-step state tracking and causal forward simulation are still difficult for frontier models, and are not generally solved by extended thinking.
Google's Android XR smart glasses hope to succeed where AI-first wearables have failed
Gear Wearables Google's Android XR smart glasses hope to succeed where AI-first wearables have failed The audio-only frames pair with Android and iOS so a Gemini agent can run errands on your phone while you stay heads-up. More information Adding us as a Preferred Source in Google by using this link indicates that you would like to see more of our content in Google News results. We may earn revenue from the products available on this page and participate in affiliate programs. Google put AI on people's faces more than a decade ago with its Google Glass wearable. It was designed to put a computer directly on your face, but the world (and to some extent, the hardware) wasn't quite ready for that yet.