Large Language Model
Elon Musk said control of OpenAI should go to his children, Sam Altman tells jury
Elon Musk tried to take control of OpenAI, even suggesting it could pass to his children when he dies, Sam Altman said on Tuesday. Altman is co-founder and chief executive of the artificial intelligence (AI) company behind ChatGPT. He is being sued by Musk, who accuses him of having looted a charity, given that OpenAI began as a non-profit. Appearing before a federal jury in Oakland, California, Altman said Musk not only backed the idea of OpenAI becoming a for-profit business, he wanted control of it for the long run. A particularly hair-raising moment was when my cofounders asked, 'If you have control, what happens when you die?'
AI voice chat sucks. This startup thinks it's cracked it
PCWorld reports that Thinking Machines, founded by ex-OpenAI executive Mira Murati, has developed new AI voice interaction models that enable real-time conversations with interruptions and visual cue recognition. The technology uses a dual-AI system with a fast interaction model and a background model for complex tasks, employing a multi-stream, micro-turn approach. This advancement could transform AI voice chat from current CB radio-style turn-taking into natural human-like conversations, though the technology remains in the research phase. Voice chatting with today's AI can feel as stilted as an old-school CB radio exchange, where you're forced to take turns as you talk. "Hey ChatGPT, let's talk about the movies!
ChatGPT is $20/month, but one AI platform gives you GPT, Claude, and Gemini for a year for $30
You can get access to ChatGPT, Claude, and Gemini through ChatOn AI Assistant for just $30. Juggling AI subscriptions can get expensive fast. A single AI subscription can cost hundreds per year, and using multiple tools only drives the price higher. That's part of why ChatOn AI Assistant has been gaining attention recently.
Daybreak is OpenAI's response to Anthropic's Claude Mythos
OpenAI has just launched Daybreak, a cybersecurity initiative that's clearly the company's competitor to Anthropic's Project Glasswing. If you'll recall, Glasswing uses Anthropic's unreleased AI model, Claude Mythos Preview, to provide its clients' cyber defense needs. It's been promising, so far: Mozilla revealed in April that Mythos helped it find and patch 271 vulnerabilities in the latest release of the Firefox browser. OpenAI says Daybreak uses its various AI models, including its specialized security agent Codex. In its announcement, the company explained that Daybreak is built around the premise that cyber defense should be built into software from the start and not just revolve around finding and fixing vulnerabilities.
Generative Synthetic Data for Causal Inference: Pitfalls, Remedies, and Opportunities
Synthetic tabular data are often evaluated by distributional similarity, privacy distance, or train-on-synthetic-test-on-real predictive performance, but these criteria do not ensure validity for causal inference. We show that fully generative tabular synthesizers, including GAN- and LLM-based models, can preserve predictive utility while distorting average treatment effect (ATE) estimates. The failure is structural: ATE preservation requires both a realistic covariate law and an accurate treatment-effect contrast, whereas prediction loss penalizes treatment-effect error only through an overlap-weighted term. We formalize this mismatch through sensitivity and loss-decomposition results, and identify an analogous decomposition in block-level next-token prediction under log loss. Motivated by the tabular causal analysis, we propose a hybrid synthetic-data framework that generates covariates while modeling treatment and outcome mechanisms separately, allowing causal-purpose treatment assignment such as randomized synthetic assignment. We evaluate this framework in three settings: ATE preservation under fully generative versus hybrid synthesis, targeted augmentation for practical positivity problems, and synthetic simulation engines for comparing OR, IPW, AIPW, and TMLE before real-data analysis. Across synthetic and ACTG experiments, hybrid synthesis improves causal fidelity relative to fully generative baselines; LLM-based hybrid synthesis is often more faithful than CTGAN for ATE preservation and finite-sample estimator benchmarking.
Addressing Performance Saturation for LLM RL via Precise Entropy Curve Control
Li, Bolian, Wang, Yifan, Ding, Yi, Lochab, Anamika, Grama, Ananth, Zhang, Ruqi
Reinforcement learning (RL) has enabled complex reasoning abilities in large language models (LLMs). However, most RL algorithms suffer from performance saturation, preventing continued gains as RL training scales. This problem can be characterized by the collapse of entropy, a key diagnostic for exploration in RL. Existing attempts focus on preventing entropy collapse through regularization or clipping. However, their resulting entropy curves often exhibit instability in the long term, which hinders performance gains. In this paper, we introduce Entrocraft, a simple rejection-sampling approach that realizes a user-customized entropy schedule by biasing the advantage distributions. Entrocraft requires no objective regularization and is advantage-estimator-agnostic. Theoretically, we relate per-step entropy change to the advantage distribution under minimal assumptions. This explains the behavior of existing RL and entropy-preserving methods. Entrocraft also enables a systematic study of entropy schedules, which reveals that linear annealing, which starts high and decays to a slightly lower target, performs best. Empirically, Entrocraft addresses performance saturation, significantly improving generalization, output diversity, and long-term training. It enables a 4B model to outperform an 8B baseline, sustains improvement for up to 4x longer before plateauing, and raises pass@K by 50% over the baseline.
Asymptotically Log-Optimal Bayes-Assisted Confidence Sequences for Bounded Means
Kilian, Valentin, Cortinovis, Stefano, Caron, François
Confidence sequences based on test martingales provide time-uniform uncertainty quantification for the mean of bounded IID observations without parametric distributional assumptions. Their practical efficiency, however, depends strongly on the choice of martingale updates, and many existing constructions do not exploit prior information about plausible data-generating distributions or mean values. We propose a Bayes-assisted framework that uses a Bayesian working predictive model to adaptively construct confidence sequences. For each candidate mean and time point, the predictive distribution selects, among valid one-step martingale factors, the update maximising predictive expected log-growth; validity is therefore preserved even when the prior or working model is misspecified. We prove that if the predictive distribution is Wasserstein-consistent, the resulting procedure is asymptotically log-optimal, matching the per-sample log-growth of an oracle procedure with access to the true distribution. We instantiate the framework using robust predictives based on Dirichlet-process mixtures and Bayesian exponentially tilted empirical likelihood. Experiments on synthetic data, sequential best-arm identification for LLM evaluation, and prediction-powered inference show that informative priors can substantially reduce confidence-sequence width and sampling effort while retaining anytime-valid coverage.
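To make the test-martingale idea in this abstract concrete, here is a minimal Python sketch of a betting-style confidence sequence for observations in [0, 1]. It is not the paper's Bayes-assisted procedure: the bet size is fixed rather than chosen by maximising predictive expected log-growth, and the function name `betting_confidence_set`, the grid of candidate means, and the `bet` parameter are all assumptions made for illustration.

```python
def betting_confidence_set(xs, alpha=0.05, grid_size=100, bet=0.5):
    """Candidate means in (0, 1) not rejected at any time by a betting martingale.

    Assumes every x in xs lies in [0, 1] and 0 < bet < 1. For each candidate
    mean m, two wealth processes bet that the true mean is above or below m;
    their average is a nonnegative martingale when m is the true mean, so by
    Ville's inequality it reaches 1/alpha with probability at most alpha.
    """
    threshold = 1.0 / alpha
    kept = []
    for j in range(grid_size):
        m = (j + 0.5) / grid_size      # candidate mean strictly inside (0, 1)
        lam_up = bet / m               # stake on "true mean exceeds m"
        lam_down = bet / (1.0 - m)     # stake on "true mean is below m"
        wealth_up = wealth_down = 1.0
        rejected = False
        for x in xs:
            # Each factor is at least 1 - bet > 0, so wealth stays positive.
            wealth_up *= 1.0 + lam_up * (x - m)
            wealth_down *= 1.0 - lam_down * (x - m)
            if 0.5 * (wealth_up + wealth_down) >= threshold:
                rejected = True        # rejected once, rejected for all later times
                break
        if not rejected:
            kept.append(m)
    return kept


if __name__ == "__main__":
    print(betting_confidence_set([1.0] * 50)[:3])
```

Because a candidate mean is dropped the first time its wealth crosses the threshold, the surviving set is valid at every sample size simultaneously. A smarter choice of stake per candidate and time step, which is where the abstract's predictive model enters, would shrink the set faster.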
A Semantic-Sampling Framework for Evaluating Calibration in Open-Ended Question Answering
Wang, Zhanliang, Xiao, Jiancong, Jin, Ruochen, Yang, Shu, Hou, Bojian, Shen, Li
Calibration measures whether a model's predicted confidence aligns with its empirical accuracy, and is central to the reliable deployment of large language models (LLMs) in high-stakes domains such as medicine and law. While much recent work focuses on improving LLM calibration, the equally important question of how to evaluate it in realistic settings remains underdeveloped. Open-ended question answering (QA), the most common deployment setting for modern LLMs, is where existing evaluation methods fall short: logit-based metrics need restricted output formats and internal probabilities; verbalized confidence is self-reported and often overconfident; and sampling-based methods rely on task-specific extraction rules without a clear finite-sample target. We introduce Sem-ECE (Semantic-Sampling Expected Calibration Error), a calibration evaluation framework for open-ended QA that samples answers from the model, groups them into semantic classes, and uses the resulting frequencies as confidence. We study two estimators within this framework: Sem$_1$-ECE, the same-sample self-consistency score, and Sem$_2$-ECE, a held-out variant that separates answer selection from confidence evaluation. We prove both are asymptotically unbiased, and further show that they agree on easy questions but diverge on hard ones with Sem$_2$ achieving strictly smaller calibration error, so their gap also serves as a diagnostic for question difficulty. Experiments on three open-ended QA benchmarks across five leading commercial LLMs match our theoretical predictions and show that Sem-ECE outperforms verbalized confidence and existing sampling-based methods, while complementing logit-based evaluation when internal probabilities are unavailable.
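The sampling-and-grouping recipe described in this abstract can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the authors' implementation: `normalize` is a hypothetical stand-in for semantic grouping (it only matches answers up to case and whitespace), `held_out_consistency` is one plausible reading of the held-out variant, and the binned ECE is the standard textbook form.

```python
from collections import Counter


def normalize(answer):
    # Hypothetical stand-in for semantic grouping: case/whitespace-insensitive match.
    return " ".join(answer.lower().split())


def self_consistency(samples):
    """Same-sample score: majority class and its frequency among the samples."""
    counts = Counter(normalize(s) for s in samples)
    label, n = counts.most_common(1)[0]
    return label, n / len(samples)


def held_out_consistency(select_samples, eval_samples):
    """Held-out score (assumed reading): pick the answer on one split,
    measure its frequency on a disjoint split."""
    label, _ = self_consistency(select_samples)
    hits = sum(normalize(s) == label for s in eval_samples)
    return label, hits / len(eval_samples)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: size-weighted mean of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(y for _, y in b) / len(b)
        ece += len(b) / total * abs(acc - avg_conf)
    return ece


if __name__ == "__main__":
    print(self_consistency(["Paris", "paris ", "Lyon", "Paris"]))
```

In practice, each question would contribute one confidence (from either scorer) and one correctness flag for the selected answer, and the ECE would be computed over the whole benchmark.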
Sink vs. diagonal patterns as mechanisms for attention switch and oversmoothing prevention
Súkeník, Peter, Amado, Cristina López, Lampert, Christoph H., Mondelli, Marco
This paper studies the role of sinks and diagonal patterns as attention switch and anti-oversmoothing mechanisms. We analyze geometric conditions under which sinks can be represented, showing a necessary alignment between the embedding of the sink and all other embeddings. Next, we refine the current understanding of the role of sinks in oversmoothing prevention: we specify the conditions under which dense attention provably smooths more than sparse attention, and empirically verify that such conditions are often satisfied in practice. We further prove an equivalence between sinks and hard attention switch, in which the output of the attention is identically 0. Finally, we relax the hard attention switch by allowing token self-communication: we provide a quantitative comparison of the costs of representing sinks vs. diagonal patterns, showing why sinks are favored in pretrained transformers. The introduction and analysis of diagonal patterns and the generalization of the attention switch close the gap between what oversmoothing prevention requires and what sinks provide, while also establishing when and why attention layers act like MLPs if token communication is not necessary.
CoDistill-GRPO: A Co-Distillation Recipe for Efficient Group Relative Policy Optimization
Kwon, Soo Min, Sun, Ziteng, Suresh, Ananda Theertha, Jain, Himanshu, Kumar, Sanjiv
Group Relative Policy Optimization (GRPO) has emerged as a powerful algorithm for improving the reasoning capabilities of language models, but often fails to improve small models due to sparse rewards on difficult tasks. Existing works mitigate this issue by leveraging a larger model, either to provide hints for rollouts or to provide dense reward signals through knowledge distillation (KD). However, this assumes the existence of such an oracle, and training one can significantly increase total training time. In this work, we propose CoDistill-GRPO, a co-distillation algorithm that simultaneously trains a large and a small model by maximizing carefully designed GRPO objectives. The two models learn from each other: the small model uses an on-policy KD reward to learn from the large model's distribution, while the large model is updated using rollouts generated by the small model with importance reweighting, reducing the computational overhead of rollout generation. We show that CoDistill-GRPO substantially improves small model performance over standard GRPO on mathematical benchmarks across both Qwen and Llama models. Specifically, with Qwen2.5-Math-1.5B, we observe an accuracy increase of over 11.6 percentage points over the base model and an additional 6.0 percentage points over GRPO on the Minerva dataset. Interestingly, the larger model (Qwen2.5-Math-7B) trained with CoDistill-GRPO nearly matches standard GRPO performance despite training on small-model rollouts. This highlights CoDistill-GRPO as a cost-effective alternative to GRPO for larger models, yielding an approximate 18% speedup, which may be of independent interest.
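The sparse-reward problem mentioned in this abstract can be seen from the group-relative advantage that standard GRPO computes. The Python sketch below shows only that baseline step, not the CoDistill-GRPO objectives. The function name, the `eps` stabiliser, and the use of the population standard deviation are assumptions for illustration, and implementations differ on these details.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise each rollout's reward against its own group.

    rewards: scalar rewards for several rollouts sampled from the same prompt.
    Returns (r - group mean) / (group std + eps) for each rollout.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # A prompt the small model never solves: every advantage is zero,
    # so this group contributes no gradient signal.
    print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))
    # One success in four rollouts gives a positive advantage to that rollout only.
    print(group_relative_advantages([1.0, 0.0, 0.0, 0.0]))
```

When a small model fails every rollout on a hard prompt, all rewards in the group are equal and the advantages vanish. A denser reward signal such as the on-policy KD reward described above is one way to address this.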