Natural Policy Gradient as Doubly Smoothed Policy Iteration: A Bellman-Operator Framework

arXiv.org Machine Learning

In this work, we show that natural policy gradient, a core algorithm in reinforcement learning, admits an exact formulation as a smoothed and averaged form of policy iteration. Specifically, we introduce doubly smoothed policy iteration (DSPI), a Bellman-operator framework in which each policy is obtained by applying a regularized greedy step to a weighted average of past $Q$-functions. DSPI includes policy iteration, dual-averaged policy iteration, natural policy gradient, and more general policy dual averaging methods as special cases. Using only monotonicity and contraction of smoothed Bellman operators, we prove distribution-free global geometric convergence of DSPI. Consequently, standard natural policy gradient and policy dual averaging achieve an iteration complexity of $\mathcal{O}((1-\gamma)^{-1}\log((1-\gamma)^{-1}\varepsilon^{-1}))$ for computing an $\varepsilon$-optimal policy, without modifying the MDP, adding regularization beyond the mirror map inherent in the update, or using adaptive, trajectory-dependent stepsizes. For the unregularized greedy case, corresponding to dual-averaged policy iteration, we also prove finite termination. The same Bellman-operator framework further extends to discounted MDPs with linear function approximation and stochastic shortest path problems.
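As a rough illustration of the idea in this abstract (a regularized greedy step applied to accumulated past $Q$-functions), the sketch below runs a softmax update on a running sum of exactly evaluated $Q$-functions. Everything concrete here is an assumption made for illustration rather than the paper's construction: the two-state MDP, the uniform weights on past $Q$-functions, the entropy (softmax) regularizer, and the constant stepsize `eta`.

```python
import math

GAMMA = 0.9
# Hypothetical deterministic MDP with 2 states and 2 actions.
# NEXT[s][a] is the next state, REWARD[s][a] the immediate reward.
NEXT = [[0, 1], [0, 1]]
REWARD = [[0.5, 0.0], [0.0, 1.0]]

def evaluate_q(policy, sweeps=500):
    """Approximate Q^pi by iterating the policy-evaluation Bellman operator."""
    q = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(sweeps):
        v = [sum(policy[s][a] * q[s][a] for a in range(2)) for s in range(2)]
        q = [[REWARD[s][a] + GAMMA * v[NEXT[s][a]] for a in range(2)]
             for s in range(2)]
    return q

def softmax(logits):
    """Numerically stable softmax (the entropy-regularized greedy step)."""
    m = max(logits)
    weights = [math.exp(x - m) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]

def smoothed_policy_iteration(iterations=100, eta=1.0):
    """Softmax policy over the running sum of past Q-functions (assumed weights)."""
    q_sum = [[0.0, 0.0], [0.0, 0.0]]
    policy = [[0.5, 0.5], [0.5, 0.5]]
    for _ in range(iterations):
        q = evaluate_q(policy)
        q_sum = [[q_sum[s][a] + q[s][a] for a in range(2)] for s in range(2)]
        policy = [softmax([eta * x for x in q_sum[s]]) for s in range(2)]
    return policy

policy = smoothed_policy_iteration()
print(policy)
```

In this toy MDP the myopically attractive action in state 0 (reward 0.5) is suboptimal, so the sketch also shows the accumulated $Q$-values steering the policy toward the action that reaches the higher-value state.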


Price of Quality: Sufficient Conditions for Sparse Recovery using Mixed-Quality Data

arXiv.org Machine Learning

We study sparse recovery when observations come from mixed-quality sources: a small collection of high-quality measurements with small noise variance and a larger collection of lower-quality measurements with higher variance. For this heterogeneous-noise setting, we establish sample-size conditions for information-theoretic and algorithmic recovery. On the information-theoretic side, we show that it is sufficient for $(n_1, n_2)$ to satisfy a linear trade-off defining the Price of Quality: the number of low-quality samples needed to replace one high-quality sample. In the agnostic setting, where the decoder is completely agnostic to the quality of the data, it is uniformly bounded, and in particular one high-quality sample is never worth more than two low-quality samples for this sufficient condition to hold. In the informed setting, where the decoder is informed of per-sample variances, the price of quality can grow arbitrarily large. On the algorithmic side, we analyze the LASSO in the agnostic setting and show that the recovery threshold matches the homogeneous-noise case and only depends on the average noise level, revealing a striking robustness of computational recovery to data heterogeneity. Together, these results give the first conditions for sparse recovery with mixed-quality data and expose a fundamental difference between how the information-theoretic and algorithmic thresholds adapt to changes in data quality.


What should post-training optimize? A test-time scaling law perspective

arXiv.org Machine Learning

Large language models are increasingly deployed with test-time strategies: sample $N$ responses, score them with a reward model or verifier, and return the best. This deployment rule exposes a mismatch in post-training: standard objectives optimize the mean reward of a single response, whereas best-of-$N$ performance is governed by the upper tail of the reward distribution. Recent test-time-aware objectives partly address this mismatch, but typically assume that training can use the same per-prompt rollout budget as deployment, which is impractical when post-training must cover many prompts while deployment can allocate much larger per-prompt test-time compute. We study this budget-mismatch regime, where only $m\ll N$ per-prompt rollouts are available during training but the target objective is best-of-$N$ deployment. Under structural assumptions on the reward tails, we show that the policy gradient of the best-of-$N$ objective can be approximated from a much smaller rollout group by extrapolating upper-tail statistics. This yields a family of Tail-Extrapolated estimators for best-of-$N$-oriented post-training: a simple direct estimator, Tail-Extrapolated Advantage (TEA), and a fixed-order debiased Prefix-TEA estimator based on moment cancellation. Experiments on instruction-following tasks show that TEA and Prefix-TEA improve best-of-$N$ performance across different language models, reward models and datasets under various training and test-time budget settings.
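The mismatch between mean reward and best-of-$N$ reward described in this abstract can be seen in a small exact calculation. The sketch below is not the TEA or Prefix-TEA estimator; it only compares two made-up discrete reward distributions (`steady` and `spiky`, both hypothetical) under the two objectives, using the identity $\mathbb{E}[\max] = \sum_r r\,(F(r)^N - F(r^-)^N)$ for $N$ independent draws.

```python
def mean_reward(dist):
    """Expected reward of a single response; dist is a list of (reward, prob)."""
    return sum(r * p for r, p in dist)

def expected_best_of_n(dist, n):
    """Exact expected maximum of n i.i.d. draws from a discrete distribution."""
    total, cdf_prev = 0.0, 0.0
    for r, p in sorted(dist):
        cdf = cdf_prev + p
        # P(max == r) = F(r)^n - F(r-)^n
        total += r * (cdf ** n - cdf_prev ** n)
        cdf_prev = cdf
    return total

# Hypothetical policies: one with a higher mean, one with a heavier upper tail.
steady = [(0.6, 1.0)]
spiky = [(0.0, 0.8), (1.0, 0.2)]

for name, dist in [("steady", steady), ("spiky", spiky)]:
    print(name, mean_reward(dist), expected_best_of_n(dist, 8))
```

With these assumed numbers, `steady` wins on mean reward while `spiky` wins under best-of-8, which is the kind of ranking reversal that motivates optimizing for the upper tail.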


Factual recall in linear associative memories: sharp asymptotics and mechanistic insights

arXiv.org Machine Learning

Large language models demonstrate remarkable ability in factual recall, yet the fundamental limits of storing and retrieving input-output associations with neural networks remain unclear. We study these limits in a minimal setting: a linear associative memory that maps $p$ input embeddings in $\mathbb{R}^d$ to their corresponding $d$-dimensional targets via a single layer, requiring each mapped input to be well separated from all other targets. Unlike in supervised classification, this strict separation induces $p$ constraints per association and produces strong correlations between constraints that make a direct characterisation of the storage capacity difficult. Here, we provide a precise characterisation of this capacity in the following way. We first introduce a decoupled model in which each input has its own independent set of competing outputs, and provide numerical and analytical evidence that this decoupled model is equivalent to the original model in terms of storage capacity, spectra of the learnt weights, and storage mechanism. Using tools from statistical physics, we show that the decoupled model can store up to $p_c \log p_c / d^2 = 1 / 2$ associations, and generalise the computation of $p_c$ to linear two-layer architectures. Our analysis also gives mechanistic insight into how the optimal solution improves over a naïve Hebbian learning rule: rather than boosting input-output alignments with broad fluctuations, the optimal solution raises the correct scores just above the extreme-value threshold set by the competing outputs. These findings give a sharp statistical-physics characterisation of factual storage in linear networks and provide a baseline for understanding the memory capacity of more realistic neural architectures.
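For readers unfamiliar with the baseline mentioned in this abstract, the following is a minimal sketch of a naïve Hebbian linear associative memory: the weight matrix is a sum of outer products of targets and inputs, and recall picks the target with the highest score against the mapped input. The dimensions, the Gaussian embeddings, the seed, and the argmax read-out are illustrative assumptions; the paper's exact normalisation and separation criterion may differ.

```python
import random

def hebbian_memory(inputs, targets):
    """Hebbian rule: W = sum_i y_i x_i^T (outer products of target and input)."""
    d = len(inputs[0])
    W = [[0.0] * d for _ in range(d)]
    for x, y in zip(inputs, targets):
        for r in range(d):
            for c in range(d):
                W[r][c] += y[r] * x[c]
    return W

def recall(W, x, targets):
    """Return the index of the target with the largest score y_j . (W x)."""
    mapped = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    scores = [sum(m * t for m, t in zip(mapped, y)) for y in targets]
    return max(range(len(targets)), key=scores.__getitem__)

# Assumed toy setting: p = 5 random associations in d = 128 dimensions.
rng = random.Random(0)
d, p = 128, 5
inputs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(p)]
targets = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(p)]

W = hebbian_memory(inputs, targets)
recalled = [recall(W, x, targets) for x in inputs]
print(recalled)
```

With so few associations relative to $d^2$, the correct score dominates the cross-talk from other stored pairs; the capacity question studied in the abstract concerns how large $p$ can grow before that stops being true.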


Testing for 'Bad Cholesterol' Doesn't Tell the Whole Story

WIRED

So why don't more doctors use it? For decades, assessing cholesterol risk has been built around a simple idea: Lower "bad" cholesterol, lower your chance of a heart attack. The test at the center of that approach measures how much low-density lipoprotein, or LDL cholesterol, is circulating in part of the blood. It has shaped everything from clinical guidelines to the widespread use of statins, medications that reduce LDL. Lowering LDL cholesterol reduces heart attacks, strokes, and early death.


Why coffee tastes bitter, according to molecular biology

Popular Science

There are 26 different bitter receptors in the human body. Regular coffee drinkers know there is a big difference between a brew's aroma and its taste. A cup may smell warm and full-bodied only to leave you with a lingering bitterness behind the first sip.


Three things in AI to watch, according to a Nobel-winning economist

MIT Technology Review

Daron Acemoglu is more cautious than most about predictions of a jobs apocalypse. A few months before he was awarded the Nobel Prize in economics in 2024, Daron Acemoglu published a paper that earned him few fans in Silicon Valley. Contrary to what Big Tech CEOs had been promising--an overhaul of all white-collar work--Acemoglu estimated that AI would give only a small boost to US productivity and would not obviate the need for human work. It's okay at automating certain tasks, he wrote, but some jobs will be perfectly fine. Two years later, Acemoglu's measured take has not caught on. Chatter about an AI jobs apocalypse pops up everywhere from Senator Bernie Sanders's rallies to conversations I overhear in line at the grocery store.


Contagious yawning begins in the WOMB, experts reveal - as foetuses are seen copying their mothers' mouth movements

Daily Mail - Science & tech

There's nothing quite as contagious as a yawn – and it turns out even babies in the womb aren't immune. Experts have discovered foetuses 'catch' yawns from their mothers and have been seen slowly opening and closing their mouths. As part of a study, they recorded the facial expressions of pregnant women while an ultrasound machine captured real-time images of their foetuses' faces. By comparing the two records, the researchers found that foetuses were more likely to yawn after their mothers did, with a delay of around 90 seconds. They said yawning may change the mother's breathing, chest pressure and diaphragm movements, which could provide physical cues the foetus detects.


SoftBank plans to make large-scale batteries for AI data centers

The Japan Times

SoftBank Group's mobile unit said it plans to begin large-scale battery cell manufacturing at its plant in Sakai, Osaka Prefecture, to address growing power demand for AI services. SoftBank Corp. will partner with South Korea's Cosmos Lab and DeltaX to enable mass production from the fiscal year starting next April, the company said in a statement Monday. The aim is to output energy storage systems at a scale of one gigawatt-hour per year, SoftBank said, which would make it one of the largest facilities in Japan, according to data from BloombergNEF. SoftBank could scale up to a capacity of several GWh, Bloomberg reported last month.


One-Shot Generative Flows: Existence and Obstructions

arXiv.org Machine Learning

We study dynamic measure transport for generative modeling, focusing on transport maps that connect a source measure $P_0$ to a target measure $P_1$ by integrating a velocity field of the form $v_t(x) = \mathbb{E}[\dot X_t \mid X_t = x]$, where $X_\bullet = (X_t)_t$ is a stochastic process satisfying $(X_0,X_1)\sim P_0\otimes P_1$ and $\dot X_t$ is its time derivative. We investigate when $X_\bullet$ induces a straight-line flow: a flow whose pointwise acceleration vanishes and is therefore exactly integrable by any first-order method. First, we develop multiple characterizations of straight-line flows in terms of PDEs involving the conditional statistics of the process. Then, we prove that straight-line flows under endpoint independence exhibit a sharp dichotomy. On the one hand, we construct explicit, computable straight-line processes for arbitrary Gaussian endpoints. On the other hand, we show that straight-line processes do not exist for targets with sufficiently well-separated modes. We demonstrate this obstruction through a sequence of increasingly general impossibility theorems that uncover a fundamental relationship between the sample-path behavior of a process with independent endpoints and the space-time geometry of the process's flow map. Taken together, these results provide a structural theory of when straight-line generative flows can, and cannot, exist.