

When Is Next-Token Prediction Useful? Marginalization, Ergodicity, Mixture Identifiability, Local Sufficiency, RAG, Tools, and Programming

arXiv.org Machine Learning

Language models trained on observed sequences are often described as learning the conditional distribution of the next token given previous tokens. This description is only conditionally correct. A model trained on realized token trajectories does not observe full conditional laws; it receives sampled continuations. Moreover, real language generation is conditioned not only on previous words but also on non-textual circumstances: facts, events, intentions, goals, beliefs, social context, and task-specific constraints. This paper distinguishes three objects that are often conflated: the full conditional language process conditioned on latent circumstances, the marginal text-only process obtained by integrating those circumstances out, and the model-induced distribution learned from finite observed corpora. The paper argues that interpreting model training as estimating the marginal text-only law requires strong assumptions of stationarity, representativeness, and ergodicity, assumptions that are standard in statistical estimation but problematic when applied to heterogeneous language corpora. Even if these assumptions hold, the marginal text-only law is useful only when the observed prefix is an approximately sufficient statistic for the latent circumstances relevant to continuation. In information-theoretic terms, usefulness requires that the residual conditional mutual information between the next token and the omitted circumstances, given the observed text, be small. The paper then extends this argument to heterogeneous training corpora. Finally, the paper interprets Retrieval Augmented Generation (RAG) and tool use as conditional sufficiency devices.
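The two conditions in this abstract can be written compactly. The notation below is an assumption introduced here for illustration and is not taken from the paper: $x_{1:t}$ is the observed prefix, $x_{t+1}$ the next token, and $z$ the latent circumstances.

$$
p(x_{t+1} \mid x_{1:t}) = \int p(x_{t+1} \mid x_{1:t}, z)\, p(z \mid x_{1:t})\, dz,
\qquad
I(X_{t+1}; Z \mid X_{1:t}) = \mathbb{E}_{x_{1:t},\, z}\!\left[ \mathrm{KL}\!\left( p(\cdot \mid x_{1:t}, z) \,\big\|\, p(\cdot \mid x_{1:t}) \right) \right] \approx 0 .
$$

The first identity defines the marginal text-only law by integrating the circumstances out. The second says that the prefix is approximately sufficient when conditioning on $z$ changes the next-token distribution very little on average.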


Concomitant DAG Learning: On the Roles of Noise Adaptivity, Sparsity, and Non-negativity

arXiv.org Machine Learning

Directed acyclic graphs (DAGs) constitute a central modeling tool to enable principled reasoning about cause-effect interactions in complex systems. Because the causal structure underlying a group of variables is often unknown, and interventions may be infeasible or ethically challenging to implement, DAGs must often be inferred from observational data. However, most classical structure identification approaches face two key obstacles: the combinatorial challenge of enforcing acyclicity, which severely limits scalability, and identifiability challenges arising from latent confounding or heterogeneous noise. This tutorial offers an overview of recent signal processing and optimization advances that address these issues by recasting DAG structure learning as a continuous, score-based estimation problem over adjacency matrices. We begin with a didactic introduction to structural equation models and the formulation of causal graph recovery, followed by a historical survey of score-based methods ranging from early combinatorial search schemes and greedy heuristics to modern continuous frameworks that leverage smooth characterizations of acyclicity. Building on this foundation, we describe concomitant DAG estimation methods that jointly infer sparse causal structure and exogenous noise levels, improving robustness under heteroscedasticity and distribution shifts by rendering the estimator noise adaptive. All in all, the tutorial introduces readers to challenges and opportunities for signal processing research at the crossroads of causal inference, high-dimensional statistics, and scalable graph learning, while outlining emerging directions including online, nonlinear, and neural causal discovery.
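As an illustration of what a smooth characterization of acyclicity can look like, the sketch below uses one well-known polynomial form, h(W) = tr((I + (W∘W)/d)^d) − d, which is zero exactly when the weighted adjacency matrix W describes a DAG. The abstract does not say which characterization the tutorial adopts, so this choice and the function names are assumptions made for the example.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def acyclicity_penalty(w):
    """Return h(W) = tr((I + (W*W)/d)^d) - d, with W*W the elementwise square.

    w[i][j] is the weight of edge i -> j. The value is 0 for a DAG and
    strictly positive as soon as the graph contains a directed cycle.
    """
    d = len(w)
    # M = I + (W elementwise-squared) / d has non-negative entries.
    m = [[(1.0 if i == j else 0.0) + w[i][j] ** 2 / d for j in range(d)]
         for i in range(d)]
    # Start from the identity and raise M to the d-th power.
    p = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    for _ in range(d):
        p = mat_mul(p, m)
    return sum(p[i][i] for i in range(d)) - d


# Chain 0 -> 1 -> 2 is acyclic; the two-node graph 0 <-> 1 contains a cycle.
dag = [[0.0, 2.0, 0.0], [0.0, 0.0, 3.0], [0.0, 0.0, 0.0]]
cyclic = [[0.0, 1.0], [1.0, 0.0]]
print(acyclicity_penalty(dag), acyclicity_penalty(cyclic))
```

Because h is a polynomial in the entries of W, it can be used as a differentiable constraint or penalty in a continuous optimizer instead of a combinatorial search over graphs.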


Asymmetric Scaling Laws from Sparse Features

arXiv.org Machine Learning

We introduce a model for neural scaling laws under sparse activations. In the model, test loss is often dominated by rare coordinates that are never observed in the training input. This mechanism induces a novel bottleneck absent from dense models. We derive the asymptotic population loss in both the underparameterized and overparameterized regimes, and show that the loss exhibits a double-descent peak near the interpolation threshold, where the number of parameters is just sufficient to fit the training data. The resulting loss curve is governed by two distinct scaling exponents, one for the overparameterized regime and one for the underparameterized regime, with a gap determined by the degree of sparsity. Additionally, we derive a compute-optimal frontier that favors increasing dataset size over model capacity under fixed compute budgets. We also analyze gradient-descent dynamics and identify a scaling law for the probability that fixed-step gradient descent becomes unstable. We further show that the sparsity-induced effect persists under nonlinear activations.


Training-Free Looped Transformers

arXiv.org Machine Learning

We introduce training-free looped transformers, in which a lightweight inference-time wrapper loops a contiguous mid-stack block of layers of a frozen checkpoint without additional fine-tuning, continued training, or architectural changes. Unlike prior looped transformer methods that train with the looped structure end-to-end, we retrofit recurrence onto pretrained models at test time. We show that naive block reapplication usually degrades performance, highlighting the importance of the loop application strategy. Motivated by viewing a pre-norm transformer block as a forward Euler step on an ODE, we instead treat looping as a refinement of the same approximation, replacing one large update with smaller damped sub-steps. Across seven dense, sparse MoE, and MLA+MoE model families, our method improves Qwen3-4B-Instruct by +2.64 pp on MMLU-Pro, Qwen3-30B-A3B-Instruct by +1.14 pp on CommonsenseQA, and Moonlight-16B-A3B-Instruct by +1.20 pp on OpenBookQA.
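The Euler-step view can be sketched on a toy example. A residual update x + f(x) is one forward Euler step of size 1 on dx/dt = f(x); looping the block k times with each update scaled by 1/k covers the same unit interval in k smaller steps. The uniform 1/k damping, the function names, and the toy residual branch below are assumptions for illustration; the abstract does not state the damping schedule the authors actually use.

```python
def damped_loop(x, f, k):
    """Apply the residual branch f to x k times, scaling each update by 1/k.

    With k == 1 this reduces to the ordinary residual update x + f(x).
    """
    for _ in range(k):
        update = f(x)
        x = [xi + ui / k for xi, ui in zip(x, update)]
    return x


def decay(x):
    """Toy residual branch standing in for a real attention + MLP block."""
    return [-xi for xi in x]


one_step = damped_loop([1.0], decay, 1)    # single large update
four_steps = damped_loop([1.0], decay, 4)  # four damped sub-steps
print(one_step, four_steps)
```

For this toy branch the exact solution of the ODE at time 1 is exp(-1), roughly 0.368. The single step lands on 0.0, while four damped sub-steps give about 0.316, which is the sense in which sub-stepping refines the same approximation.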


I avoid AI tools because thinking is supposed to be hard. It's what makes us human Wendy Liu

The Guardian

Long before the age of multi-billion-dollar AI companies promising to disrupt the field of software development, I was learning to code the hard way. It was the mid-2000s, and I was a child with unmonitored access to the family computer. With the help of a basic text editor program, I learned how to make websites - first basic, then increasingly complex - from scratch. The results were never as beautiful or polished as in my imagination, but I could live with that, because I was learning a craft. The painstaking hours of debugging and poring over arcane documentation for projects that I eventually abandoned never felt wasted.


There's Never Been a Better Time to Study Computer Science

The Atlantic - Technology

Even as AI progresses, coders aren't doomed. It's a weird time to be studying computer science. Recent grads have a higher unemployment rate than those in just about every other major--yes, even philosophy. The internet is littered with rants from newly minted programmers who can't find work. On one such YouTube video, the top comment reads: "Your first mistake is not being born earlier."


Selena Gomez is reportedly bringing her talents to award-winning director's new four-hour X-rated movie

FOX News

Don't let reports that Selena Gomez is going to be starring in an X-rated movie fool you. This isn't going to be a poorly produced amateur-level movie thrown together with someone who doesn't know what they're doing. It's also not a sex tape, for the folks who can't get their act together.


Artificial Intelligence glitch at Arizona college graduation sparks uproar from crowd

FOX News

President Tiffany Hernandez said the school was 'using a new AI system as our reader' and called it 'a lesson learned'. Kurt Knutsson discusses growing public backlash against AI, including former Google CEO Eric Schmidt being booed at a University of Arizona commencement. He further discusses the development of artificial eggs that could revive dead species. I'll be honest with you guys, I don't know what to make of my feelings toward artificial intelligence, because my mood on the subject changes by the day.


Online Learning-to-Defer with Varying Experts

arXiv.org Machine Learning

Learning-to-Defer (L2D) methods route each query either to a predictive model or to external experts. While existing work studies this problem in batch settings, real-world deployments require handling streaming data, changing expert availability, and shifting expert distribution. We introduce the first online L2D algorithm for multiclass classification with bandit feedback and a dynamically varying pool of experts. Our method achieves regret guarantees of $O((n+n_e)T^{2/3})$ in general and $O((n+n_e)\sqrt{T})$ under a low-noise condition, where $T$ is the time horizon, $n$ is the number of labels, and $n_e$ is the number of distinct experts observed across rounds. The analysis builds on novel $\mathcal{H}$-consistency bounds for the online framework, combined with first-order methods for online convex optimization. Experiments on synthetic and real-world datasets demonstrate that our approach effectively extends standard Learning-to-Defer to settings with varying expert availability and reliability.


Spectral bandits for smooth graph functions with applications in recommender systems

arXiv.org Machine Learning

Smooth functions on graphs have wide applications in manifold and semi-supervised learning. In this paper, we study a bandit problem where the payoffs of arms are smooth on a graph. This framework is suitable for solving online learning problems that involve graphs, such as content-based recommendation. In this problem, each recommended item is a node and its expected rating is similar to those of its neighbors. The goal is to recommend items that have high expected ratings. We aim for algorithms whose cumulative regret does not scale poorly with the number of nodes. In particular, we introduce the notion of an effective dimension, which is small in real-world graphs, and propose two algorithms for solving our problem that scale linearly in this dimension. Our experiments on a real-world content recommendation problem show that a good estimator of user preferences for thousands of items can be learned from just tens of node evaluations.
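One common way to quantify smoothness on a graph is the Laplacian quadratic form, which equals the weighted sum of squared differences across edges: small values mean neighboring nodes have similar payoffs. The sketch below computes that quantity. The abstract does not spell out the paper's definition, so the function name, the edge-list format, and the example graph are assumptions made for illustration.

```python
def smoothness(edges, f):
    """Return the sum of w * (f[i] - f[j])**2 over undirected edges (i, j, w).

    This equals f^T L f for the graph Laplacian L = D - W, so it is zero
    when f is constant on each connected component and grows as the values
    at neighboring nodes diverge.
    """
    return sum(w * (f[i] - f[j]) ** 2 for i, j, w in edges)


# Path graph 0 - 1 - 2 with unit edge weights.
path = [(0, 1, 1.0), (1, 2, 1.0)]

print(smoothness(path, [3.0, 3.0, 3.0]))  # constant ratings
print(smoothness(path, [0.0, 1.0, 2.0]))  # ratings change gradually
print(smoothness(path, [0.0, 5.0, 0.0]))  # one item rated very differently
```

In a recommendation setting the nodes would be items and the edge weights item similarities, so a low value corresponds to the assumption that similar items receive similar expected ratings.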