Course Syllabus & Notes
INFUSER: Influence-Guided Self-Evolution Improves Reasoning
Chen, Siyu, Lu, Miao, Wu, Beining, Sheen, Heejune, Zhang, Fengzhuo, Li, Shuangning, Li, Zhiyuan, Blanchet, Jose, Wang, Tianhao, Yang, Zhuoran
Self-evolution offers a scalable path to stronger reasoning: a pretrained language model improves itself with only minimal external supervision. Yet existing methods either depend on extensively curated or teacher-generated training data, or, when the generator runs unsupervised, reward it by a difficulty heuristic that need not improve the solver. We introduce INFUSER, an iterative co-training framework with two co-evolving roles: a Generator that drafts questions and reference golden answers from a pool of unstructured, automatically collected documents, and a Solver that improves by training on them. The solver is trained with standard correctness rewards against the generator-provided answers, while the generator is rewarded by an optimizer-aware influence score that measures whether each proposed question would actually improve the solver on the target distribution. Because this continuous, noisy influence score is poorly served by standard GRPO, we propose DuGRPO, a dual-normalized variant of GRPO, for generator training. Together, these turn the document pool into an adaptive curriculum that favors questions useful to the current solver, not just hard ones. On Qwen3-8B-Base, INFUSER outperforms strong self-evolution baselines with over 20% relative improvement on Olympiad and SuperGPQA benchmarks, and an 8B INFUSER co-evolving generator outperforms a frozen 32B thinking generator on math and coding. Ablations confirm each design choice is necessary, and two extensions, applying INFUSER to an instruction-finetuned anchor and augmenting it with rule-verifiable RLVR data, further demonstrate the flexibility and generalizability of the framework. Code is available at https://github.com/FFishy-git/INFUSER.
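As background for the GRPO remark above, here is a minimal sketch of the group-relative advantage that standard GRPO computes: each reward is centered and scaled by the mean and standard deviation of its sampling group. The abstract does not spell out DuGRPO's dual normalization, so this shows only the baseline, not the authors' variant. The function name `grpo_advantages`, the `eps` constant, and the example reward values are illustrative assumptions.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Center and scale rewards within one sampling group (standard GRPO baseline)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every reward in the group is identical
    return [(r - mu) / (sigma + eps) for r in rewards]


# Binary correctness rewards, as used for the solver (illustrative values)
binary = grpo_advantages([1.0, 0.0, 0.0, 1.0])

# Continuous scores that differ only slightly, standing in for a noisy influence reward
noisy = grpo_advantages([0.012, 0.010, 0.011, 0.013])
```

In this toy case the near-identical continuous scores are stretched to roughly the same scale as the binary ones, which is one plausible reading of why a continuous, noisy signal is a poor fit for plain per-group normalization.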
This Humanoid Robot Is a Terrifyingly Competent Office Intern
Flexion Robotics, a startup founded by ex-Nvidia engineers, has a clever way of training robots to do useful work. Humanoid robots might be able to run, dance, and occasionally kick people, but to become human, they're going to need to learn how to do all sorts of menial chores at work. Flexion Robotics, a Swiss startup founded by ex-Nvidia robotics researchers, thinks it has the solution. The company has developed a way to train robots to perform complex tasks that involve simple skills like opening doors, climbing stairs, and carrying boxes. The key is to teach the robots individual skills in simulation, then have a master AI algorithm determine how to use them.
Australian musicians sound warning note after Nick Cave, Kylie and many more slurped into AI training tool
Nick Cave and Kylie Minogue are among Australian artists reportedly found in datasets used to train artificial intelligence. 'It's all just rendered useless', Something For Kate's Paul Dempsey says as AI scrapes millions of songs to learn how to make music. Paul Dempsey and Bernard Fanning are among big-name Australian musicians upset that their original songs have been found in datasets used to train artificial intelligence. A dataset search tool recently created by US publication The Atlantic reveals millions of creative works have been scraped from the internet to train the disruptive technology. It includes a vast catalogue of work by Australian artists, with tunes by Kylie Minogue, Powderfinger, Nick Cave and Jimmy Barnes, and novels by Thomas Keneally and Peter Carey.
Improving Regret Approximation for Unsupervised Dynamic Environment Generation
Unsupervised Environment Design (UED) seeks to automatically generate training curricula for reinforcement learning (RL) agents, with the goal of improving generalisation and zero-shot performance. However, designing effective curricula remains a difficult problem, particularly in settings where small subsets of environment parameterisations result in significant increases in the complexity of the required policy. Current methods struggle with a difficult credit assignment problem and rely on regret approximations that fail to identify challenging levels, both of which are compounded as the size of the environment grows. We propose Dynamic Environment Generation for UED (DEGen) to enable a denser level generator reward signal, reducing the difficulty of credit assignment and allowing for UED to scale to larger environment sizes. We also introduce a new regret approximation, Maximised Negative Advantage (MNA), as a significantly improved metric to optimise for, that better identifies more challenging levels. We show empirically that MNA outperforms current regret approximations and when combined with DEGen, consistently outperforms existing methods, especially as the size of the environment grows. We have made all our code available here: https://github.
Truthful Aggregation of LLMs with an Application to Online Advertising
The next frontier of online advertising is revenue generation from LLM-generated content. We consider a setting where advertisers aim to influence the responses of an LLM, while platforms seek to maximize advertiser value and ensure user satisfaction. The challenge is that advertisers' preferences generally conflict with those of the user, and advertisers may misreport their preferences. To address this, we introduce MOSAIC, an auction mechanism that ensures that truthful reporting is a dominant strategy for advertisers and that aligns the utility of each advertiser with their contribution to social welfare. Importantly, the mechanism operates without LLM fine-tuning or access to model weights and provably converges to the output of the optimally fine-tuned LLM as computational resources increase. Additionally, it can incorporate contextual information about advertisers, which significantly improves social welfare. Via experiments with publicly available LLMs, we show that MOSAIC leads to high advertiser value and platform revenue with low computational costs. While our motivating application is online advertising, our mechanism can be applied in any setting with monetary transfers, making it a general-purpose solution for truthfully aggregating the preferences of self-interested agents over LLM-generated replies.
ATMOSSCI-BENCH: Evaluating the Recent Advances of Large Language Models for Atmospheric Science
The rapid advancements in large language models (LLMs), particularly in their reasoning capabilities, hold transformative potential for addressing complex challenges and boosting scientific discovery in atmospheric science. However, leveraging LLMs effectively in this domain requires a robust and comprehensive evaluation benchmark. Toward this end, we present ATMOSSCI-BENCH, a novel benchmark designed to systematically assess LLM performance across five core categories of atmospheric science problems: hydrology, atmospheric dynamics, atmospheric physics, geophysics, and physical oceanography. ATMOSSCI-BENCH features a dual-format design comprising both multiple-choice questions (MCQs) and open-ended questions (OEQs), enabling scalable automated evaluation alongside deeper analysis of conceptual understanding. We employ a template-based MCQ generation framework to create diverse, graduate-level problems with symbolic perturbation, while OEQs are used to probe open-ended reasoning. We conduct a comprehensive evaluation of representative LLMs, categorized into four groups: instruction-tuned models, advanced reasoning models, math-augmented models, and domain-specific climate models. Our analysis provides some interesting insights into the reasoning and problem-solving capabilities of LLMs in atmospheric science. We believe ATMOSSCI-BENCH can serve as a critical step toward advancing LLM applications in climate services by offering a standard and rigorous evaluation framework.
Do esports students do more than play games?
Playing video games in college may seem unusual, but for many teenagers across the country, it could lead to professional careers. Students at Central Bedfordshire College have just finished their first year of the Level 3 Pearson BTEC in esports, the first time the college has offered such a course. While gaming is a key part of the learning, students also study a broad range of modules designed to prepare them for work both inside and outside competitive gaming. These can include psychology to understand how the brain reacts under pressure, alongside nutrition and fitness to ensure they have the energy to compete effectively.
Bandit Guided Submodular Curriculum for Adaptive Subset Selection
Traditional curriculum learning proceeds from easy to hard samples, yet defining a reliable notion of difficulty remains elusive. Prior work has used submodular functions to induce difficulty scores in curriculum learning. We reinterpret adaptive subset selection and formulate it as a multi-armed bandit problem, where each arm corresponds to a submodular function guiding sample selection. We introduce ONLINESUBMOD, a novel online greedy policy that optimizes a utility-driven reward and provably achieves no-regret performance under various sampling regimes. Empirically, ONLINESUBMOD outperforms both traditional curriculum learning and bi-level optimization approaches across vision and language datasets, showing superior accuracy-efficiency tradeoffs. More broadly, we show that validation-driven reward metrics offer a principled way to guide the curriculum schedule. Our code is publicly available on GitHub.
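The bandit framing above can be illustrated with a toy sketch in which each arm is a set function and the chosen arm drives a standard greedy subset selection. This is not the ONLINESUBMOD policy: it swaps in a plain epsilon-greedy bandit, and every name and value here (`facility_location`, `greedy_select`, `EpsilonGreedyBandit`, `sim`, `hardness`, `clusters`, `toy_reward`) is a hypothetical chosen for illustration.

```python
import random


def facility_location(selected, sim):
    """Submodular coverage score: each item's similarity to its closest selected item."""
    if not selected:
        return 0.0
    return sum(max(sim[i][j] for j in selected) for i in range(len(sim)))


def greedy_select(f, n, k):
    """Standard greedy maximization of a set function f under a size-k budget."""
    selected = []
    for _ in range(k):
        candidates = [i for i in range(n) if i not in selected]
        best = max(candidates, key=lambda i: f(selected + [i]))
        selected.append(best)
    return selected


class EpsilonGreedyBandit:
    """Plain epsilon-greedy bandit with running-mean value estimates per arm."""

    def __init__(self, n_arms, eps=0.1, seed=0):
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms
        self.eps = eps
        self.rng = random.Random(seed)

    def pull(self):
        if self.rng.random() < self.eps:
            return self.rng.randrange(len(self.counts))  # explore
        return max(range(len(self.counts)), key=lambda a: self.values[a])  # exploit

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


# Toy data (assumed): four samples forming two similarity clusters
sim = [[1.0, 0.9, 0.1, 0.1],
       [0.9, 1.0, 0.1, 0.1],
       [0.1, 0.1, 1.0, 0.8],
       [0.1, 0.1, 0.8, 1.0]]
hardness = [0.9, 0.8, 0.2, 0.1]
clusters = [0, 0, 1, 1]

# Arm 0: facility location (diversity). Arm 1: summed hardness (modular, hence submodular).
arms = [lambda s: facility_location(s, sim),
        lambda s: sum(hardness[i] for i in s)]


def toy_reward(subset):
    """Hypothetical stand-in for a validation-driven utility: fraction of clusters covered."""
    return len({clusters[i] for i in subset}) / 2


bandit = EpsilonGreedyBandit(n_arms=len(arms), eps=0.2, seed=0)
for _ in range(20):
    arm = bandit.pull()
    subset = greedy_select(arms[arm], n=4, k=2)
    bandit.update(arm, toy_reward(subset))
```

In this toy setup the diversity arm selects one sample from each cluster and so earns the higher toy reward, which is the kind of signal a bandit can use to prefer one selection function over another as training progresses.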
Lessons Learned: A Multi-Agent Framework for Code LLMs to Learn and Improve
Recent studies show that LLMs possess different skills and specialize in different tasks. In fact, we observe that their performance varies at several levels of granularity. For example, in the code optimization task, code LLMs excel at different optimization categories and no single model dominates the others. This observation prompts the question of how one leverages multiple LLM agents to solve a coding problem without knowing their complementary strengths a priori. We argue that a team of agents can learn from each other's successes and failures so as to improve their own performance. Thus, a lesson is the knowledge produced by an agent and passed on to other agents in the collective solution process.
Pump.Fun's Bounties Platform Is a Black Hole of Circular Grifting
The crypto platform claims you can "pay anyone to do anything," from quitting a job on camera to getting a memecoin-themed tattoo. But it mostly seems like people trying to scam each other. Would you run into a crowded university lecture hall, fart into a megaphone, and bellow "fartcoin" at the top of your lungs? If so--and should you have the means to document this stunt on video, preferably capturing the audience's reaction--you may claim a reward of approximately $1,000. The money, of course, will be dispensed in fartcoin, a meme cryptocurrency trading at a little over 10 cents at time of publication, with a total market capitalization hovering around $130 million. Such is the promise of Pump.Fun GO, a new feature on Pump.Fun, one of the fastest-growing crypto businesses of the past few years.