Agents
Transportability for Bandits with Data from Different Environments
A unifying theme in the design of intelligent agents is to efficiently optimize a policy based on what prior knowledge of the problem is available and what actions can be taken to learn more about it. Bandits are a canonical instance of this task and have been intensely studied in the literature. Most methods, however, rely solely on an agent's experimentation in a single environment (or multiple closely related environments). In this paper, we relax this assumption and consider the design of bandit algorithms from a combination of batch data and qualitative assumptions about the relatedness across different environments, represented in the form of causal models. In particular, we show that it is possible to exploit invariances across environments, wherever they may occur in the underlying causal model, to consistently improve learning. The resulting bandit algorithm has a sub-linear regret bound with an explicit dependence on a term that captures how informative related environments are for the task at hand, and it may have substantially lower regret than experimentation-only bandit instances.
Microsoft's smarter Outlook taps AI agents to save you time
PCWorld highlights Microsoft's new agentic AI features for Outlook, which go beyond basic email drafting to automate inbox and calendar management. These tools can identify unreplied emails, summarize missed content, draft follow-ups, reschedule meetings, and create agendas. Access requires a Microsoft 365 Copilot for Business account and IT approval. I never really thought I'd welcome AI as a part of my business day. But Microsoft's ongoing productivity updates to Outlook actually have me tempted. By now, drafting an email using AI is old hat, and something that I generally wouldn't do. But Microsoft has begun adding agentic AI to Outlook via its experimental "Frontier" program, and it actually sounds like something that could save real time and energy.
CODA: Coordination via On-Policy Diffusion for Multi-Agent Offline Reinforcement Learning
Marcel Hedman, Kale-ab Abebe Tessera, Juan Claude Formanek, Anya Sims, Riccardo Zamboni, Trevor McInroe, John Torr, Elliot Fosong
Offline multi-agent reinforcement learning (MARL) enables policy learning from fixed datasets, but is prone to coordination failure: agents trained on static, off-policy data converge to suboptimal joint behaviours because they cannot co-adapt as their policies change. We introduce CODA (Coordination via On-Policy Diffusion for Multi-Agent Reinforcement Learning), a diffusion-based multi-agent trajectory generator for data augmentation that samples conditioned on the current joint policy, producing synthetic experience which reflects the evolving behaviours of the agents, thereby providing a mechanism for co-adaptation. We find that previous diffusion-based augmentation approaches are insufficient for fostering multi-agent coordination because they produce static augmented datasets that do not evolve as the current joint policy changes during training; CODA resolves this by more closely simulating on-policy learning and is a meaningful step toward coordinated behaviours in the offline setting. CODA is algorithm-agnostic and can be layered onto both model-free and model-based offline reinforcement learning pipelines as an augmentation module. Empirically, CODA not only resolves canonical coordination pathologies in continuous polynomial games but also delivers strong results on the more complex MaMuJoCo continuous-control benchmarks.
Maryna Viazovska's proofs of sphere packing formalized with AI
The proofs that earned EPFL professor Maryna Viazovska the Fields Medal in 2022 have reached a new milestone: their complete formalization by computer, achieved through a collaboration between mathematicians and artificial intelligence tools. In 2016, Maryna Viazovska solved the sphere packing problem in dimension 8, proving that the E8 lattice constitutes the densest possible arrangement. Shortly after, together with collaborators, she established an analogous result in dimension 24 using the Leech lattice. Her method provided an elegant solution to a problem studied for centuries, with close ties to applied fields such as error-correcting codes. For this major contribution, Viazovska was awarded the Fields Medal in 2022, the highest distinction in mathematics.
Details
A.1 Difference between the performance of two joint policies
Section 3.1 gives an expression for the difference between the performance of two joint policies. The proof is a multi-agent version of the proof in (Kakade and Langford, 2002). We now give the mathematical details.
A.2 Approximation that matches the true value to first order
In Section 3.1, we claim that $J_{\pi}(\tilde{\pi})$ matches $J(\tilde{\pi})$ to first order. Intuitively, this means that a sufficiently small update of the joint policy that improves $J_{\pi}(\tilde{\pi})$ will also improve $J(\tilde{\pi})$. We now prove this formally.
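For orientation, the standard single-agent forms of the two statements above can be sketched as follows. This is not the paper's own derivation. The notation is assumed rather than taken from the source: $\pi$ and $\tilde{\pi}$ denote the current and updated policies, $A_{\pi}$ the advantage function of $\pi$, $\rho_{\pi}$ the discounted state-visitation frequency under $\pi$, and $\theta$ a differentiable policy parameterization. In the multi-agent reading, $a$ would stand for the joint action.

```latex
% Assumed notation (not from the source): \tilde{\pi} is the updated policy,
% A_{\pi} the advantage of \pi, \rho_{\pi} the discounted state-visitation
% frequency under \pi, and \theta a differentiable policy parameterization.
\begin{align}
% Performance difference identity (Kakade and Langford, 2002)
J(\tilde{\pi}) &= J(\pi)
  + \mathbb{E}_{\tau \sim \tilde{\pi}}
    \Big[ \sum_{t=0}^{\infty} \gamma^{t} A_{\pi}(s_t, a_t) \Big] \\
% Local approximation: states are drawn from \pi instead of \tilde{\pi}
J_{\pi}(\tilde{\pi}) &= J(\pi)
  + \sum_{s} \rho_{\pi}(s) \sum_{a} \tilde{\pi}(a \mid s)\, A_{\pi}(s, a) \\
% First-order match at the current parameters \theta_0
J_{\pi_{\theta_0}}(\pi_{\theta_0}) &= J(\pi_{\theta_0}), \qquad
\nabla_{\theta} J_{\pi_{\theta_0}}(\pi_{\theta}) \big|_{\theta = \theta_0}
  = \nabla_{\theta} J(\pi_{\theta}) \big|_{\theta = \theta_0}
\end{align}
```

The third line is what "matches to first order" usually means in this setting: the approximation and the true objective agree in value and in gradient at the current policy, so a sufficiently small step that increases one also increases the other.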