penalty
Adaptive Latent-Space Constraints in Personalized Federated Learning
Federated learning (FL) is an effective and widely used approach to training deep learning models on decentralized datasets held by distinct clients. FL also strengthens both security and privacy protections for training data. Common challenges associated with statistical heterogeneity between distributed datasets have spurred significant interest in personalized FL (pFL) methods, where models combine aspects of global learning with local modeling specific to each client's unique characteristics. This work investigates the efficacy of theoretically supported, adaptive MMD measures in pFL, primarily focusing on the Ditto framework, a state-of-the-art technique for distributed data heterogeneity. The use of such measures significantly improves model performance across a variety of tasks, especially those with pronounced feature heterogeneity. Additional experiments demonstrate that such measures are directly applicable to other pFL techniques and yield similar improvements across a number of datasets. Finally, the results motivate the use of constraints tailored to the various kinds of heterogeneity expected in FL systems.
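For readers unfamiliar with MMD (maximum mean discrepancy), the following is a minimal sketch of a fixed-bandwidth, biased MMD estimate between two sets of latent features. It is not the adaptive measure the abstract refers to, whose form is not given here. The names `local_feats`, `global_feats_close`, and `global_feats_shifted`, the Gaussian kernel, and the bandwidth of 1.0 are all assumptions made for illustration.

```python
import math
import random

def rbf_kernel(u, v, bandwidth):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mean_kernel(xs, ys, bandwidth):
    """Average kernel value over all pairs (x, y)."""
    total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    return (mean_kernel(xs, xs, bandwidth)
            + mean_kernel(ys, ys, bandwidth)
            - 2.0 * mean_kernel(xs, ys, bandwidth))

def sample_features(rng, n, dim, mean):
    """Draw n synthetic feature vectors from an isotropic Gaussian."""
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(n)]

# Synthetic stand-ins for latent features of a local and a global model.
rng = random.Random(0)
local_feats = sample_features(rng, 100, 4, mean=0.0)
global_feats_close = sample_features(rng, 100, 4, mean=0.0)
global_feats_shifted = sample_features(rng, 100, 4, mean=2.0)

mmd_close = mmd_squared(local_feats, global_feats_close)
mmd_shifted = mmd_squared(local_feats, global_feats_shifted)
print(f"MMD^2 (similar features): {mmd_close:.4f}")
print(f"MMD^2 (shifted features): {mmd_shifted:.4f}")
```

In a penalty of this kind, the estimate would be added to the local training loss so that larger feature drift costs more. How that term is weighted or adapted is left open in this sketch.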
Large language models can learn and generalize steganographic chain-of-thought under process supervision
Chain-of-thought (CoT) reasoning not only enhances large language model performance but also provides critical insights into decision-making processes, marking it as a useful tool for monitoring model intent and planning. However, recent works have shown that banning the mention of a specific example of reward hacking causes obfuscation of the undesired reasoning traces but the persistence of the undesired behavior, threatening the reliability of CoT monitoring. We provide an extension to these results with regard to the ability of models to learn a specific type of obfuscated reasoning: steganography. First, we show that penalizing the use of specific strings within load-bearing reasoning traces causes models to substitute alternative strings. Crucially, this does not alter the underlying method by which the model performs the task, demonstrating that the model can learn to steganographically encode its reasoning. We further demonstrate that models can generalize an encoding scheme. When the penalized strings belong to an overarching class, the model learns not only to substitute strings seen in training, but also develops a general encoding scheme for all members of the class which it can apply to held-out testing strings.
The Overthinker's DIET: Cutting Token Calories with DIfficulty-AwarE Training
Recent large language models (LLMs) exhibit impressive reasoning but often overthink, generating excessively long responses that hinder efficiency. We introduce DIET (DIfficulty-AwarE Training), a framework that systematically cuts these "token calories" by integrating on-the-fly problem difficulty into the reinforcement learning (RL) process. DIET dynamically adapts token compression strategies by modulating token penalty strength and conditioning target lengths on estimated task difficulty to optimize the performance-efficiency trade-off. We also theoretically analyze the pitfalls of naive reward weighting in group-normalized RL algorithms like GRPO and propose an Advantage Weighting technique, which enables stable and effective implementation of these difficulty-aware objectives. Experimental results demonstrate that DIET significantly reduces token counts while simultaneously improving reasoning performance. Beyond raw token reduction, we show two crucial benefits largely overlooked by prior work: (1) DIET leads to superior inference scaling. By maintaining high per-sample quality with fewer tokens, it enables better scaling performance via majority voting with more samples under fixed computational budgets, an area where other methods falter.
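The following sketch illustrates one way naive reward weighting can go wrong under group normalization. It is an assumption about the kind of pitfall meant, not the paper's analysis. The reward form (correctness minus a per-token penalty) and all names are hypothetical. When every response in a group is correct, standardizing the rewards removes the penalty strength entirely, so a weak and a strong penalty produce the same advantages.

```python
import statistics

def group_normalized_advantages(rewards):
    """GRPO-style advantages: standardize rewards within one prompt's group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

def token_penalized_rewards(correct, lengths, penalty_strength):
    """Hypothetical reward: correctness minus a per-token length penalty."""
    return [c - penalty_strength * n for c, n in zip(correct, lengths)]

# One group of sampled responses to the same prompt, all of them correct.
correct = [1.0, 1.0, 1.0, 1.0]
lengths = [100, 200, 300, 400]

adv_weak = group_normalized_advantages(
    token_penalized_rewards(correct, lengths, penalty_strength=1e-4))
adv_strong = group_normalized_advantages(
    token_penalized_rewards(correct, lengths, penalty_strength=1e-2))

# Both lists match: normalization cancels the penalty strength.
print(adv_weak)
print(adv_strong)

# Applying a weight after normalization keeps it visible to the optimizer.
# This is only an illustration of the general idea, not the paper's formula.
difficulty_weight = 0.25  # hypothetical difficulty-dependent weight
rescaled_adv = [difficulty_weight * a for a in adv_weak]
print(rescaled_adv)
```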
05057404e0cab4fe58971dc3a7d6044c-Supplemental-Datasets_and_Benchmarks_Track.pdf
The authors would like to thank Ulrich-Michael, Frances, James, Maryam, and Mandolyn for their help in labeling the dataset. The work at the Université de Montréal was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) (Paull), an NSERC PGS D Scholarship (Morin) and an FRQNT Doctoral Scholarship (Morin). Moreover, this research was enabled in part by compute resources provided by Mila (mila.quebec). The work at the University of Freiburg was funded by an academic grant from NVIDIA. The work at the University of Oxford was supported by a Royal Society University Research Fellowship (Fallon, Kassab), a Sellafield Robotics and AI Centre of Excellence Grant, and EPSRC C2C Grant EP/Z531212/1 (Mattamala), and the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No.
NBA needs to incorporate 'mistaken identity' rule from FIFA World Cup to stop the flopping issue
OutKick Analysis. The World Cup's use of the rule offers a blueprint for real-time consequences. Flopping is a major issue in the NBA. I've written about it ad nauseam.
The league has anti-flopping measures in place, but it rarely dishes out fines based on reviews after the conclusion of games, and in-game flopping calls are even more of a rarity.
Cameras, Sensors, and 3D Body Scans: All the Tech Helping Eliminate Blown Calls
Soccer officials already rely on cameras to see who's offside and who sent the ball out of bounds. But during this World Cup, refs will use digital twins of each player to view plays from every angle. At the 2026 World Cup, the refs on the field and the officials on the sidelines will be able to use an abundance of tech to help call penalties, spot offside violations, and make other consequential decisions. The video assistant referee system, known as VAR, and the semi-automated offside technology (SAOT) have been used in soccer for years. But the setup at this summer's World Cup represents some of the most advanced uses of adjudication tech to date, not just in soccer but across all high-level sports.
Is the acquisition worth the cost? Surrogate losses for Consistent Two-stage Classifiers
Recent years have witnessed the emergence of a spectrum of foundation models, covering a broad range of capabilities and costs. Often, we effectively use foundation models as feature generators and train classifiers that use the outputs of these models to make decisions. In this paper, we consider an increasingly relevant setting where we have two classifier stages. The first stage has access to features $x$ and has the option to make a classification decision or defer, while incurring a cost, to a second classifier that has access to features $x$ and $z$. This is similar to the ``learning to defer'' setting, with the important difference that we train both classifiers jointly, and the second classifier has access to more information. The natural loss for this setting is an $\ell_{01c}$ loss, where a penalty is paid for incorrect classification, as in $\ell_{01}$, but an additional penalty $c$ is paid for consulting the second classifier. The $\ell_{01c}$ loss is unwieldy for training. Our primary contribution in this paper is the derivation of a hinge-based surrogate loss $\ell^c_{hinge}$ that is much more amenable to training but also satisfies the property that $\ell^c_{hinge}$-consistency implies $\ell_{01c}$-consistency.
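As a concrete reading of the $\ell_{01c}$ loss described above, here is a minimal sketch that assumes the loss is additive: the usual 0-1 error on the final decision plus the cost $c$ whenever the second classifier is consulted. The function name and argument names are hypothetical, and the paper's exact definition may differ.

```python
def two_stage_01c_loss(y_true, first_pred, second_pred, defer, cost):
    """0-1 loss on the final prediction, plus `cost` if stage two is consulted.

    Assumes an additive form; this is a sketch, not the paper's definition.
    """
    final_pred = second_pred if defer else first_pred
    misclassified = 1.0 if final_pred != y_true else 0.0
    return misclassified + (cost if defer else 0.0)

c = 0.2  # hypothetical deferral cost

# Stage one decides alone and is right: no loss.
print(two_stage_01c_loss(1, first_pred=1, second_pred=0, defer=False, cost=c))
# Stage one decides alone and is wrong: pays the 0-1 penalty.
print(two_stage_01c_loss(1, first_pred=0, second_pred=1, defer=False, cost=c))
# Deferring to a correct stage two: pays only the deferral cost.
print(two_stage_01c_loss(1, first_pred=0, second_pred=1, defer=True, cost=c))
# Deferring to a wrong stage two: pays both penalties.
print(two_stage_01c_loss(1, first_pred=1, second_pred=0, defer=True, cost=c))
```

Under this reading, deferral is only worthwhile when the second stage's expected reduction in error exceeds $c$, which is the trade-off the title alludes to.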
Geometry of Relaxed Fair Regression: A Unified Framework for Aware and Unaware Settings
Lince, M. Generali, Divol, V., Flamary, R., Gaucher, S., Loiseau, P.
Fairness-accuracy trade-offs are a central concern in the deployment of fairness-aware machine learning methods. When sensitive attributes are unavailable at inference time, the so-called unawareness setting, principled methods for obtaining accurate predictions under relaxed fairness constraints are largely missing. In this work, we address this gap by formulating regression under a demographic parity penalty as an optimal transport problem. Our framework unifies both the \emph{aware} and \emph{unaware} settings and characterizes optimal prediction functions via optimal transport maps, under both squared Wasserstein-2 and Total Variation penalties. These results reveal that the choice of penalty reflects fundamentally different fairness philosophies: the Wasserstein penalty induces a smooth, population-wide compromise, while Total Variation enforces exact parity for a subset of individuals. Building on these theoretical characterizations, we propose an algorithm that is simple to implement, computationally efficient, and consistently matches or outperforms state-of-the-art baselines on real-world benchmarks.
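To make the squared Wasserstein-2 penalty concrete, the sketch below computes it between the predictions of two groups in one dimension, where it reduces to matching sorted values. It then moves every prediction part of the way toward the quantile-wise midpoint, a toy version of a population-wide compromise. This is an illustration under assumptions (two equally sized groups, scalar predictions, hypothetical names), not the algorithm proposed in the paper.

```python
def squared_w2_1d(a, b):
    """Squared Wasserstein-2 distance between two equal-size 1-D samples."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equally sized samples")
    a_sorted, b_sorted = sorted(a), sorted(b)
    return sum((x - y) ** 2 for x, y in zip(a_sorted, b_sorted)) / len(a)

def shrink_toward_midpoint(a, b, t):
    """Move both groups a fraction t of the way to their quantile-wise midpoint.

    Returns the adjusted predictions in sorted order.
    """
    a_sorted, b_sorted = sorted(a), sorted(b)
    mid = [(x + y) / 2.0 for x, y in zip(a_sorted, b_sorted)]
    new_a = [(1.0 - t) * x + t * m for x, m in zip(a_sorted, mid)]
    new_b = [(1.0 - t) * y + t * m for y, m in zip(b_sorted, mid)]
    return new_a, new_b

# Hypothetical model predictions for two demographic groups.
group_a = [0.2, 0.4, 0.9, 0.5]
group_b = [1.0, 1.4, 1.1, 1.9]

penalty_before = squared_w2_1d(group_a, group_b)
relaxed_a, relaxed_b = shrink_toward_midpoint(group_a, group_b, t=0.5)
penalty_after = squared_w2_1d(relaxed_a, relaxed_b)

print(f"penalty before: {penalty_before:.5f}")
print(f"penalty after:  {penalty_after:.5f}")
```

In this toy setting every individual's prediction moves a little, and the penalty shrinks by a factor of $(1 - t)^2$. Setting `t=1.0` makes the two groups' prediction distributions identical.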
Causal Representation Learning for Generalisable Recommendation
Felekis, Yorgos, O'Riordan, Michael, Corcoll, Oriol, Gilligan-Lee, Ciarán M.
Predictive models trained on observational data often fail to generalise to the distributions they encounter when deployed, especially when the training data is a product of the system being optimised. Recommender systems are a canonical example: they are trained on interaction logs confounded by the deployed policy, past user behaviour, and platform filtering. As a result, the training distribution differs substantially from the candidate distribution scored at serving time, a gap that makes offline metrics unreliable predictors of online performance. We address the distribution shift problem with a method motivated by causal representation learning (CRL). We propose an information-theoretic disentanglement criterion and prove that its optimum depends only on the causal components of the input. We then derive a tractable variational lower bound that makes the criterion optimisable from finite observational data alone. The scope of our method is narrower than that of much of the CRL literature, in that we target better generalisation under distribution shift, not full identification of all latent causal factors. This narrower target is what makes the method practical, requiring only the existing confounded logs, applying to any standard supervised model, and adding no inference-time cost. Our headline evaluation is an A/B test with millions of users on Spotify, applied to a production ranker for personalised playlist generation. A capacity-matched CRL variant performed on par offline but delivered substantial online gains in listener engagement. Complementary evidence on the public KuaiRand recommendation dataset and a synthetic benchmark with known causal structure shows the same pattern: offline parity with baseline, gains under distribution shift. Across all three settings, adding our causal disentanglement objective yields meaningfully better out-of-distribution generalisation.