Education
Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning
Wang, Jiayu, Ming, Yifei, Ke, Zixuan, Xiong, Caiming, Joty, Shafiq, Albarghouthi, Aws, Sala, Frederic
Reinforcement learning (RL) has become the dominant paradigm for improving the performance of language models on complex reasoning tasks. Despite the substantial empirical gains demonstrated by RL-based training methods like GRPO, a granular understanding of why and how RL enhances performance is still lacking. To bridge this gap, we introduce SPARKLE, a fine-grained analytic framework to dissect the effects of RL across three key dimensions: (1) plan following and execution, (2) knowledge integration, and (3) chain of subproblems. Using this framework, we gain insights beyond mere accuracy. For instance, providing models with explicit human-crafted, step-by-step plans can surprisingly degrade performance on the most challenging benchmarks, yet RL-tuned models exhibit greater robustness, experiencing markedly smaller performance drops than base or SFT models. This suggests that RL may not primarily enhance the execution of external plans but rather empower models to formulate and follow internal strategies better suited to their reasoning processes. Conversely, we observe that RL enhances models' ability to integrate provided knowledge into their reasoning process, yielding consistent gains across diverse tasks. Finally, we study whether difficult problems -- those yielding no RL signals and mixed-quality reasoning traces -- can still be effectively used for training. We introduce SparkleRL-PSS, a multi-stage RL pipeline that reuses hard problems with partial step scaffolding, guiding exploration effectively without additional data generation. Together, our findings provide a principled foundation for understanding how RL shapes model behavior, offering practical insights for building more adaptive, data-efficient, and interpretable RL pipelines for reasoning tasks. Our code, data, and checkpoints are available at: https://sparkle-reasoning.github.io/.
KOALA++: Efficient Kalman-Based Optimization with Gradient-Covariance Products
Xia, Zixuan, Davtyan, Aram, Favaro, Paolo
We propose KOALA++, a scalable Kalman-based optimization algorithm that explicitly models structured gradient uncertainty in neural network training. Unlike second-order methods, which rely on expensive second-order gradient calculations, our method directly estimates the parameter covariance matrix by recursively updating compact gradient covariance products. This design improves upon the original KOALA framework, which assumed a diagonal covariance, by implicitly capturing richer uncertainty structure without storing the full covariance matrix or performing large matrix inversions. Across diverse tasks, including image classification and language modeling, KOALA++ achieves accuracy on par with or better than state-of-the-art first- and second-order optimizers while maintaining the efficiency of first-order methods.
Self-Refining Language Model Anonymizers via Adversarial Distillation
Kim, Kyuyoung, Jeon, Hyunjun, Shin, Jinwoo
Large language models (LLMs) are increasingly used in sensitive domains, where their ability to infer personal data from seemingly benign text introduces emerging privacy risks. While recent LLM-based anonymization methods help mitigate such risks, they often rely on proprietary models (e.g., GPT-4), raising concerns about cost and the potential exposure of sensitive data to untrusted external systems. To address this, we introduce SElf-refining Anonymization with Language model (SEAL), a novel distillation framework for training small language models (SLMs) to perform effective anonymization without relying on external models at inference time. SEAL leverages adversarial interactions between an LLM anonymizer and an inference model to collect trajectories of anonymized texts and inferred attributes, which are then used to distill anonymization and critique capabilities into SLMs through supervised fine-tuning and preference learning. The resulting models learn both to anonymize text and to evaluate their outputs, enabling iterative improvement of anonymization quality via self-refinement. Experiments on SynthPAI, a dataset of synthetic personal profiles and text comments, demonstrate that SLMs trained with SEAL achieve substantial improvements in anonymization capabilities. Notably, 8B models attain a privacy-utility trade-off comparable to that of the GPT-4 anonymizer and, with self-refinement, even surpass it in terms of privacy protection. These results highlight the effectiveness of our adversarial distillation framework for training SLMs as efficient anonymizers.
LayerIF: Estimating Layer Quality for Large Language Models using Influence Functions
Askari, Hadi, Gupta, Shivanshu, Wang, Fei, Chhabra, Anshuman, Chen, Muhao
Pretrained Large Language Models (LLMs) achieve strong performance across a wide range of tasks, yet exhibit substantial variability in the various layers' training quality with respect to specific downstream applications, limiting their downstream performance. It is therefore critical to estimate layer-wise training quality in a manner that accounts for both model architecture and training data. However, existing approaches predominantly rely on model-centric heuristics (such as spectral statistics, outlier detection, or uniform allocation) while overlooking the influence of data. To address these limitations, we propose LayerIF, a data-driven framework that leverages Influence Functions to quantify the training quality of individual layers in a principled and task-sensitive manner. By isolating each layer's gradients and measuring the sensitivity of the validation loss to training examples by computing layer-wise influences, we derive data-driven estimates of layer importance. Notably, our method produces task-specific layer importance estimates for the same LLM, revealing how layers specialize for different test-time evaluation tasks. We demonstrate the utility of our scores by leveraging them for two downstream applications: (a) expert allocation in LoRA-MoE architectures and (b) layer-wise sparsity distribution for LLM pruning. Experiments across multiple LLM architectures demonstrate that our model-agnostic, influence-guided allocation leads to consistent gains in task performance.
Identifying Super Spreaders in Multilayer Networks
Czuba, Michał, Stolarski, Mateusz, Piróg, Adam, Bielak, Piotr, Bródka, Piotr
Identifying super-spreaders can be framed as a subtask of the influence maximisation problem. It seeks to pinpoint agents within a network that, if selected as single diffusion seeds, disseminate information most effectively. Multilayer networks, a specific class of heterogeneous graphs, can capture diverse types of interactions (e.g., physical-virtual or professional-social), and thus offer a more accurate representation of complex relational structures. In this work, we introduce a novel approach to identifying super-spreaders in such networks by leveraging graph neural networks. To this end, we construct a dataset by simulating information diffusion across hundreds of networks - to the best of our knowledge, the first of its kind tailored specifically to multilayer networks. We further formulate the task as a variation of the ranking prediction problem based on a four-dimensional vector that quantifies each agent's spreading potential: (i) the number of activations; (ii) the duration of the diffusion process; (iii) the peak number of activations; and (iv) the simulation step at which this peak occurs. Our model, TopSpreadersNetwork, comprises a relationship-agnostic encoder and a custom aggregation layer. This design enables generalisation to previously unseen data and adapts to varying graph sizes. In an extensive evaluation, we compare our model against classic centrality-based heuristics and competitive deep learning methods. The results, obtained across a broad spectrum of real-world and synthetic multilayer networks, demonstrate that TopSpreadersNetwork achieves superior performance in identifying high-impact nodes, while also offering improved interpretability through its structured output.
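The four-dimensional spreading vector described above can be illustrated with a small simulation. The sketch below is not the authors' pipeline: it assumes an independent-cascade diffusion model with a uniform activation probability `p`, and it uses a single-layer adjacency dictionary for brevity rather than a multilayer network. The function name `spreading_vector` and its parameters are hypothetical.

```python
import random


def spreading_vector(adjacency, seed, p=0.5, rng=None):
    """Run one independent-cascade simulation from a single seed node.

    Returns (total activations, duration in steps,
             peak number of new activations, step at which the peak occurs).
    The diffusion model and the parameter p are assumptions for illustration.
    """
    rng = rng or random.Random(0)
    active = {seed}
    frontier = [seed]
    step = 0
    peak, peak_step = 1, 0  # the seed counts as the single activation at step 0
    while frontier:
        step += 1
        newly = []
        for u in frontier:
            for v in adjacency.get(u, ()):
                # Each newly active node gets one chance to activate each neighbour.
                if v not in active and rng.random() < p:
                    active.add(v)
                    newly.append(v)
        if len(newly) > peak:
            peak, peak_step = len(newly), step
        frontier = newly
    duration = step - 1  # the final loop iteration activated nobody
    return len(active), duration, peak, peak_step


if __name__ == "__main__":
    star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
    for node in star:
        print(node, spreading_vector(star, node, p=1.0))
```

With `p=1.0` the run is deterministic, which makes it easy to see that the hub of a star graph reaches every node in one step while a leaf needs two. Ranking nodes by such vectors, averaged over many stochastic runs, is one plausible way to build training targets.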
MESS+: Dynamically Learned Inference-Time LLM Routing in Model Zoos with Service Level Guarantees
Woisetschläger, Herbert, Zhang, Ryan, Wang, Shiqiang, Jacobsen, Hans-Arno
Open-weight large language model (LLM) zoos provide access to numerous high-quality models, but selecting the appropriate model for specific tasks remains challenging and requires technical expertise. Most users simply want factually correct, safe, and satisfying responses without concerning themselves with model technicalities, while inference service providers prioritize minimizing operating costs. These competing interests are typically mediated through service level agreements (SLAs) that guarantee minimum service quality. We introduce MESS+, a stochastic optimization algorithm for cost-optimal LLM request routing while providing rigorous SLA compliance guarantees. MESS+ learns request satisfaction probabilities of LLMs in real-time as users interact with the system, based on which model selection decisions are made by solving a per-request optimization problem. Our algorithm includes a novel combination of virtual queues and request satisfaction prediction, along with a theoretical analysis of cost optimality and constraint satisfaction. Across a wide range of state-of-the-art LLM benchmarks, MESS+ achieves an average of $2\times$ cost savings compared to existing LLM routing techniques.
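To make the interplay between a virtual queue and a per-request decision concrete, here is a minimal sketch of a generic drift-plus-penalty router. It is an assumption-laden illustration, not the MESS+ algorithm itself: the scoring rule `v * cost - queue * satisfaction`, the trade-off weight `v`, the fixed satisfaction probabilities, and all function names are hypothetical.

```python
def route(costs, sat_probs, queue, v=1.0):
    """Pick the model index minimising v * cost - queue * predicted satisfaction."""
    scores = [v * c - queue * s for c, s in zip(costs, sat_probs)]
    return min(range(len(scores)), key=scores.__getitem__)


def update_queue(queue, alpha, satisfaction):
    """Virtual queue grows whenever satisfaction falls short of the SLA target alpha."""
    return max(0.0, queue + alpha - satisfaction)


def simulate(costs, sat_probs, alpha, steps, v=1.0):
    """Route `steps` requests; return (average satisfaction, average cost)."""
    queue, total_sat, total_cost = 0.0, 0.0, 0.0
    for _ in range(steps):
        m = route(costs, sat_probs, queue, v)
        total_sat += sat_probs[m]
        total_cost += costs[m]
        # Assumption: the predicted satisfaction stands in for observed feedback.
        queue = update_queue(queue, alpha, sat_probs[m])
    return total_sat / steps, total_cost / steps


if __name__ == "__main__":
    # Hypothetical zoo: a cheap, weaker model and an expensive, stronger one.
    avg_sat, avg_cost = simulate([1.0, 4.0], [0.6, 0.95], alpha=0.8, steps=2000)
    print(f"average satisfaction={avg_sat:.3f}, average cost={avg_cost:.3f}")
```

The queue acts as a running debt against the SLA: while it is small the cheap model wins, and once the accumulated shortfall is large enough the router switches to the stronger model until the debt shrinks again. A larger `v` favours cost savings at the price of a longer transient before the constraint is met.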
Improved Regret and Contextual Linear Extension for Pandora's Box and Prophet Inequality
Liu, Junyan, Chen, Ziyun, Wang, Kun, Luo, Haipeng, Ratliff, Lillian J.
We study the Pandora's Box problem in an online learning setting with semi-bandit feedback. In each round, the learner sequentially pays to open up to $n$ boxes with unknown reward distributions, observes rewards upon opening, and decides when to stop. The utility of the learner is the maximum observed reward minus the cumulative cost of opened boxes, and the goal is to minimize regret defined as the gap between the cumulative expected utility and that of the optimal policy. We propose a new algorithm that achieves $\widetilde{O}(\sqrt{nT})$ regret after $T$ rounds, which improves the $\widetilde{O}(n\sqrt{T})$ bound of Agarwal et al. [2024] and matches the known lower bound up to logarithmic factors. To better capture real-life applications, we then extend our results to a natural but challenging contextual linear setting, where each box's expected reward is linear in some known but time-varying $d$-dimensional context and the noise distribution is fixed over time. We design an algorithm that learns both the linear function and the noise distributions, achieving $\widetilde{O}(nd\sqrt{T})$ regret. Finally, we show that our techniques also apply to the online Prophet Inequality problem, where the learner must decide immediately whether or not to accept a revealed reward. In both non-contextual and contextual settings, our approach achieves similar improvements and regret bounds.
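The per-round utility defined above (maximum observed reward minus the cumulative cost of opened boxes) can be written out directly. The sketch below only illustrates that definition; the function name `pandora_utility` is hypothetical, and treating a round with no opened boxes as having zero utility is an assumption.

```python
def pandora_utility(rewards, costs, opened):
    """Utility of one round: best reward among opened boxes minus total opening cost.

    rewards[i] and costs[i] describe box i; `opened` lists the indices that were opened.
    Assumption: opening no box yields zero utility.
    """
    if not opened:
        return 0.0
    best_reward = max(rewards[i] for i in opened)
    total_cost = sum(costs[i] for i in opened)
    return best_reward - total_cost


if __name__ == "__main__":
    rewards = [0.9, 0.4, 0.7]
    costs = [0.2, 0.1, 0.1]
    # Opening boxes 1 and 2 pays both costs but keeps only the larger reward.
    print(pandora_utility(rewards, costs, opened=[1, 2]))
    # Opening only box 0 pays one cost for the highest reward.
    print(pandora_utility(rewards, costs, opened=[0]))
```

Regret then accumulates, over $T$ rounds, the gap between the expected value of this quantity under the optimal policy and under the learner's policy.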
Few-Shot Knowledge Distillation of LLMs With Counterfactual Explanations
Hamman, Faisal, Dissanayake, Pasan, Fu, Yanjun, Dutta, Sanghamitra
Knowledge distillation is a promising approach to transfer capabilities from complex teacher models to smaller, resource-efficient student models that can be deployed easily, particularly in task-aware scenarios. However, existing methods of task-aware distillation typically require substantial quantities of data, which may be unavailable or expensive to obtain in many practical scenarios. In this paper, we address this challenge by introducing a novel strategy called Counterfactual-explanation-infused Distillation (CoD) for few-shot task-aware knowledge distillation by systematically infusing counterfactual explanations. Counterfactual explanations (CFEs) refer to inputs that can flip the output prediction of the teacher model with minimum perturbation. Our strategy CoD leverages these CFEs to precisely map the teacher's decision boundary with significantly fewer samples. We provide theoretical guarantees for motivating the role of CFEs in distillation, from both statistical and geometric perspectives. We mathematically show that CFEs can improve parameter estimation by providing more informative examples near the teacher's decision boundary. We also derive geometric insights on how CFEs effectively act as knowledge probes, helping the students mimic the teacher's decision boundaries more effectively than standard data. We perform experiments across various datasets and LLMs to show that CoD outperforms standard distillation approaches in few-shot regimes (as few as 8-512 samples). Notably, CoD uses only half of the original samples used by the baselines, paired with their corresponding CFEs, and still improves performance.
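The notion of a counterfactual explanation as a minimally perturbed input that flips the teacher's prediction can be illustrated with a toy linear classifier, where the closest flipping point has a closed form. This is a sketch under strong assumptions: the teacher here is a hypothetical linear model rather than an LLM, the perturbation is measured in L2 distance, and `linear_cfe`, `predict`, and `margin` are illustrative names, not part of CoD.

```python
import math


def predict(x, w, b):
    """Binary prediction of a linear teacher: 1 if w.x + b > 0, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


def linear_cfe(x, w, b, margin=1e-3):
    """Closest point (in L2) to x that lands just across the hyperplane w.x + b = 0.

    Assumes x is not exactly on the boundary and w is not the zero vector.
    """
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    norm_sq = sum(wi * wi for wi in w)
    # Project onto the boundary, then overshoot slightly so the label flips.
    scale = (1.0 + margin) * score / norm_sq
    return [xi - scale * wi for xi, wi in zip(x, w)]


if __name__ == "__main__":
    w, b = [1.0, -2.0], 0.5
    x = [2.0, 0.5]
    x_cf = linear_cfe(x, w, b)
    print("original:", x, "->", predict(x, w, b))
    print("counterfactual:", x_cf, "->", predict(x_cf, w, b))
    print("perturbation size:", math.dist(x, x_cf))
```

Because each counterfactual sits right next to the decision boundary, a pair of an original input and its counterfactual pins down where the boundary passes between them, which is the intuition behind using such pairs as informative few-shot examples.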
Bloody Mary, Bloody Mary, Bloody Mary: How the classic sleepover party game really CAN summon a ghost in your mirror
Turn off the lights, burn a candle, look into the mirror and say the magic words: 'Bloody Mary, Bloody Mary, Bloody Mary'.
How to create three easy Halloween makeup looks with GlowUp's Axel
Axel A D Brown was one of the final three MUAs (makeup artists) on series 5 of Glow Up. Their background is in drag, sci-fi and nerd culture. This means Axel's take on using makeup to create glamorous creatures and creepy monsters is perfect for this time of year. "There's something special about Halloween that allows people to let their guard down and fully express themselves," Axel says. "Nobody really cares what you're doing because everyone's crazy and weird looking." BBC Bitesize asked Axel to show us how to create three seriously spooky glow-ups.
How to paint a frightening Frankenstein's monster head
Start with a clean, moisturised face. Green, white, red and black makeup is all you need to complete this look. Go full-on with the orange face paint and get ready to do some more line work. Use a fine paint brush and, if you're worried about wobbly lines, try holding your elbow while you draw, as this can steady a shaky hand. Don't worry if you don't have any - you can achieve the same look with a green water-based paint; just add clear hairspray to help it stay on all night.
Axel loves to do makeup looks based on creatures and monsters. "Whenever I'm sketching out ideas, the first thing that comes to me is usually a colour combination that intrigues me," Axel says.