Don't Learn, Ground: A Case for Natural Language Inference with Visual Grounding
Ignatev, Daniil, Santeer, Ayman, Gatt, Albert, Paperno, Denis
We propose a zero-shot method for Natural Language Inference (NLI) that leverages multimodal representations by grounding language in visual contexts. Our approach generates visual representations of premises using text-to-image models and performs inference by comparing these representations with textual hypotheses. We evaluate two inference techniques: cosine similarity and visual question answering. Our method achieves high accuracy without task-specific fine-tuning, demonstrating robustness against textual biases and surface heuristics. Additionally, we design a controlled adversarial dataset to validate the robustness of our approach. Our findings suggest that leveraging visual modality as a meaning representation provides a promising direction for robust natural language understanding.
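The cosine-similarity variant can be illustrated with a minimal sketch, assuming an off-the-shelf text-to-image model and a CLIP-style joint image-text encoder; the model names and the decision threshold below are illustrative assumptions, not the authors' exact pipeline.

```python
# Hedged sketch: zero-shot NLI by grounding the premise in a generated image
# and scoring the hypothesis against it with a CLIP-style encoder.
# Model names and the decision threshold are illustrative assumptions.
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

t2i = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def entails(premise: str, hypothesis: str, threshold: float = 0.25) -> bool:
    # Ground the premise: render it as an image with a text-to-image model.
    image = t2i(premise).images[0]
    inputs = proc(text=[hypothesis], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = clip.get_text_features(input_ids=inputs["input_ids"],
                                         attention_mask=inputs["attention_mask"])
    # Cosine similarity between the grounded premise and the hypothesis text.
    sim = torch.nn.functional.cosine_similarity(img_emb, txt_emb).item()
    return sim >= threshold  # above threshold -> treat as entailment
```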
Online Mixture of Experts: No-Regret Learning for Optimal Collective Decision-Making
We explore the use of expert-guided bandit learning, which we refer to as online mixture-of-experts (OMoE). In this setting, given a context, a candidate committee of experts must determine how to aggregate their outputs to achieve optimal results in terms of aggregate accuracy. We propose two algorithms to address this problem. The first algorithm combines aggregate voting with UCB-driven successive elimination, efficiently pruning suboptimal exploration actions. The second algorithm employs an online weighted-majority-voting mechanism, weighting each expert's vote in proportion to its predictive power. We derive theoretical regret guarantees in the bandit setting under idealized assumptions and provide corresponding empirical results. As an application, these methods are applied to the online fine-tuning of a set of expert large language models (LLMs), where after each response, the generative LLM dynamically reweighs its set of experts and/or selects the optimal committee of experts to generate the most accurate response. Our results introduce new methodologies and no-regret guarantees for combining multiple experts to improve on the performance of the aggregate model overall.
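The weighted-majority-voting side can be pictured with the classic multiplicative-weights scheme below; this is a minimal sketch under a binary-label assumption with an illustrative penalty factor, not the paper's exact formulation.

```python
# Hedged sketch: online weighted majority voting over a committee of experts
# (multiplicative-weights update). The penalty factor and binary-label setting
# are illustrative assumptions.
import numpy as np

class WeightedMajority:
    def __init__(self, n_experts: int, eta: float = 0.5):
        self.weights = np.ones(n_experts)
        self.eta = eta  # penalty factor applied to mistaken experts

    def predict(self, votes: np.ndarray) -> int:
        # votes[i] in {0, 1}; aggregate by total weight behind each label.
        score_1 = self.weights[votes == 1].sum()
        score_0 = self.weights[votes == 0].sum()
        return int(score_1 >= score_0)

    def update(self, votes: np.ndarray, truth: int) -> None:
        # Shrink the weight of every expert that voted incorrectly.
        self.weights[votes != truth] *= (1.0 - self.eta)

# Usage: each round, collect the committee's votes, predict, then reveal the label.
wm = WeightedMajority(n_experts=5)
votes = np.array([1, 0, 1, 1, 0])
pred = wm.predict(votes)
wm.update(votes, truth=1)
```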
Explore Data Left Behind in Reinforcement Learning for Reasoning Language Models
Liu, Chenxi, Liang, Junjie, Jia, Yuqi, Cao, Bochuan, Bai, Yang, Huang, Heng, Chen, Xun
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach for improving the reasoning abilities of large language models (LLMs). The Group Relative Policy Optimization (GRPO) family has demonstrated strong performance in training LLMs with RLVR. However, as models train longer and scale larger, more training prompts become residual prompts: prompts with zero-variance rewards that provide no training signal. Consequently, fewer prompts contribute to training, reducing diversity and hindering effectiveness. To fully exploit these residual prompts, we propose the Explore Residual Prompts in Policy Optimization (ERPO) framework, which encourages exploration on residual prompts and reactivates their training signals. ERPO maintains a history tracker for each prompt and adaptively increases the sampling temperature for residual prompts that previously produced all correct responses. This encourages the model to generate more diverse reasoning traces, introducing incorrect responses that revive training signals. Empirical results on the Qwen2.5 series demonstrate that ERPO consistently surpasses strong baselines across multiple mathematical reasoning benchmarks.
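The core mechanism described in the abstract can be sketched as a small per-prompt tracker; the binary-reward assumption, temperature step, and cap below are illustrative choices, not the authors' exact schedule.

```python
# Hedged sketch of the ERPO idea: for prompts whose last sampled group was
# all-correct (zero reward variance, hence no GRPO signal), raise the sampling
# temperature to encourage more diverse rollouts next time.
class ResidualPromptTracker:
    def __init__(self, base_temp=1.0, temp_step=0.2, max_temp=1.6):
        self.base_temp, self.temp_step, self.max_temp = base_temp, temp_step, max_temp
        self.temps = {}  # prompt id -> current sampling temperature

    def temperature(self, prompt_id: str) -> float:
        return self.temps.get(prompt_id, self.base_temp)

    def record_group(self, prompt_id: str, rewards: list) -> None:
        # Assumes binary verifiable rewards: 1.0 = correct, 0.0 = incorrect.
        if rewards and all(r == 1.0 for r in rewards):
            # Residual prompt (all correct): explore more on the next pass.
            new_t = min(self.temperature(prompt_id) + self.temp_step, self.max_temp)
            self.temps[prompt_id] = new_t
        else:
            # The prompt yields a useful training signal again; reset.
            self.temps[prompt_id] = self.base_temp
```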
Explaining Decisions in ML Models: a Parameterized Complexity Analysis (Part I)
Ordyniak, Sebastian, Paesani, Giacomo, Rychlicki, Mateusz, Szeider, Stefan
This paper presents a comprehensive theoretical investigation into the parameterized complexity of explanation problems in various machine learning (ML) models. Contrary to the prevalent black-box perception, our study focuses on models with transparent internal mechanisms. We address two principal types of explanation problems: abductive and contrastive, both in their local and global variants. Our analysis encompasses diverse ML models, including Decision Trees, Decision Sets, Decision Lists, Boolean Circuits, and ensembles thereof, each offering unique explanatory challenges. This research fills a significant gap in explainable AI (XAI) by providing a foundational understanding of the complexities of generating explanations for these models. This work provides insights vital for further research in the domain of XAI, contributing to the broader discourse on the necessity of transparency and accountability in AI systems.
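For readers less familiar with the two query types, a standard formalization of a local abductive explanation (not necessarily the paper's exact parameterization) is: a subset $S$ of features is an abductive explanation for an instance $x$ with predicted class $c = f(x)$ if
\[
  \forall x' :\; \bigl(x'_i = x_i \ \text{for all } i \in S\bigr) \;\Rightarrow\; f(x') = c,
\]
i.e., fixing $x$ on $S$ forces the prediction. Dually, a contrastive explanation is a subset $S$ such that changing $x$ only on the features in $S$ can flip the predicted class.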
MASPRM: Multi-Agent System Process Reward Model
Yazdani, Milad, Mostajabdaveh, Mahdi, Zhou, Zirui, Xiong, Ying
Practical deployment of Multi-Agent Systems (MAS) demands strong test-time performance, motivating methods that guide inference-time search and selectively spend compute to improve quality. We present the Multi-Agent System Process Reward Model (MASPRM). It assigns per-action, per-agent values to partial inter-agent transcripts and acts as an inference-time controller. MASPRM is trained from multi-agent Monte Carlo Tree Search (MCTS) rollouts without requiring step-level human annotations, by propagating returns to local targets. At inference, MASPRM guides step-level beam search and MCTS, focusing computation on promising branches and pruning early. On GSM8K and MATH, MASPRM-guided decoding, with an outcome reward model (ORM) applied to the final answer, improves exact match (EM) over a single straight-through MAS pass by $+30.7$ and $+22.9$ points, respectively. A MASPRM trained on GSM8K transfers zero-shot to MATH without retraining, adding $8.4$ EM points at the same budget. MASPRM is a plug-in value model that estimates per-agent progress and complements verifier-style decoders, enabling more reliable, compute-aware multi-agent reasoning. Code: https://github.com/milad1378yz/MASPRM
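The controller role can be illustrated with a step-level beam search sketch; `propose_next_steps` and `prm_score` below are hypothetical stand-ins for the agents' step generator and MASPRM's per-action value estimate, and the beam sizes are illustrative.

```python
# Hedged sketch: using a process reward model (PRM) as an inference-time
# controller for step-level beam search over a multi-agent transcript.
def prm_beam_search(prompt, propose_next_steps, prm_score,
                    beam_width=4, expansions_per_state=4, max_steps=8):
    beams = [([], 0.0)]  # (partial transcript, cumulative PRM value)
    for _ in range(max_steps):
        candidates = []
        for transcript, value in beams:
            for step in propose_next_steps(prompt, transcript, k=expansions_per_state):
                new_transcript = transcript + [step]
                # Score the partial inter-agent transcript after this action.
                candidates.append((new_transcript,
                                   value + prm_score(prompt, new_transcript)))
        if not candidates:
            break
        # Keep only the most promising branches; prune the rest early.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]  # highest-value transcript under the PRM
```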
Sample Smart, Not Hard: Correctness-First Decoding for Better Reasoning in LLMs
Li, Xueyan, Su, Guinan, Sachan, Mrinmaya, Geiping, Jonas
Large Language Models (LLMs) are increasingly applied to complex tasks that require extended reasoning. In such settings, models often benefit from diverse chains-of-thought to arrive at multiple candidate solutions. This requires balancing two competing objectives: injecting enough stochasticity to explore multiple reasoning chains, and ensuring sufficient accuracy and quality in each path. Existing works pursue the first objective by increasing exploration at highly uncertain steps with higher temperature or larger candidate token sets, while others improve reliability by rejecting samples with low confidence post-generation, implying that low confidence correlates with low answer quality. These two lines of thought are in conflict, as they conflate different sources of uncertainty. To resolve this, we argue that the decoding rule should be calibrated by correctness, not confidence alone: we should sample from tokens with higher estimated correctness, and reduce sampling where expected correctness is low. We propose simple strategies that achieve this goal: Greedy-Threshold makes sampling greedy at very low confidence steps, while Calibrated-TopK and Calibrated-ε set truncation thresholds based on estimated rank-wise correctness.
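The Greedy-Threshold rule, as described in the abstract, admits a very small sketch; the threshold value below is an illustrative assumption, not the paper's calibrated setting.

```python
# Hedged sketch of the Greedy-Threshold idea: fall back to greedy decoding
# whenever the model's top-token confidence is very low, and sample with the
# usual temperature otherwise. The threshold is an illustrative assumption.
import torch

def greedy_threshold_sample(logits: torch.Tensor, temperature: float = 1.0,
                            greedy_below: float = 0.1) -> int:
    probs = torch.softmax(logits, dim=-1)
    if probs.max().item() < greedy_below:
        # Very low confidence: sampling here mostly injects noise, so be greedy.
        return int(torch.argmax(probs))
    # Otherwise sample from the temperature-scaled distribution.
    scaled = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(scaled, num_samples=1))
```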
TokUR: Token-Level Uncertainty Estimation for Large Language Model Reasoning
Zhang, Tunyu, Shi, Haizhou, Wang, Yibin, Wang, Hengyi, He, Xiaoxiao, Li, Zhuowei, Chen, Haoxian, Han, Ligong, Xu, Kai, Zhang, Huan, Metaxas, Dimitris, Wang, Hao
While Large Language Models (LLMs) have demonstrated impressive capabilities, their output quality remains inconsistent across various application scenarios, making it difficult to identify trustworthy responses, especially in complex tasks requiring multi-step reasoning. In this paper, we propose a Token-level Uncertainty estimation framework for Reasoning (TokUR) that enables LLMs to self-assess and self-improve their responses in mathematical reasoning. Specifically, we introduce low-rank random weight perturbation during LLM decoding to generate predictive distributions for token-level uncertainty estimation, and we aggregate these uncertainty quantities to capture the semantic uncertainty of generated responses. Experiments on mathematical reasoning datasets of varying difficulty demonstrate that TokUR exhibits a strong correlation with answer correctness and model robustness, and the uncertainty signals produced by TokUR can be leveraged to enhance the model's reasoning performance at test time. These results highlight the effectiveness of TokUR as a principled and scalable approach for improving the reliability and interpretability of LLMs in challenging reasoning tasks.
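The aggregation step can be sketched in isolation: assuming K perturbed forward passes have already produced token-level predictive distributions for one response (the low-rank weight perturbation itself is omitted), the per-token uncertainties are pooled into a single response-level score. The pooling rule below (mean predictive entropy) is an illustrative choice, not necessarily the paper's.

```python
# Hedged sketch of the aggregation step only.
import torch

def response_uncertainty(per_pass_probs: torch.Tensor) -> float:
    # per_pass_probs: [K passes, T tokens, V vocab] predictive distributions
    # for the tokens actually generated in one response.
    mean_probs = per_pass_probs.mean(dim=0)                                   # [T, V]
    token_entropy = -(mean_probs * mean_probs.clamp_min(1e-12).log()).sum(-1)  # [T]
    return token_entropy.mean().item()  # higher -> less trustworthy response
```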
Single-stream Policy Optimization
We revisit policy-gradient optimization for Large Language Models (LLMs) from a single-stream perspective. Prevailing group-based methods like GRPO reduce variance with on-the-fly baselines but suffer from critical flaws: frequent degenerate groups erase learning signals, and synchronization barriers hinder scalability. We introduce Single-stream Policy Optimization (SPO), which eliminates these issues by design. SPO replaces per-group baselines with a persistent, KL-adaptive value tracker and normalizes advantages globally across the batch, providing a stable, low-variance learning signal for every sample. Being group-free, SPO enables higher throughput and scales effectively in long-horizon or tool-integrated settings where generation times vary. Furthermore, the persistent value tracker naturally enables an adaptive curriculum via prioritized sampling. Experiments using Qwen3-8B show that SPO converges more smoothly and attains higher accuracy than GRPO, while eliminating computation wasted on degenerate groups. Ablation studies confirm that SPO's gains stem from its principled approach to baseline estimation and advantage normalization, offering a more robust and efficient path for LLM reasoning. Across five hard math benchmarks with Qwen3-8B, SPO improves the average maj@32 by +3.4 percentage points (pp) over GRPO, driven by substantial absolute point gains on challenging datasets, including +7.3 pp on BRUMO 25, +4.4 pp on AIME 25, +3.3 pp on HMMT 25, and achieves consistent relative gains in pass@$k$ across the evaluated $k$ values. SPO's success challenges the prevailing trend of adding incidental complexity to RL algorithms, highlighting a path where fundamental principles, not architectural workarounds, drive the next wave of progress in LLM reasoning.
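The single-stream baseline idea can be sketched as follows; the exponential-moving-average update stands in for the paper's KL-adaptive tracker, which is more involved, and the batch layout is an illustrative assumption.

```python
# Hedged sketch: a persistent per-prompt value tracker replaces per-group
# baselines, and advantages are normalized globally across the batch.
import numpy as np

class ValueTracker:
    def __init__(self, beta=0.9):
        self.values, self.beta = {}, beta

    def baseline(self, prompt_id):
        return self.values.get(prompt_id, 0.5)

    def update(self, prompt_id, reward):
        v = self.baseline(prompt_id)
        self.values[prompt_id] = self.beta * v + (1 - self.beta) * reward

def single_stream_advantages(batch, tracker):
    # batch: list of (prompt_id, reward) pairs, one sampled response per prompt.
    adv = np.array([r - tracker.baseline(pid) for pid, r in batch])
    for pid, r in batch:
        tracker.update(pid, r)
    # Normalize globally across the batch, not within per-prompt groups.
    return (adv - adv.mean()) / (adv.std() + 1e-8)
```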
Improving LLM Agents with Reinforcement Learning on Cryptographic CTF Challenges
Muzsai, Lajos, Imolai, David, Lukács, András
We present 'Random-Crypto', a procedurally generated cryptographic Capture The Flag (CTF) dataset designed to unlock the potential of Reinforcement Learning (RL) for LLM-based agents in security-sensitive domains. Cryptographic reasoning offers an ideal RL testbed: it combines precise validation, structured multi-step inference, and reliance on reliable computational tool use. Leveraging these properties, we fine-tune a Python tool-augmented Llama-3.1-8B via Group Relative Policy Optimization (GRPO) in a secure execution environment. The resulting agent achieves a significant improvement in Pass@8 on previously unseen challenges. Moreover, the improvements generalize to two external benchmarks: 'picoCTF', spanning both crypto and non-crypto tasks, and 'AICrypto MCQ', a multiple-choice benchmark of 135 cryptography questions. Ablation studies attribute the gains to enhanced tool usage and procedural reasoning. These findings position 'Random-Crypto' as a rich training ground for building intelligent, adaptable LLM agents capable of handling complex cybersecurity tasks.
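The "precise validation" property amounts to a verifiable reward; a minimal sketch is below, where the flag format is an illustrative assumption rather than the dataset's actual convention.

```python
# Hedged sketch of a verifiable CTF reward: reward 1.0 only if the agent's
# final answer contains the exact flag for the generated challenge.
import re

def ctf_reward(agent_response: str, true_flag: str) -> float:
    # Extract anything that looks like a submitted flag, e.g. flag{...}.
    submitted = re.findall(r"flag\{[^}]*\}", agent_response)
    return 1.0 if true_flag in submitted else 0.0
```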