TOIST: Task Oriented Instance Segmentation Transformer with Noun-Pronoun Distillation (Supplementary Material)

Neural Information Processing Systems

As mentioned in Section 3 (Formulation) of the main paper, in an input image, it is possible that no objects or multiple objects afford a specific task. As a reminder, we use the whole verb-pronoun (or verb-noun) description as the token span. With probability 0.5, an image is cropped to a random size, where each side is between 384 and 1333 pixels. Both the student and teacher TOIST models are initialized with the model pre-trained by [4]. In an image, the most suitable objects (one or more) for solving the task are selected, and their bounding boxes are taken as ground-truth labels for detection.


07211688a0869d995947a8fb11b215d6-AuthorFeedback.pdf

Neural Information Processing Systems

We thank all the anonymous reviewers for their constructive feedback. We address each comment as follows. R1-Q2: Just using the predicted mask to concat. R1-Q3: Refine the predicted mask with CRF. SEAM shows that CRF (vs. CONT A) is only effective in the first round, i


The Good, The Bad, and The Hybrid: A Reward Structure Showdown in Reasoning Models Training

Sahoo, Subramanyam

arXiv.org Artificial Intelligence

Reward design is central to reinforcement learning from human feedback (RLHF) and alignment research. In this work, we propose a unified framework to study hard, continuous, and hybrid reward structures for fine-tuning large language models (LLMs) on mathematical reasoning tasks. Using Qwen3-4B with LoRA fine-tuning on the GSM8K dataset, we formalize and empirically evaluate reward formulations that incorporate correctness, perplexity, reasoning quality, and consistency. We introduce an adaptive hybrid reward scheduler that transitions between discrete and continuous signals, balancing exploration and stability. Our results show that hybrid reward structures improve convergence speed and training stability over purely hard or continuous approaches, offering insights for alignment via adaptive reward modeling.
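The adaptive hybrid scheduler described above can be sketched as a simple interpolation between a discrete correctness signal and a dense quality signal. This is an illustrative sketch only, not the paper's implementation: the function names, the linear annealing schedule, and the `quality` score (standing in for the perplexity/reasoning/consistency terms) are all assumptions.

```python
# Hypothetical sketch of an adaptive hybrid reward scheduler.
# The linear schedule and all names here are illustrative assumptions,
# not the paper's code.

def hard_reward(correct: bool) -> float:
    """Discrete signal: 1.0 for a correct final answer, else 0.0."""
    return 1.0 if correct else 0.0

def continuous_reward(quality: float) -> float:
    """Dense signal, e.g. a perplexity- or consistency-based score in [0, 1]."""
    return max(0.0, min(1.0, quality))

def hybrid_reward(correct: bool, quality: float, step: int, total_steps: int) -> float:
    """Interpolate from discrete to continuous as training progresses.

    Early training leans on the hard signal (exploration toward correct
    answers); later training leans on the dense signal (stability).
    """
    alpha = step / max(1, total_steps)  # 0 -> purely discrete, 1 -> purely continuous
    return (1.0 - alpha) * hard_reward(correct) + alpha * continuous_reward(quality)
```

A scheduler like this keeps early gradients anchored to verifiable correctness while gradually admitting the smoother signal, which is one plausible reading of how a hybrid scheme balances exploration and stability.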


DRL-Based Resource Allocation for Energy-Efficient IRS-Assisted UAV Spectrum Sharing Systems

Wang, Yiheng

arXiv.org Artificial Intelligence

Intelligent reflecting surface (IRS)-assisted unmanned aerial vehicle (UAV) systems provide a new paradigm for reconfigurable and flexible wireless communications. To enable more energy-efficient and spectrum-efficient IRS-assisted UAV wireless communications, this paper introduces a novel IRS-assisted, UAV-enabled spectrum sharing system with orthogonal frequency-division multiplexing (OFDM). The goal is to maximize the energy efficiency (EE) of the secondary network by jointly optimizing the beamforming, subcarrier allocation, IRS phase shifts, and the UAV trajectory, subject to practical transmit power and passive reflection constraints as well as UAV physical limitations. A physically grounded propulsion-energy model is adopted, with its tight upper bound used to form a tractable EE lower bound for the spectrum sharing system. To handle the highly non-convex, time-coupled optimization problem with a mixed continuous and discrete policy space, we develop a deep reinforcement learning (DRL) approach based on the actor-critic framework. Extensive experiments show the significant EE improvement of the proposed DRL-based approach compared to several benchmark schemes, demonstrating its effectiveness and robustness under mobility.
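The mixed continuous/discrete policy space mentioned above is the distinctive implementation wrinkle: the actor must emit bounded continuous actions (transmit power, IRS phase shifts) alongside a discrete choice (subcarrier allocation). A minimal sketch of one common parameterization follows; the dimensions (`N_SUB`, `N_IRS`), the squashing choices, and the greedy discrete selection are all assumptions for illustration, not the paper's design.

```python
import numpy as np

# Illustrative sketch: map a raw actor output vector to a mixed action
# for an IRS-assisted UAV system. Dimensions and names are assumed.
N_SUB = 4   # subcarriers (discrete allocation)
N_IRS = 8   # IRS elements (continuous phase shifts)

def split_mixed_action(raw: np.ndarray, p_max: float = 1.0):
    """Split raw actor outputs into (tx power, IRS phases, subcarrier index)."""
    assert raw.shape == (1 + N_IRS + N_SUB,)
    power = p_max / (1.0 + np.exp(-raw[0]))       # sigmoid squash -> (0, p_max)
    phases = np.pi * np.tanh(raw[1:1 + N_IRS])    # tanh squash -> (-pi, pi)
    sub_logits = raw[1 + N_IRS:]
    subcarrier = int(np.argmax(sub_logits))       # greedy discrete choice
    return power, phases, subcarrier
```

In practice the discrete head would typically be sampled from a softmax during training rather than taken greedily; the squashed-output pattern is what keeps the continuous actions inside their physical constraints.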


Stop-RAG: Value-Based Retrieval Control for Iterative RAG

Park, Jaewan, Cho, Solbee, Lee, Jay-Yoon

arXiv.org Artificial Intelligence

Iterative retrieval-augmented generation (RAG) enables large language models to answer complex multi-hop questions, but each additional loop increases latency, costs, and the risk of introducing distracting evidence, motivating the need for an efficient stopping strategy. Existing methods either use a predetermined number of iterations or rely on confidence proxies that poorly reflect whether more retrieval will actually help. We cast iterative RAG as a finite-horizon Markov decision process and introduce Stop-RAG, a value-based controller that adaptively decides when to stop retrieving. Trained with full-width forward-view Q($λ$) targets from complete trajectories, Stop-RAG learns effective stopping policies while remaining compatible with black-box APIs and existing pipelines. On multi-hop question-answering benchmarks, Stop-RAG consistently outperforms both fixed-iteration baselines and prompting-based stopping with LLMs. These results highlight adaptive stopping as a key missing component in current agentic systems, and demonstrate that value-based control can improve the accuracy of RAG systems.
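The "full-width forward-view Q($λ$) targets from complete trajectories" can be computed with the standard backward recursion over λ-returns, G_t = r_t + γ[(1−λ)V(s_{t+1}) + λG_{t+1}]. The sketch below assumes the trajectory is complete (terminal bootstrap value 0) and is an illustration of that recursion, not Stop-RAG's actual training code.

```python
def lambda_returns(rewards, values, gamma=1.0, lam=0.9):
    """Forward-view Q(lambda) targets from a complete trajectory.

    rewards[t] is the reward after step t; values[t] is the critic's
    bootstrap estimate V(s_{t+1}), with 0.0 at the terminal state.
    Implements G_t = r_t + gamma * ((1-lam)*V(s_{t+1}) + lam*G_{t+1}).
    """
    assert len(rewards) == len(values)
    targets = [0.0] * len(rewards)
    g = 0.0  # no return beyond the terminal step
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * ((1.0 - lam) * values[t] + lam * g)
        targets[t] = g
    return targets
```

With λ = 1 this reduces to plain Monte Carlo returns; λ < 1 blends in the critic's intermediate value estimates, which is what lets the controller learn whether another retrieval loop is expected to help.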