Iterative Self-Incentivization Empowers Large Language Models as Agentic Searchers

Neural Information Processing Systems

Large language models (LLMs) have been widely integrated into information retrieval to advance traditional techniques. However, effectively enabling LLMs to seek accurate knowledge in complex tasks remains a challenge due to the complexity of multi-hop queries and the presence of irrelevant retrieved content. To address these limitations, we propose ExSearch, an agentic search framework in which the LLM learns to retrieve useful information as the reasoning unfolds through a self-incentivized process. At each step, the LLM decides what to retrieve (thinking), triggers an external retriever (search), and extracts fine-grained evidence (recording) to support next-step reasoning. To equip the LLM with this capability, we adopt a Generalized Expectation-Maximization algorithm.
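The thinking, search, and recording steps described in the abstract can be pictured as a simple loop. The sketch below is only an illustration under stated assumptions: the function `agentic_search`, the toy callables `toy_think`, `toy_search`, and `toy_record`, and the two-entry corpus are all hypothetical stand-ins, not the ExSearch implementation, and the Generalized Expectation-Maximization training is not shown.

```python
from typing import Callable, List, Optional


def agentic_search(
    question: str,
    think: Callable[[str, List[str]], Optional[str]],
    search: Callable[[str], List[str]],
    record: Callable[[str, List[str]], str],
    max_steps: int = 4,
) -> List[str]:
    """Run a think -> search -> record loop and return the recorded evidence."""
    notes: List[str] = []
    for _ in range(max_steps):
        query = think(question, notes)       # thinking: decide what to retrieve
        if query is None:                    # the thinker signals it has enough evidence
            break
        docs = search(query)                 # search: call the external retriever
        notes.append(record(query, docs))    # recording: keep fine-grained evidence
    return notes


# Hypothetical toy components standing in for the LLM and the retriever.
CORPUS = {
    "director of Inception": "Inception was directed by Christopher Nolan.",
    "birthplace of Christopher Nolan": "Christopher Nolan was born in London.",
}
DISTRACTOR = "The Thames flows through London."


def toy_think(question: str, notes: List[str]) -> Optional[str]:
    plan = list(CORPUS)                      # fixed two-hop plan for the toy question
    return plan[len(notes)] if len(notes) < len(plan) else None


def toy_search(query: str) -> List[str]:
    # Returns one irrelevant document alongside the relevant one.
    return [DISTRACTOR, CORPUS[query]]


def _tokens(text: str) -> set:
    return set(text.lower().replace(".", "").split())


def toy_record(query: str, docs: List[str]) -> str:
    # Keep the document that shares the most words with the query.
    return max(docs, key=lambda d: len(_tokens(d) & _tokens(query)))


notes = agentic_search(
    "Where was the director of Inception born?", toy_think, toy_search, toy_record
)
```

In this toy run the recording step filters out the distractor at each hop, so the second query can build on the evidence from the first.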


How Benchmark Prediction from Fewer Data Misses the Mark

Neural Information Processing Systems

Large language model (LLM) evaluation is increasingly costly, prompting interest in methods that speed up evaluation by shrinking benchmark datasets. Benchmark prediction (also called efficient LLM evaluation) aims to select a small subset of evaluation points and predict overall benchmark performance from that subset. In this paper, we systematically assess the strengths and limitations of 11 benchmark prediction methods across 19 diverse benchmarks. First, we identify a highly competitive baseline: Take a random sample and fit a regression model on the sample to predict missing entries. Outperforming most existing methods, this baseline challenges the assumption that careful subset selection is necessary for benchmark prediction.
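A minimal sketch of the random-sample-plus-regression baseline mentioned in the abstract may help make it concrete. Several details here are assumptions rather than the paper's setup: the function name `random_sample_regression`, the use of ridge-regularized least squares as the regression model, the prediction target (mean score over all items), and the synthetic score matrix.

```python
import numpy as np


def random_sample_regression(source_scores, target_subset_scores, subset_idx, ridge=1e-3):
    """Predict a target model's mean benchmark score from a subset of items.

    source_scores: (n_models, n_items) per-item scores of fully evaluated models.
    target_subset_scores: the target model's scores on the items in subset_idx.
    """
    X = source_scores[:, subset_idx]                  # features: scores on the sampled items
    y = source_scores.mean(axis=1)                    # label: full-benchmark mean score
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])     # append an intercept column
    A = X1.T @ X1 + ridge * np.eye(X1.shape[1])       # ridge-regularized normal equations
    w = np.linalg.solve(A, X1.T @ y)
    x_new = np.append(np.asarray(target_subset_scores, dtype=float), 1.0)
    return float(x_new @ w)


# Synthetic demo (assumed data): 40 models, 200 binary-scored items, 20 sampled items.
rng = np.random.default_rng(0)
n_models, n_items, k = 40, 200, 20
ability = rng.uniform(0.2, 0.9, size=n_models)
scores = (rng.random((n_models, n_items)) < ability[:, None]).astype(float)

subset_idx = rng.choice(n_items, size=k, replace=False)   # the random sample of items
source, target = scores[:-1], scores[-1]                  # hold out the last model
pred = random_sample_regression(source, target[subset_idx], subset_idx)
true_mean = float(target.mean())                          # what the prediction approximates
```

The only selection step is `rng.choice`, which is the point of the baseline: no careful subset selection is involved, and all the modeling sits in the regression fit.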


A Closer Look at TabPFN v2: Understanding Its Strengths and Extending Its Capabilities

Neural Information Processing Systems

Tabular datasets are inherently heterogeneous, presenting significant challenges for developing pre-trained foundation models. The recently introduced transformer-based Tabular Prior-data Fitted Network v2 (TabPFN v2) achieves unprecedented performance across diverse downstream datasets, marking a pivotal advancement in tabular foundation models. In this paper, we take a closer look at TabPFN v2 to examine how it effectively handles heterogeneity and achieves high predictive accuracy, and to explore how its limitations in high-dimensional, many-category, and large-scale tasks can be mitigated. We find that TabPFN v2 can infer attribute relationships even when provided with randomized attribute token inputs, eliminating the need to explicitly learn dataset-specific attribute embeddings to address heterogeneity. We further show that TabPFN v2 can be transformed into a feature extractor, revealing its ability to construct a highly separable feature space for accurate predictions. Lastly, we demonstrate that TabPFN v2's limitations can be addressed through a test-time divide-and-conquer strategy, enabling scalable inference without requiring re-training. By uncovering the mechanisms behind TabPFN v2's success and introducing strategies to extend its applicability, this study offers key insights into the design of future tabular foundation models.


Greedy Algorithms for Structured Bandits: A Sharp Characterization of Asymptotic Success / Failure

Neural Information Processing Systems

We study the greedy (exploitation-only) algorithm in bandit problems with a known reward structure. We allow arbitrary finite reward structures, while prior work focused on a few specific ones. We fully characterize when the greedy algorithm asymptotically succeeds or fails, in the sense of sublinear vs. linear regret as a function of time. Our characterization identifies a partial identifiability property of the problem instance as the necessary and sufficient condition for the asymptotic success. Notably, once this property holds, the problem becomes easy: any algorithm will succeed (in the same sense as above), provided it satisfies a mild non-degeneracy condition. Our characterization extends to contextual bandits and interactive decision-making with arbitrary feedback. Examples demonstrating broad applicability and extensions to infinite reward structures are provided.
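To make the setting concrete, here is a small sketch of an exploitation-only learner over a finite reward structure. It is an illustration under assumptions, not the paper's formal model: the function `greedy_structured_bandit`, the Gaussian noise, the least-squares model fit, the tie-breaking toward the first candidate, and the two toy instances are all choices made for this example.

```python
import numpy as np


def greedy_structured_bandit(candidates, true_means, horizon, noise=0.1, seed=0):
    """Greedy play: fit the best candidate reward vector, then pull its best arm.

    candidates: (n_models, n_arms) array, the known finite reward structure.
    true_means: (n_arms,) mean rewards of the actual instance.
    Returns the cumulative (pseudo-)regret after `horizon` rounds.
    """
    rng = np.random.default_rng(seed)
    candidates = np.asarray(candidates, dtype=float)
    true_means = np.asarray(true_means, dtype=float)
    n_arms = candidates.shape[1]
    counts = np.zeros(n_arms)
    sums = np.zeros(n_arms)
    regret = 0.0
    for _ in range(horizon):
        # Empirical mean per arm (0 for arms never pulled; those get weight 0 below).
        emp = np.divide(sums, counts, out=np.zeros(n_arms), where=counts > 0)
        # Count-weighted squared error: the Gaussian maximum-likelihood fit per candidate.
        errs = ((candidates - emp) ** 2 * counts).sum(axis=1)
        model = int(np.argmin(errs))            # ties go to the first candidate
        arm = int(np.argmax(candidates[model])) # exploit: no exploration at all
        reward = true_means[arm] + noise * rng.normal()
        counts[arm] += 1
        sums[arm] += reward
        regret += true_means.max() - true_means[arm]
    return regret


horizon = 200

# Instance 1: the two candidates agree on arm 0, so pulling arm 0 never tells them apart.
stuck_regret = greedy_structured_bandit(
    candidates=[[0.5, 0.0], [0.5, 1.0]], true_means=[0.5, 1.0], horizon=horizon
)

# Instance 2: the candidates differ on arm 0, so pulling arm 0 reveals the true model.
ok_regret = greedy_structured_bandit(
    candidates=[[0.2 + 0.3, 0.0], [0.2, 1.0]], true_means=[0.2, 1.0], horizon=horizon
)
```

In the first instance the greedy learner keeps pulling arm 0 forever, losing 0.5 per round, because its observations never contradict the wrong candidate. In the second instance the same learner is corrected by its own data after a few pulls. The contrast is meant only as intuition for why identifiability of the instance from greedy play matters.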


DIFFSSR: Stereo Image Super-resolution Using Differential Transformer

Neural Information Processing Systems

In the field of computer vision, the task of stereo image super-resolution (StereoSR) has garnered significant attention due to its potential applications in augmented reality, virtual reality, and autonomous driving. Traditional Transformer-based models, while powerful, often suffer from attention noise, leading to suboptimal reconstruction of super-resolved images. This paper introduces DIFFSSR, a novel neural network architecture designed to address these challenges. We introduce the Diff Cross Attention Block (DCAB) and the Sliding Stereo Cross-Attention Module (SSCAM) to enhance feature integration and mitigate the impact of attention noise.


Long-tailed Recognition with Model Rebalancing

Neural Information Processing Systems

Long-tailed recognition is ubiquitous and challenging in deep learning and even in the downstream finetuning of foundation models, since the skewed class distribution generally prevents the model from generalizing to the tail classes. Despite the promise of previous methods based on data augmentation, loss rebalancing, and decoupled training, consistent improvement in broad scenarios such as multi-label long-tailed recognition remains difficult. In this study, we examine the impact of model capacity in the long-tailed context and propose a novel framework, Model Rebalancing (MORE), which mitigates imbalance by directly rebalancing the model's parameter space. Specifically, MORE introduces a low-rank parameter component to mediate the parameter space allocation, guided by a tailored loss and a sinusoidal reweighting schedule, without increasing the overall model complexity or inference costs. Extensive experiments on diverse long-tailed benchmarks, spanning multi-class and multi-label tasks, demonstrate that MORE significantly improves generalization, particularly for tail classes, and effectively complements existing imbalance mitigation methods.


Better Tokens for Better 3D: Advancing Vision-Language Modeling in 3D Medical Imaging

Neural Information Processing Systems

Recent progress in vision-language modeling for 3D medical imaging has been fueled by large-scale computed tomography (CT) corpora with paired free-text reports, stronger architectures, and powerful pretrained models. This has enabled applications such as automated report generation and text-conditioned 3D image synthesis. Yet, current approaches struggle with high-resolution, long-sequence volumes: contrastive pretraining often yields vision encoders that are misaligned with clinical language, and slice-wise tokenization blurs fine anatomy, reducing diagnostic performance on downstream tasks. We introduce BTB3D (Better Tokens for Better 3D), a causal convolutional encoder-decoder that unifies 2D and 3D training and inference while producing compact, frequency-aware volumetric tokens. A three-stage training curriculum enables (i) local reconstruction, (ii) overlapping-window tiling, and (iii) long-context decoder refinement, during which the model learns from short slice excerpts yet generalizes to scans exceeding 300 slices without additional memory overhead. BTB3D sets a new state-of-the-art on two key tasks: it improves BLEU scores and increases clinical F1 by 40% over CT2Rep, CT-CHAT, and Merlin for report generation; and it reduces FID by 75% and halves FVD compared to GenerateCT and MedSyn for text-to-CT synthesis, producing anatomically consistent 512×512×241 volumes. These results confirm that precise three-dimensional tokenization, rather than larger language backbones alone, is essential for scalable vision-language modeling in 3D medical imaging.


Automaton Constrained Q-Learning

Neural Information Processing Systems

Real-world robotic tasks often require agents to achieve sequences of goals while respecting time-varying safety constraints. However, standard Reinforcement Learning (RL) paradigms are fundamentally limited in these settings. A natural approach to these problems is to combine RL with Linear-time Temporal Logic (LTL), a formal language for specifying complex, temporally extended tasks and safety constraints.


Smooth Sailing: Lipschitz-Driven Uncertainty Quantification for Spatial Associations

Neural Information Processing Systems

Estimating associations between spatial covariates and responses -- rather than merely predicting responses -- is central to environmental science, epidemiology, and economics. For instance, public health officials might be interested in whether air pollution has a strictly positive association with a health outcome, and the magnitude of any effect. Standard machine learning methods often provide accurate predictions but offer limited insight into covariate-response relationships. We also show that existing methods for constructing confidence (or credible) intervals for associations can fail to provide nominal coverage in the face of model misspecification and nonrandom locations -- despite both being essentially always present in spatial problems. We introduce a method that constructs valid frequentist confidence intervals for associations in spatial settings. Our method requires minimal assumptions beyond a form of spatial smoothness and a homoskedastic Gaussian error assumption. In particular, we do not require model correctness or covariate overlap between training and target locations. Our approach is the first to guarantee nominal coverage in this setting and outperforms existing techniques in both real and simulated experiments. Our confidence intervals are valid in finite samples when the noise variance of the Gaussian error is known, and we provide an asymptotically consistent estimation procedure for this noise variance when it is unknown.


QuestBench: Can LLMs ask the right question to acquire information in reasoning tasks?

Neural Information Processing Systems

Large language models (LLMs) have shown impressive performance on reasoning benchmarks like math and logic. While many works have largely assumed well-defined tasks, real-world queries are often underspecified and only solvable by acquiring missing information.