Goto

Collaborating Authors

 Genre


One Stone with Two Birds: ANull-Text-Null Frequency-Aware Diffusion Models for Text-Guided Image Inpainting

Neural Information Processing Systems

Text-guided image inpainting aims at reconstructing the masked regions as per text prompts, where the longstanding challenges lie in the preservation for unmasked regions, while achieving the semantics consistency between unmasked and inpainted masked regions. Previous arts failed to address both of them, always with either of them to be remedied. Such facts, as we observed, stem from the entanglement of the hybrid (e.g., mid-and-low) frequency bands that encode varied image properties, which exhibit different robustness to text prompts during the denoising process. In this paper, we propose a null-text-null frequency-aware diffusion models, dubbed NTN-Diff, for text-guided image inpainting, by decomposing the semantics consistency across masked and unmasked regions into the consistencies as per each frequency band, while preserving the unmasked regions, to circumvent two challenges in a row. Based on the diffusion process, we further divide the denoising process into early (high-level noise) and late (low-level noise) stages, where the mid-and-low frequency bands are disentangled during the denoising process. As observed, the stable mid-frequency band is progressively denoised to be semantically aligned during text-guided denoising process, which, meanwhile, serves as the guidance to the null-text denoising process to denoise low-frequency band for the masked regions, followed by a subsequent text-guided denoising process at late stage, to achieve the semantics consistency for mid-and-low frequency bands across masked and unmasked regions, while preserve the unmasked regions.


MiCo: Multi-image Contrast for Reinforcement Visual Reasoning

Neural Information Processing Systems

This work explores enabling Chain-of-Thought (CoT) reasoning to link visual cues across multiple images. A straightforward solution is to adapt rule-based reinforcement learning for Vision-Language Models (VLMs). However, such methods typically rely on manually curated question-answer pairs, which can be particularly challenging when dealing with fine-grained visual details and complex logic across images. Inspired by self-supervised visual representation learning, we observe that images contain inherent constraints that can serve as supervision. Based on this insight, we construct image triplets comprising two augmented views of the same image and a third, similar but distinct image. During training, the model is prompted to generate a reasoning process to compare these images (i.e., determine same or different).


0fa694fb9f1e265117e8da75966820fe-Paper-Conference.pdf

Neural Information Processing Systems

We consider how to construct state abstractions compatible with a given set of abstract actions, to obtain a well-formed abstract Markov decision process (MDP). We show that the Bellman equation suggests that abstract states should represent distributions over states in the ground MDP; we characterize the conditions under which the resulting process is Markov and approximately model-preserving, derive an algorithm for constructing the abstract MDP, and apply it to visual chain and maze tasks. We generalize these results to the factored actions case, characterize the conditions that lead to factored abstract states, and apply the resulting algorithm to a visual grid and Montezuma's Revenge. These results provide a principled, powerful framework for learning neurosymbolic abstract Markov decision processes.


Trained Mamba Emulates Online Gradient Descent in In-Context Linear Regression

Neural Information Processing Systems

State-space models (SSMs), particularly Mamba, emerge as an efficient Transformer alternative with linear complexity for long-sequence modeling. Recent empirical works demonstrate Mamba's in-context learning (ICL) capabilities competitive with Transformers, a critical capacity for large foundation models. However, theoretical understanding of Mamba's ICL remains limited, restricting deeper insights into its underlying mechanisms. Even fundamental tasks such as linear regression ICL, widely studied as a standard theoretical benchmark for Transformers, have not been thoroughly analyzed in the context of Mamba. To address this gap, we study the training dynamics of Mamba on the linear regression ICL task. By developing novel techniques tackling non-convex optimization with gradient descent related to Mamba's structure, we establish an exponential convergence rate to ICL solution, and derive a loss bound that is comparable to Transformer's. Importantly, our results reveal that Mamba can perform a variant of online gradient descent to learn the latent function in context. This mechanism is different from that of Transformer, which is typically understood to achieve ICL through gradient descent emulation. The theoretical results are verified by experimental simulation.


Beyond Prediction: Managing the Repercussions of Machine Learning Applications

Neural Information Processing Systems

Machine learning models are often designed to maximize a primary goal, such as accuracy. However, as these models are increasingly used to inform decisions that affect people's lives or well-being, it is often unclear what the real-world repercussions of their deployment might be--making it crucial to understand and manage such repercussions effectively. Models maximizing user engagement on social media platforms, e.g., may inadvertently contribute to the spread of misinformation and content that deepens political polarization. This issue is not limited to social media--it extends to other applications where machine learning-informed decisions can have real-world repercussions, such as education, employment, and lending. Existing methods addressing this issue require prior knowledge or estimates of analytical models describing the relationship between a classifier's predictions and their corresponding repercussions. We introduce THEIA, a novel classification algorithm capable of optimizing a primary objective, such as accuracy, while providing high-confidence guarantees about its potential repercussions. Importantly, THEIA solves the open problem of providing such guarantees based solely on existing data with observations of previous repercussions. We prove that it satisfies constraints on a model's repercussions with high confidence and that it is guaranteed to identify a solution, if one exists, given sufficient data. We empirically demonstrate, using real-life data, that THEIA can identify models that achieve high accuracy while ensuring, with high confidence, that constraints on their repercussions are satisfied.


Deep Value Benchmark: Measuring Whether Models Generalize Deep Values or Shallow Preferences

Neural Information Processing Systems

We introduce the Deep Value Benchmark (DVB), an evaluation framework that directly tests whether large language models (LLMs) learn fundamental human values or merely surface-level preferences. This distinction is critical for AI alignment: Systems that capture deeper values are likely to generalize human intentions robustly, while those that capture only superficial patterns in preference data risk producing misaligned behavior. The DVB uses a novel experimental design with controlled confounding between deep values (e.g., moral principles) and shallow features (e.g., superficial attributes). In the training phase, we expose LLMs to human preference data with deliberately correlated deep and shallow features--for instance, where a user consistently prefers (non-maleficence, formal language) options over (justice, informal language) alternatives. The testing phase then breaks these correlations, presenting choices between (justice, formal language) and (non-maleficence, informal language) options.


Diffusion Beats Autoregressive in Data-Constrained Settings

Neural Information Processing Systems

Autoregressive (AR) models have long dominated the landscape of large language models, driving progress across a wide range of tasks. Recently, diffusion-based language models have emerged as a promising alternative, though their advantages over AR models remain underexplored. In this paper, we systematically study masked diffusion models in data-constrained settings--where training involves repeated passes over limited data--and find that they significantly outperform AR models when compute is abundant but data is scarce. Diffusion models make better use of repeated data, achieving lower validation loss and superior downstream performance. We find new scaling laws for diffusion models and derive a closedform expression for the critical compute threshold at which diffusion begins to outperform AR. Finally, we explain why diffusion models excel in this regime: their randomized masking objective implicitly trains over a rich distribution of token orderings, acting as an implicit data augmentation that AR's fixed left-toright factorization lacks. Our results suggest that when data, not compute, is the bottleneck, diffusion models offer a compelling alternative to the standard AR paradigm.


Synthetic Series-Symbol Data Generation for Time Series Foundation Models

Neural Information Processing Systems

Foundation models for time series analysis (TSA) have attracted significant attention. However, challenges such as training data scarcity and imbalance continue to hinder their development. Inspired by complex dynamic system theories, we design a series-symbol data generation mechanism, enabling the unrestricted creation of high-quality time series data paired with corresponding symbolic expressions. To leverage series-symbol data pairs with strong correlations, we develop SymTime, a pre-trained foundation model for enhancing time series representation using symbolic information. SymTime demonstrates competitive performance across five major TSA tasks when fine-tunes with downstream tasks, rivaling foundation models pre-trained on real-world datasets. This approach underscores the potential of series-symbol data generation and pretraining mechanisms in overcoming data scarcity and enhancing task performance.


LLM-Driven Treatment Effect Estimation Under Inference Time Text Confounding

Neural Information Processing Systems

Estimating treatment effects is crucial for personalized decision-making in medicine, but this task faces unique challenges in clinical practice. At training time, models for estimating treatment effects are typically trained on well-structured medical datasets that contain detailed patient information. However, at inference time, predictions are often made using textual descriptions (e.g., descriptions with self-reported symptoms), which are incomplete representations of the original patient information. In this work, we make three contributions.


Accurate KVCache Eviction via Anchor Direction Projection for Efficient LLMInference

Neural Information Processing Systems

Key-Value (KV) cache eviction--which retains the KV pairs of the most important tokens while discarding less important ones--is a critical technique for optimizing both memory usage and inference latency in large language models (LLMs). However, existing approaches often rely on simple heuristics--such as attention weights--to measure token importance, overlooking the spatial relationships between token value states in the vector space. This often leads to suboptimal token selections and thus performance degradation. To tackle this problem, we propose a novel method, namely AnDPro (Anchor Direction Projection), which introduces a projection-based scoring function to more accurately measure token importance. Specifically, AnDPro operates in the space of value vectors and leverages the projections of these vectors onto an "Anchor Direction"--the direction of the pre-eviction output--to measure token importance and guide more accurate token selection. Experiments on 16datasets from the LongBench benchmark demonstrate that AnDPro can maintain 96.07%of the full cache accuracy using only 3.44%KV cache budget, reducing KV cache budget size by 46.0% without compromising quality compared to previous state-of-the-arts.