Performative Validity of Recourse Explanations

Neural Information Processing Systems

When applicants are rejected by a high-stakes algorithmic decision system, recourse explanations offer actionable suggestions on how to change their input features to obtain a positive evaluation. A crucial yet overlooked phenomenon is that recourse explanations are performative: when many applicants act on their recommendations, their collective behavior may shift the data distribution and, once the model is refitted, also the decision boundary. Consequently, the recourse algorithm may render its own recommendations invalid, such that applicants who make the effort of implementing their recommendations may be rejected again when they reapply. In this work, we formally characterize the conditions under which recourse explanations remain valid under their own performative effects. In particular, we prove that recourse actions may become invalid if they are influenced by or if they intervene on non-causal variables. Based on this analysis, we caution against the use of standard counterfactual explanations and causal recourse methods, and instead advocate for recourse methods that recommend actions exclusively on causal variables.


Enhancing Deep Batch Active Learning for Regression with Imperfect Data Guided Selection

Neural Information Processing Systems

Active learning (AL) reduces annotation costs by selecting the most informative samples based on both model sensitivity and predictive uncertainty. While sensitivity can be measured through parameter gradients in an unsupervised manner, predictive uncertainty can hardly be estimated without true labels, especially for regression tasks, which reduces the informativeness of actively selected samples. This paper proposes the concept of auxiliary data to aid uncertainty estimation for regression tasks. With detailed theoretical analysis, we reveal that auxiliary data, despite potential distribution shifts, can provide a promising uncertainty surrogate when properly weighted. This finding inspires our design of AGBAL, a novel AL framework that recalibrates auxiliary data losses through density ratio weighting to obtain reliable uncertainty estimates for sample selection. Extensive experiments show that AGBAL consistently outperforms existing approaches without auxiliary data across diverse synthetic and real-world datasets.
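The idea of recalibrating auxiliary losses by density ratio weighting can be illustrated with a small sketch. This is not AGBAL itself: the abstract does not specify how the density ratio is estimated, so the example assumes both densities are known 1-D Gaussians, and the names `gaussian_pdf` and `weighted_aux_loss` are hypothetical. It only shows how losses measured on shifted auxiliary data can be reweighted toward the target distribution.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian (assumed known here for illustration)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def weighted_aux_loss(xs, losses, target_pdf, aux_pdf):
    """Self-normalized importance-weighted mean of auxiliary losses.

    Each auxiliary sample x gets weight target_pdf(x) / aux_pdf(x), so
    samples that look like target data count more in the estimate.
    """
    weights = [target_pdf(x) / aux_pdf(x) for x in xs]
    total = sum(weights)
    return sum(w * l for w, l in zip(weights, losses)) / total


# Auxiliary data centered at 0, target data centered at 2 (assumed values).
aux = lambda x: gaussian_pdf(x, 0.0, 1.0)
target = lambda x: gaussian_pdf(x, 2.0, 1.0)

xs = [0.0, 2.0]
losses = [1.0, 3.0]
estimate = weighted_aux_loss(xs, losses, target, aux)
```

With no distribution shift (identical densities) the estimate reduces to the plain mean of the losses; under the shift above it moves toward the loss of the sample near the target mean.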


InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding

Neural Information Processing Systems

Modern multimodal large language models (MLLMs) can reason over hour-long video, yet their key-value (KV) cache grows linearly with time, quickly exceeding the fixed memory of phones, AR glasses, and edge robots. Prior compression schemes either assume the whole video and user query are available offline or must first build the full cache, so memory still scales with stream length. InfiniPot-V is the first training-free, query-agnostic framework that enforces a hard, length-independent memory cap for streaming video understanding. During video encoding it monitors the cache and, once a user-set threshold is reached, runs a lightweight compression pass that (i) removes temporally redundant tokens via a Temporal-axis Redundancy (TaR) metric and (ii) keeps semantically significant tokens via Value-Norm (VaN) ranking. Across four open-source MLLMs and four long-video and streaming-video benchmarks, InfiniPot-V cuts peak GPU memory by up to 94%, sustains real-time generation, and matches or surpasses full-cache accuracy, even in multi-turn dialogues. By dissolving the KV cache bottleneck without retraining or query knowledge, InfiniPot-V closes the gap for on-device streaming video assistants.
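The threshold-triggered compression loop described above can be sketched as follows. This is a simplified illustration, not InfiniPot-V's implementation: it covers only a value-norm ranking step (the TaR metric is not defined in the abstract and is omitted), and `compress_cache`, `stream`, `threshold`, and `budget` are hypothetical names chosen for the example.

```python
import math


def value_norm(value):
    """L2 norm of a value vector."""
    return math.sqrt(sum(x * x for x in value))


def compress_cache(cache, budget):
    """Keep the `budget` entries with the largest value norm.

    `cache` is a list of (key, value) pairs in temporal order; the
    surviving entries keep their original order.
    """
    if len(cache) <= budget:
        return list(cache)
    ranked = sorted(range(len(cache)),
                    key=lambda i: value_norm(cache[i][1]),
                    reverse=True)
    keep = sorted(ranked[:budget])
    return [cache[i] for i in keep]


def stream(frames, threshold, budget):
    """Append entries one by one and compress whenever the cache
    reaches `threshold`, so its size never exceeds `threshold`."""
    cache, peak = [], 0
    for key, value in frames:
        cache.append((key, value))
        peak = max(peak, len(cache))
        if len(cache) >= threshold:
            cache = compress_cache(cache, budget)
    return cache, peak


# Toy stream of 100 entries; the cache stays capped regardless of length.
frames = [(i, [float(i % 5), 1.0]) for i in range(100)]
cache, peak = stream(frames, threshold=8, budget=4)
```

The point of the sketch is the memory profile: peak cache size depends on `threshold`, not on the number of frames processed.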


Lost in Transmission: When and Why LLMs Fail to Reason Globally

Neural Information Processing Systems

Despite their many successes, transformer-based large language models (LLMs) continue to struggle with tasks that require complex reasoning over large parts of their input. We argue that these failures arise due to capacity limits on the accurate flow of information within LLMs. To formalize this issue, we introduce the bounded attention prefix oracle (BAPO) model, a new computational framework that models bandwidth constraints on attention heads, the mechanism for internal communication in LLMs. We show that several important reasoning problems like graph reachability require high communication bandwidth for BAPOs to solve; we call these problems BAPO-hard. Our experiments corroborate our theoretical predictions: GPT-4o, Claude, and Gemini succeed on BAPO-easy tasks and fail even on relatively small BAPO-hard tasks. BAPOs also reveal another benefit of chain of thought (CoT): we prove that breaking down a task using CoT can turn any BAPO-hard problem into a BAPO-easy one. Our results offer principled explanations for key LLM failures and suggest directions for architectures and inference methods that mitigate bandwidth limits.


State Size Independent Statistical Error Bound for Discrete Diffusion Models

Neural Information Processing Systems

Diffusion models operating in discrete state spaces have emerged as powerful approaches, demonstrating remarkable efficacy across diverse domains, including reasoning tasks and molecular design. Despite their promising applications, the theoretical foundations of these models remain substantially underdeveloped, with the existing literature predominantly focusing on continuous-state diffusion models. A critical gap persists in the theoretical understanding of discrete diffusion modeling: the absence of a rigorous framework for quantifying estimation error with finite data. Consequently, the fundamental question of how precisely one can reconstruct the true underlying distribution from a limited training set remains unresolved. In this work, we analyze the estimation error induced by score estimation in discrete diffusion models. One of the main difficulties in the analysis stems from the fact that the cardinality of the state space can be exponentially large with respect to its dimension, so a naive approach yields an intractable error bound. To overcome this difficulty, we exploit the property that the state space can be smoothly embedded in a continuous Euclidean space, which enables us to derive a cardinality-independent bound that is more practical in real applications. In particular, we consider a setting where the state space is structured as a hypercube graph, and another where the induced graph Laplacian can be asymptotically well approximated by the ordinary Laplacian defined on the continuous space, and then derive bounds that are independent of the state space size.


MMaDA: Multimodal Large Diffusion Language Models

Neural Information Processing Systems

We introduce MMaDA, a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. The approach is distinguished by three key innovations: (i) MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components. This architecture ensures seamless integration and processing across different data types.


Advancing Wasserstein Convergence Analysis of Score-Based Models: Insights from Discretization and Second-Order Acceleration

Neural Information Processing Systems

Score-based diffusion models have emerged as powerful tools in generative modeling, yet their theoretical foundations remain underexplored. In this work, we focus on the Wasserstein convergence analysis of score-based diffusion models. Specifically, we investigate the impact of various discretization schemes, including Euler discretization, exponential integrators, and midpoint randomization methods. Our analysis provides the first quantitative comparison of these discrete approximations, emphasizing their influence on convergence behavior. Furthermore, we explore scenarios where Hessian information is available and propose an accelerated sampler based on the local linearization method.
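To make the difference between two of the discretization schemes named above concrete, here is a toy sketch on a deterministic semi-linear ODE, dx/dt = -x + g(x). It is an assumption-laden illustration only: the paper analyzes stochastic samplers with a learned score, whereas this example has no noise and uses a hypothetical nonlinear term `g`. Euler discretizes the whole drift, while the exponential integrator solves the linear part exactly and freezes g over each step.

```python
import math


def euler_step(x, h, g):
    """Explicit Euler step for dx/dt = -x + g(x)."""
    return x + h * (-x + g(x))


def exponential_step(x, h, g):
    """Exponential integrator step: the linear part -x is integrated
    exactly, and g(x) is held constant over the step."""
    decay = math.exp(-h)
    return decay * x + (1.0 - decay) * g(x)


def integrate(step, x0, h, n_steps, g):
    """Apply `step` repeatedly starting from x0."""
    x = x0
    for _ in range(n_steps):
        x = step(x, h, g)
    return x


zero = lambda x: 0.0  # purely linear dynamics, exact solution exp(-t)
exact = math.exp(-1.0)
euler_err = abs(integrate(euler_step, 1.0, 0.1, 10, zero) - exact)
expo_err = abs(integrate(exponential_step, 1.0, 0.1, 10, zero) - exact)
```

In the purely linear case the exponential integrator reproduces the exact solution up to floating-point error, while Euler carries a visible discretization error at the same step size.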


Guard Me If You Know Me: Protecting Specific Face-Identity from Deepfakes

Neural Information Processing Systems

Securing personal identity against deepfake attacks is increasingly critical in the digital age, especially for celebrities and political figures whose faces are easily accessible and frequently targeted. Most existing deepfake detection methods focus on general-purpose scenarios and often ignore the valuable prior knowledge of known facial identities, e.g., VIP individuals whose authentic facial data are already available.


HumanCrafter: Synergizing Generalizable Human Reconstruction and Semantic 3D Segmentation

Neural Information Processing Systems

Recent advances in generative models have achieved high fidelity in 3D human reconstruction, yet their utility for specific tasks (e.g., human 3D segmentation) remains constrained. We propose HumanCrafter, a unified framework that enables the joint modeling of appearance and human-part semantics from a single image in a feed-forward manner. Specifically, we integrate human geometric priors in the reconstruction stage and self-supervised semantic priors in the segmentation stage. To address the scarcity of labeled 3D human datasets, we further develop an interactive annotation procedure for generating high-quality data-label pairs. Our pixel-aligned aggregation enables cross-task synergy, while the multi-task objective simultaneously optimizes texture modeling fidelity and semantic consistency.


Feature-Based Instance Neighbor Discovery: Advanced Stable Test-Time Adaptation in Dynamic World

Neural Information Processing Systems

Despite progress, deep neural networks still suffer performance declines under distribution shifts between training and test domains, leading to a substantial decrease in Quality of Experience (QoE) for applications. Existing test-time adaptation (TTA) methods are challenged by dynamic, multiple test distributions within batches. We observe that feature distributions across different domains inherently cluster into distinct groups with varying means and variances. This divergence reveals a critical limitation of previous global normalization strategies in TTA, which inevitably distort the original data characteristics. Based on this insight, we propose Feature-based Instance Neighbor Discovery (FIND), which comprises three key components: Layer-Wise Feature Disentanglement (LFD), Feature-Aware Batch Normalization (FABN), and Selective FABN (S-FABN). LFD stably captures features with similar distributions at each layer by constructing graph structures, while FABN optimally combines source statistics with test-time distribution-specific statistics for robust feature representation. Finally, S-FABN determines which layers require feature partitioning and which can remain unified, thus enhancing the efficiency of inference. Extensive experiments demonstrate that FIND significantly outperforms existing methods, achieving up to approximately 30% accuracy improvement in dynamic scenarios while maintaining computational efficiency.
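The notion of combining source statistics with test-time statistics can be sketched with a simple blended normalization. This is not FABN as defined in the paper: the abstract does not give the combination rule, so the convex mix with coefficient `alpha` is an assumption, and `blended_normalize` is a hypothetical name. The input is taken to be the features of one group of similar test samples.

```python
def mean_var(values):
    """Mean and (biased) variance of a list of scalars."""
    m = sum(values) / len(values)
    v = sum((x - m) ** 2 for x in values) / len(values)
    return m, v


def blended_normalize(values, source_mean, source_var, alpha=0.5, eps=1e-5):
    """Normalize one group of test features with a convex mix of source
    statistics and the group's own statistics.

    alpha = 1 uses source statistics only; alpha = 0 uses only the
    statistics of the current test group (assumed mixing rule).
    """
    test_mean, test_var = mean_var(values)
    mean = alpha * source_mean + (1.0 - alpha) * test_mean
    var = alpha * source_var + (1.0 - alpha) * test_var
    return [(x - mean) / (var + eps) ** 0.5 for x in values]


group = [1.0, 2.0, 3.0]
test_only = blended_normalize(group, source_mean=0.0, source_var=1.0, alpha=0.0)
source_only = blended_normalize(group, source_mean=0.0, source_var=1.0, alpha=1.0)
```

Normalizing each group with its own blended statistics, rather than one global set per batch, is what lets groups with different means and variances keep their relative structure.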