Saccade Fixation Reiteration with Mamba for Referring Image Segmentation

Neural Information Processing Systems

Referring Image Segmentation (RIS) aims to segment the target object in an image given a natural language expression. While recent methods leverage pre-trained vision backbones and larger training corpora to achieve impressive results, they predominantly focus on simple expressions: short, clear noun phrases like "red car" or "left girl". This simplification often reduces RIS to a keyword/concept matching problem, limiting the model's ability to handle referential ambiguity in expressions. In this work, we identify two challenging real-world scenarios: object-distracting expressions, which involve multiple entities with contextual cues, and category-implicit expressions, where the object class is not explicitly stated. To address these challenges, we propose a novel framework, SaFiRe, which mimics the human two-phase cognitive process: first forming a global understanding, then refining it through detail-oriented inspection. This is naturally supported by Mamba's scan-then-update property, which aligns with our phased design and enables efficient multi-cycle refinement with linear complexity. We further introduce aRefCOCO, a new benchmark designed to evaluate RIS models under ambiguous referring expressions. Extensive experiments on both standard and proposed datasets demonstrate the superiority of SaFiRe over state-of-the-art baselines.


Mysterious Amazonian 'ghost dog' caught on camera

Popular Science

This wild short-eared canine is not your average pup. The short-eared dog spotted by a camera trap in Bolivia.


Shaping Sequence Attractor Schema in Recurrent Neural Networks

Neural Information Processing Systems

Sequence schemas are abstract, reusable knowledge structures that facilitate rapid adaptation and generalization in novel sequential tasks. In both animals and humans, shaping is an efficient way to acquire such schemas, particularly in complex sequential tasks. As a form of curriculum learning, shaping works by progressively advancing from simple subtasks to integrated full sequences, and ultimately enabling generalization across different task variations. Despite the importance of schemas in cognition and shaping in schema acquisition, the underlying neural dynamics at play remain poorly understood. To explore this, we train recurrent neural networks on an odor-sequence task using a shaping protocol inspired by well-established paradigms in experimental neuroscience. Our model provides the first systematic reproduction of key features of schema learning observed in the orbitofrontal cortex, including rapid adaptation to novel tasks, structured neural representation geometry, and progressive dimensionality compression during learning. Crucially, analysis of the trained RNN reveals that the learned schema is implemented through sequence attractors. These attractor dynamics emerge gradually through the shaping process: starting with isolated discrete attractors in simple tasks, evolving into linked sequences, and eventually abstracting into generalizable attractors that capture shared task structure. Moreover, applying our method to a keyword spotting task shows that shaping facilitates the rapid development of sequence attractor schemas, leading to enhanced learning efficiency.


STITCH-OPE: Trajectory Stitching with Guided Diffusion for Off-Policy Evaluation

Neural Information Processing Systems

Off-policy evaluation (OPE) estimates the performance of a target policy using offline data collected from a behavior policy, and is crucial in domains such as robotics or healthcare where direct interaction with the environment is costly or unsafe. Existing OPE methods are ineffective for high-dimensional, long-horizon problems, due to exponential blow-ups in variance from importance weighting or compounding errors from learned dynamics models. To address these challenges, we propose STITCH-OPE, a model-based generative framework that leverages denoising diffusion for long-horizon OPE in high-dimensional state and action spaces. Starting with a diffusion model pre-trained on the behavior data, STITCH-OPE generates synthetic trajectories from the target policy by guiding the denoising process using the score function of the target policy. STITCH-OPE introduces two technical innovations that make it advantageous for OPE: (1) it prevents over-regularization by subtracting the score of the behavior policy during guidance, and (2) it generates long-horizon trajectories by stitching partial trajectories together end-to-end. We provide a theoretical guarantee that under mild assumptions, these modifications result in an exponential reduction in variance versus long-horizon trajectory diffusion.
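The variance blow-up from importance weighting mentioned above can be illustrated with a small calculation. The sketch below is not part of STITCH-OPE; it assumes, purely for illustration, state-independent policies over a finite action set, so that the per-step importance ratios are i.i.d. and the variance of their product has a closed form. The function name and the example policies are hypothetical.

```python
def is_weight_variance(target, behavior, horizon):
    """Exact variance of a product of `horizon` i.i.d. per-step importance ratios.

    Assumption (illustrative only): both policies ignore the state, so each
    step contributes an independent ratio rho = target(a) / behavior(a) with
    a drawn from the behavior policy.
    """
    # Second moment of one ratio under the behavior policy:
    # E[rho^2] = sum_a behavior(a) * (target(a) / behavior(a))^2
    second_moment = sum(p * p / b for p, b in zip(target, behavior))
    # Each ratio has mean 1, so Var[product] = E[rho^2]^horizon - 1.
    return second_moment ** horizon - 1.0


behavior = [0.5, 0.5]  # hypothetical behavior policy over two actions
target = [0.9, 0.1]    # hypothetical target policy over the same actions

for horizon in (1, 10, 50):
    print(horizon, is_weight_variance(target, behavior, horizon))
```

Under these assumptions the variance grows geometrically with the horizon whenever the two policies differ, and it is zero when they coincide.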


GPLQ: A General, Practical, and Lightning QAT Method for Vision Transformers

Neural Information Processing Systems

Vision Transformers (ViTs) are essential in computer vision but are also computationally intensive. Model quantization, particularly to low bit-widths like 4-bit, aims to alleviate this difficulty, yet existing Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT) methods exhibit significant limitations. PTQ often incurs a substantial accuracy drop, while QAT achieves high accuracy but suffers from prohibitive computational costs, limited generalization to downstream tasks, training instability, and the lack of an open-source codebase. To address these challenges, this paper introduces General, Practical, and Lightning Quantization (GPLQ), a novel framework designed for efficient and effective ViT quantization. GPLQ is founded on two key empirical insights: the paramount importance of activation quantization and the necessity of preserving the model's original optimization "basin" to maintain generalization. Consequently, GPLQ employs a sequential "activation-first, weights-later" strategy. Stage 1 keeps weights in FP32 while quantizing activations with a feature-mimicking loss in only 1 epoch, keeping the model in the same "basin" and thereby preserving generalization.
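To make "quantizing activations to 4-bit" concrete, here is a minimal sketch of generic uniform (min/max) fake quantization applied to a list of activation values. It is not the GPLQ quantizer, whose details are not given here; the function name and the min/max range choice are assumptions made for illustration.

```python
def fake_quantize(values, num_bits=4):
    """Round each value to the nearest of 2**num_bits evenly spaced levels.

    Assumption (illustrative only): the range is set by the min and max of
    the input, which is one common choice among many.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant input needs no quantization grid.
        return list(values)
    steps = 2 ** num_bits - 1          # 15 steps give 16 levels for 4-bit
    scale = (hi - lo) / steps          # spacing between adjacent levels
    # Map to an integer level index, then back to a floating-point value.
    return [lo + round((v - lo) / scale) * scale for v in values]


activations = [-1.3, -0.2, 0.0, 0.4, 0.75, 2.1]  # hypothetical activations
print(fake_quantize(activations))
```

Each output differs from its input by at most half a level spacing, which is the rounding error a quantization-aware method has to absorb during training.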


Towards Evaluating Proactive Risk Awareness of Multimodal Language Models

Neural Information Processing Systems

Human safety awareness gaps often prevent the timely recognition of everyday risks. In solving this problem, a proactive safety artificial intelligence (AI) system would work better than a reactive one. Instead of just reacting to users' questions, it would actively watch people's behavior and their environment to detect potential dangers in advance. Our Proactive Safety Bench (PaSBench) evaluates this capability through 416 multimodal scenarios (128 image sequences, 288 text logs) spanning 5 safety-critical domains. Evaluation of 36 advanced models reveals fundamental limitations: top performers like Gemini-2.5-pro


Tighter CMI-Based Generalization Bounds via Stochastic Projection and Quantization

Neural Information Processing Systems

In this paper, we leverage stochastic projection and lossy compression to establish new conditional mutual information (CMI) bounds on the generalization error of statistical learning algorithms. It is shown that these bounds are generally tighter than the existing ones. In particular, we prove that for certain problem instances for which existing MI and CMI bounds were recently shown in Attias et al. [2024] and Livni [2023] to become vacuous or fail to describe the right generalization behavior, our bounds yield suitable generalization guarantees of the order of O(1/√n), where n is the size of the training dataset. Furthermore, we use our bounds to investigate the problem of data "memorization" raised in those works, which asserts that there are learning problem instances for which, for any learning algorithm with good prediction performance, there exist distributions under which the algorithm must "memorize" a large fraction of the training dataset. We show that for every learning algorithm, there exists an auxiliary algorithm that does not memorize and which yields comparable generalization error for any data distribution. In part, this shows that memorization is not necessary for good generalization.


Revisiting 1-peer exponential graph for enhancing decentralized learning efficiency

Neural Information Processing Systems

For communication-efficient decentralized learning, it is essential to employ dynamic graphs designed to improve the expected spectral gap by reducing deviations from global averaging. The 1-peer exponential graph demonstrates its finite-time convergence property (achieved by maximizing the expected spectral gap), but only when the number of nodes n is a power of two. However, its efficiency across any n and the commutativity of mixing matrices remain unexplored. We delve into the principles underlying the 1-peer exponential graph to explain its efficiency across any n and leverage them to develop new dynamic graphs. We propose two new dynamic graphs: the k-peer exponential graph and the null-cascade graph. Notably, the null-cascade graph achieves finite-time convergence for any n while ensuring commutativity. Our experiments confirm the effectiveness of these new graphs, particularly the null-cascade graph, in most test settings.
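The finite-time property of the 1-peer exponential graph can be checked numerically. The sketch below assumes the commonly used form of the graph: in round k, node i averages its value with the node 2^k positions away (modulo n), with weight 1/2 each, for ceil(log2 n) rounds. The function name is hypothetical, and the new graphs proposed in the paper are not implemented here.

```python
import math


def one_peer_exponential_rounds(values):
    """Run one cycle of 1-peer exponential gossip averaging.

    Assumption (illustrative only): in round k every node i mixes its value
    with that of node (i + 2**k) % n, using weight 1/2 for each.
    """
    n = len(values)
    num_rounds = math.ceil(math.log2(n))
    x = list(values)
    for k in range(num_rounds):
        offset = 2 ** k
        # Every node talks to exactly one peer per round.
        x = [0.5 * x[i] + 0.5 * x[(i + offset) % n] for i in range(n)]
    return x


print(one_peer_exponential_rounds(list(range(8))))  # n is a power of two
print(one_peer_exponential_rounds(list(range(6))))  # n is not a power of two
```

With n = 8 every node holds the exact global average after three rounds, whereas with n = 6 the values still differ from the average, which is the gap the abstract refers to.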


Layer as Puzzle Pieces: Compressing Large Language Models through Layer Concatenation

Neural Information Processing Systems

Large Language Models excel at natural language processing tasks, but their massive size leads to high computational and storage demands. Recent works have sought to reduce their model size through layer-wise structured pruning. However, they tend to ignore retaining the capabilities in the pruned part. In this work, we re-examine structured pruning paradigms and uncover several key limitations: 1) notable performance degradation due to direct layer removal, 2) incompetent linear weight layer aggregation, and 3) the lack of effective post-training recovery mechanisms. To address these limitations, we propose CoMe, including a progressive layer pruning framework with a Concatenation-based Merging technology and a hierarchical distillation post-training process. Specifically, we introduce a channel sensitivity metric that utilizes activation intensity and weight norms for fine-grained channel selection. Subsequently, we employ a concatenation-based layer merging method to fuse the most critical channels across adjacent layers, enabling progressive model size reduction. Finally, we propose a hierarchical distillation protocol that leverages the correspondences between the original and pruned model layers established during pruning, thereby enabling efficient knowledge transfer. Experiments on seven benchmarks show that CoMe achieves state-of-the-art performance; when pruning 30% of LLaMA-2-7b's parameters, the pruned model retains 83% of its original average accuracy.


Doubly Robust Alignment for Large Language Models

Neural Information Processing Systems

While RLHF has demonstrated promising results, many algorithms are highly sensitive to misspecifications in the underlying preference model (e.g., the Bradley-Terry model), the reference policy, or the reward function, resulting in undesirable fine-tuning. To address model misspecification, we propose a doubly robust preference optimization algorithm that remains consistent when either the preference model or the reference policy is correctly specified (without requiring both). Our proposal demonstrates superior and more robust performance than state-of-the-art algorithms, both in theory and in practice.
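For readers unfamiliar with the Bradley-Terry model named above, the following sketch shows the standard form of its preference probability: the chance that response A is preferred over response B is the logistic function of the difference of their rewards. This illustrates only the preference model, not the doubly robust algorithm; the function name and the reward values are hypothetical.

```python
import math


def bradley_terry_prob(reward_a, reward_b):
    """Probability that response A is preferred to response B.

    Standard Bradley-Terry form: sigmoid(reward_a - reward_b).
    Very large negative differences could overflow math.exp; this
    sketch does not guard against that.
    """
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


print(bradley_terry_prob(2.0, 0.0))  # A has the higher reward
print(bradley_terry_prob(0.0, 2.0))  # B has the higher reward
```

If real preferences do not follow this logistic form, the model is misspecified, which is the failure mode the abstract's robustness guarantee is meant to address.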