Goto

Collaborating Authors

 Genre


FedWMSAM: Fast and Flat Federated Learning via Weighted Momentum and Sharpness-Aware Minimization

Neural Information Processing Systems

These twin requirements have naturally led to two widely used techniques: client/server momentum to accelerate progress, and sharpness-aware minimization (SAM) to prefer flat solutions. However, simply combining momentum and SAM leaves two structural issues unresolved in non-IIDFL. We identify and formalize two failure modes: local-global curvature misalignment (local SAM directions need not reflect the global loss geometry) and momentum-echo oscillation (late-stage instability caused by accumulated momentum). To our knowledge, these failure modes have not been jointly articulated and addressed in the FL literature. We propose FedWMSAM to address both failure modes. First, we construct a momentum-guided global perturbation from server-aggregated momentum to align clients' SAM directions with the global descent geometry, enabling a singlebackprop SAM approximation that preserves efficiency.


Alternating Gradient Flows: ATheory of Feature Learning in Two-layer Neural Networks

Neural Information Processing Systems

What features neural networks learn, and how, remains an open question. In this paper, we introduce Alternating Gradient Flows (AGF), an algorithmic framework that describes the dynamics of feature learning in two-layer networks trained from small initialization. Prior works have shown that gradient flow in this regime exhibits a staircase-like loss curve, alternating between plateaus where neurons slowly align to useful directions and sharp drops where neurons rapidly grow in norm. AGF approximates this behavior as an alternating two-step process: maximizing a utility function over dormant neurons and minimizing a cost function over active ones. AGF begins with all neurons dormant.


VIDEORFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning

Neural Information Processing Systems

Reinforcement fine-tuning (RFT) has shown great promise in achieving humanlevel reasoning capabilities of Large Language Models (LLMs), and has recently been extended to MLLMs. Nevertheless, reasoning about videos, which is a fundamental aspect of human intelligence, remains a persistent challenge due to the complex logic, temporal and causal structures inherent in video data. To fill this gap, we propose VIDEORFT, a novel approach that extends the RFT paradigm to cultivate human-like video reasoning capabilities in MLLMs. VIDEORFT follows the standard two-stage scheme in RFT: supervised fine-tuning (SFT) with chain-of-thought (CoT) annotations, followed by reinforcement learning (RL) to improve generalization. A central challenge to achieve this in the video domain lies in the scarcity of large-scale, high-quality video CoT datasets. We address this by building a multi-expert-driven, cognition-inspired CoT curation pipeline. First, we devise a cognition-inspired prompting strategy to elicit a reasoning LLM to generate preliminary CoTs based solely on rich, structured, and literal representations of video content. Subsequently, these CoTs are revised by a MLLM conditioned on the actual video, ensuring visual consistency and reducing visual hallucinations.


AVERIMATEC: ADataset for Automatic Verification of Image-Text Claims with Evidence from the Web

Neural Information Processing Systems

Textual claims are often accompanied by images to enhance their credibility and spread on social media, but this also raises concerns about the spread of misinformation. Existing datasets for automated verification of image-text claims remain limited, as they often consist of synthetic claims and lack evidence annotations to capture the reasoning behind the verdict. In this work, we introduce AVERIMATEC, a dataset consisting of 1,297 real-world image-text claims. Each claim is annotated with question-answer (QA) pairs containing evidence from the web, reflecting a decomposed reasoning regarding the verdict. We mitigate common challenges in fact-checking datasets such as contextual dependence, temporal leakage, and evidence insufficiency, via claim normalization, temporally constrained evidence annotation, and a two-stage sufficiency check. We assess the consistency of the annotation in AVERIMATEC via inter-annotator studies, achieving a ฮบ = 0.742 on verdicts and 74.7% consistency on QA pairs. We also propose a novel evaluation method for evidence retrieval and conduct extensive experiments to establish baselines for verifying image-text claims using open-web evidence.


06872e1e6d11baf2ae27285c50132f4f-Paper-Conference.pdf

Neural Information Processing Systems

Large language models (LLMs) suffer from forgetting of upstream knowledge when fine-tuned. Despite efforts on mitigating forgetting, few have investigated how forgotten upstream examples are dependent on newly learned tasks. Insights on such dependencies enable efficient and targeted mitigation of forgetting. In this paper, we empirically analyze forgetting that occurs in N upstream examples of language modeling or instruction-tuning after fine-tuning LLMs on one of M new tasks, visualized in M N matrices. We show that the matrices are often well-approximated with low-rank matrices, indicating the dominance of simple associations between the learned tasks and forgotten upstream examples. Leveraging the analysis, we predict forgetting of upstream examples when fine-tuning LLMs on unseen tasks with matrix completion over the empirical associations. This enables fast identification of most forgotten examples without expensive inference on the entire upstream data. Despite simplicity, the approach outperforms prior approaches that learn semantic relationships of learned tasks and upstream examples with LMs. We demonstrate the practical utility of our analysis by showing statistically significantly reduced forgetting as we upweight predicted examples for replay during fine-tuning.


EfficientNav: Towards On-Device Object-Goal Navigation with Navigation Map Caching and Retrieval

Neural Information Processing Systems

Embodied agents equipped with large language models (LLMs) and online constructed navigation maps can perform ObjNav in a zero-shot manner. However, existing agents heavily rely on giant LLMs on the cloud, e.g., GPT-4, while directly switching to small LLMs, e.g., LLaMA3.2-11b,


BioCG: Constrained Generative Modeling for Biochemical Interaction Prediction

Neural Information Processing Systems

Predicting interactions between biochemical entities is a core challenge in drug discovery and systems biology, often hindered by limited data and poor generalization to unseen entities. Traditional discriminative models frequently underperform in such settings. We propose BioCG (Biochemical Constrained Generation), a novel framework that reformulates interaction prediction as a constrained sequence generation task. BioCG encodes target entities as unique discrete sequences via Iterative Residual Vector Quantization (I-RVQ) and trains a generative model to produce the sequence of an interacting partner given a query entity. A trie-guided constrained decoding mechanism, built from a catalog of valid target sequences, concentrates the model's learning on the critical distinctions between valid biochemical options, ensuring all outputs correspond to an entity within the pre-defined target catalog. An information-weighted training objective further focuses learning on the most critical decision points. BioCG achieves state-of-the-art (SOTA) performance across diverse tasks, Drug-Target Interaction (DTI), Drug-Drug Interaction (DDI), and Enzyme-Reaction Prediction, especially in data-scarce and cold-start conditions.


066c5542795287822f4f076cf330e5a2-Paper-Conference.pdf

Neural Information Processing Systems

Starting work In that from this emplo ef paper ficient ys, we a seed self-impro introduce demonstrations ving DexFlyWheel, cycle w to armup, continuously a De scalable xFlyWheel enrich data e pipeline ing xpands (RL), that the rollout datase integrates trajectory t through Imitation collection, iterati Learning ve cycles.


SEGA: Shaping Semantic Geometry for Robust Hashing under Noisy Supervision

Neural Information Processing Systems

This paper studies the problem of learning hash codes from noisy supervision, which is a practical yet challenging task. This problem is important in extensive real-world applications such as image retrieval and cross-modal retrieval. However, most of the existing methods focus on label denoising to address this problem, but ignore the geometric structure of the hash space, which is critical for learning stable hash codes. Towards this end, this paper proposes a novel framework named Semantic Geometry Shaping (SEGA) that explicitly refines the semantic geometry of hash space. Specifically, we first learn dynamic class prototypes as semantic anchors and cluster hash embeddings around these prototypes to keep structural stability. We then leverage both the energy of predicted distributions and structure-based divergence to estimate the uncertainty of instances and calibrate the supervision in a soft manner. Moreover, we introduce structure-aware interpolation to improve the class boundaries. To verify the effectiveness of our design, we give the theoretical analysis for the proposed framework. Experiments on a range of widely-used retrieval datasets justify the superiority of our SEGA over extensive strong baselines under noisy supervision.


Jacobian-Based Interpretation of Nonlinear Neural Encoding Model

Neural Information Processing Systems

In recent years, the alignment between artificial neural network (ANN) embeddings and blood oxygenation level dependent (BOLD) responses in functional magnetic resonance imaging (fMRI) via neural encoding models has significantly advanced research on neural representation mechanisms and interpretability in the brain. However, these approaches remain limited in characterizing the brain's inherently nonlinear response properties. To address this, we propose the Jacobianbased Nonlinearity Evaluation (JNE), an interpretability metric for nonlinear neural encoding models. JNE quantifies nonlinearity by statistically measuring the dispersion of local linear mappings (Jacobians) from model representations to predicted BOLD responses, thereby approximating the nonlinearity of BOLD signals. Centered on proposing JNE as a novel interpretability metric, we validated its effectiveness through controlled simulation experiments on various activation functions and network architectures, and further verified it on real fMRI data, demonstrating a hierarchical progression of nonlinear characteristics from primary to higher-order visual cortices, consistent with established cortical organization. We further extended JNE with Sample-Specificity (JNE-SS), revealing stimulus-selective nonlinear response patterns in functionally specialized brain regions. As the first interpretability metric for quantifying nonlinear responses, JNE provides new insights into brain information processing.