InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer Interaction

Neural Information Processing Systems

This paper introduces INFANTAGENT-NEXT, a generalist agent capable of interacting with computers in a multimodal manner, encompassing text, images, audio, and video. Unlike existing approaches that either build intricate workflows around a single large model or only provide workflow modularity, our agent integrates tool-based and pure vision agents within a highly modular architecture, enabling different models to collaboratively solve decoupled tasks in a step-by-step manner. Our generality is demonstrated by our ability to evaluate not only pure vision-based real-world benchmarks (i.e., OSWorld), but also more general or tool-intensive benchmarks (e.g., GAIA and SWE-Bench). Specifically, we achieve a 7.27% accuracy gain over Claude-Computer-Use on OSWorld.


From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit

Neural Information Processing Systems

Motivated by the hypothesis that neural network representations encode abstract, interpretable features as linearly accessible, approximately orthogonal directions, sparse autoencoders (SAEs) have become a popular tool in the interpretability literature. However, recent work has demonstrated phenomenology of model representations that lies outside the scope of this hypothesis, showing signatures of hierarchical, nonlinear, and multi-dimensional features. This raises the question: do SAEs represent features that possess structure at odds with their motivating hypothesis? If not, does avoiding this mismatch help identify said features and gain further insights into neural network representations? To answer these questions, we take a construction-based approach and re-contextualize the popular matching pursuit (MP) algorithm from sparse coding to design MP-SAE, an SAE that unrolls its encoder into a sequence of residual-guided steps, allowing it to capture hierarchical and nonlinearly accessible features. Comparing this architecture with existing SAEs on a mixture of synthetic and natural data settings, we show: (i) hierarchical concepts induce conditionally orthogonal features, which existing SAEs are unable to faithfully capture, and (ii) the nonlinear encoding step of MP-SAE recovers highly meaningful features, helping us unravel shared structure in the seemingly dichotomous representation spaces of different modalities in a vision-language model, hence demonstrating that the assumption that useful features are solely linearly accessible is insufficient. We also show that the sequential encoder principle of MP-SAE affords an additional benefit of adaptive sparsity at inference time, which may be of independent interest. Overall, we argue our results provide credence to the idea that interpretability should begin with the phenomenology of representations, with methods emerging from assumptions that fit it.
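As a rough illustration of the residual-guided encoding idea, the following is a minimal sketch of classical matching pursuit from sparse coding, not the MP-SAE architecture itself. It assumes a fixed dictionary of unit-norm atoms (in an SAE the dictionary would be learned), and the names `matching_pursuit`, `atoms`, and `n_steps` are illustrative choices, not taken from the paper.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matching_pursuit(x, dictionary, n_steps):
    """Greedy sparse coding: x is approximated by sum_k codes[k] * dictionary[k].

    `dictionary` is assumed to be a list of unit-norm atoms.
    Returns the coefficient vector and the final residual.
    """
    residual = list(x)
    codes = [0.0] * len(dictionary)
    for _ in range(n_steps):
        # Pick the atom most correlated with what is still unexplained.
        scores = [dot(residual, atom) for atom in dictionary]
        k = max(range(len(dictionary)), key=lambda i: abs(scores[i]))
        if abs(scores[k]) < 1e-12:
            break  # nothing left to explain
        codes[k] += scores[k]
        # Subtract the selected atom's contribution from the residual.
        residual = [r - scores[k] * a for r, a in zip(residual, dictionary[k])]
    return codes, residual


# Hypothetical toy dictionary: two axis-aligned atoms and one diagonal atom.
atoms = [[1.0, 0.0], [0.0, 1.0], [math.sqrt(0.5), math.sqrt(0.5)]]
codes, residual = matching_pursuit([3.0, 1.0], atoms, n_steps=2)
```

Because each step conditions on the residual left by earlier steps, the number of steps can be chosen per input at inference time, which is one way to read the adaptive sparsity mentioned in the abstract.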


Practical Bayes-Optimal Membership Inference Attacks

Neural Information Processing Systems

We develop practical and theoretically grounded membership inference attacks (MIAs) against both independent and identically distributed (i.i.d.) data and graph-structured data. Building on the Bayesian decision-theoretic framework of [1], we derive the Bayes-optimal membership inference rule for node-level MIAs against graph neural networks, addressing key open questions about optimal query strategies in the graph setting. We introduce BASE and G-BASE, tractable approximations of the Bayes-optimal membership inference. G-BASE achieves superior performance compared to previously proposed classifier-based node-level MIAs. BASE, which is also applicable to non-graph data, matches or exceeds the performance of prior state-of-the-art MIAs, such as LiRA and RMIA, at a significantly lower computational cost. Finally, we show that BASE and RMIA are equivalent under a specific hyperparameter setting, providing a principled, Bayes-optimal justification for the RMIA attack.


SongBloom: Coherent Song Generation via Interleaved Autoregressive Sketching and Diffusion Refinement

Neural Information Processing Systems

Generating music with coherent structure and harmonious instrumental and vocal elements remains a significant challenge in song generation. Existing language models and diffusion-based methods often struggle to balance global coherence with local fidelity, resulting in outputs that lack musicality or suffer from incoherent progression and mismatched lyrics. This paper introduces SongBloom, a novel framework for full-length song generation that leverages an interleaved paradigm of autoregressive sketching and diffusion-based refinement. SongBloom employs an autoregressive diffusion model that combines the high fidelity of diffusion models with the scalability of language models. Specifically, it gradually extends a musical sketch from short to long and refines the details from coarse to fine-grained. The interleaved generation paradigm effectively integrates prior semantic and acoustic context to guide the generation process. Experimental results demonstrate that SongBloom outperforms existing methods across both subjective and objective metrics and achieves performance comparable to the state-of-the-art commercial music generation platforms.



Model Provenance Testing for Large Language Models

Neural Information Processing Systems

Large language models are increasingly customized through fine-tuning and other adaptations, creating challenges in enforcing licensing terms and managing downstream impacts such as protecting intellectual property or identifying vulnerabilities. We address this challenge by developing a framework for testing model provenance. Our approach is based on the key observation that real-world model derivations preserve significant similarities in model outputs that can be detected through statistical analysis. Using only black-box access to models, we employ multiple hypothesis testing to compare model similarities against a baseline established by unrelated models. On two comprehensive real-world benchmarks spanning models from 30M to 4B parameters and comprising over 600 models, our tester achieves 90-95% precision and 80-90% recall in identifying derived models. These results demonstrate the viability of systematic provenance verification in production environments even when only API access is available.
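The testing idea can be sketched as follows. This is an illustrative reading under stated assumptions, not the authors' tester: it assumes each candidate parent already has a scalar output-similarity score (for example, agreement rate on shared prompts), uses an empirical one-sided p-value against scores from unrelated models, and applies a Bonferroni correction across candidates. All names and numbers are hypothetical.

```python
def empirical_p_value(observed, baseline):
    """One-sided p-value: share of unrelated-model scores at least as large
    as the observed score, with +1 smoothing so the p-value is never zero."""
    exceed = sum(1 for s in baseline if s >= observed)
    return (exceed + 1) / (len(baseline) + 1)


def provenance_test(similarities, baseline, alpha=0.05):
    """Flag candidate parents whose similarity to the tested model is
    significant after Bonferroni correction over all candidates."""
    threshold = alpha / len(similarities)
    results = {}
    for name, score in similarities.items():
        p = empirical_p_value(score, baseline)
        results[name] = (p, p <= threshold)
    return results


# Hypothetical similarity scores between unrelated model pairs.
baseline = [0.30 + 0.002 * i for i in range(99)]
# Hypothetical similarity of the tested model to two candidate parents.
similarities = {"candidate_a": 0.82, "candidate_b": 0.35}
results = provenance_test(similarities, baseline)
```

In this toy setup `candidate_a` exceeds every baseline score and is flagged, while `candidate_b` falls inside the unrelated-model range and is not.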


Semi-Supervised Regression with Heteroscedastic Pseudo-Labels

Neural Information Processing Systems

Pseudo-labeling is a commonly used paradigm in semi-supervised learning, yet its application to semi-supervised regression (SSR) remains relatively under-explored. Unlike classification, where pseudo-labels are discrete and confidence-based filtering is effective, SSR involves continuous outputs with heteroscedastic noise, making it challenging to assess pseudo-label reliability. As a result, naive pseudo-labeling can lead to error accumulation and overfitting to incorrect labels. To address this, we propose an uncertainty-aware pseudo-labeling framework that dynamically adjusts pseudo-label influence from a bi-level optimization perspective. By jointly minimizing empirical risk over all data and optimizing uncertainty estimates to enhance generalization on labeled data, our method effectively mitigates the impact of unreliable pseudo-labels. We provide theoretical insights and extensive experiments to validate our approach across various benchmark SSR datasets, and the results demonstrate superior robustness and performance compared to existing methods. Our code is available at https://github.com/sxq/HeteroscedasticPseudo-Labels.
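To make the notion of uncertainty-adjusted pseudo-label influence concrete, here is a minimal sketch of the standard heteroscedastic Gaussian negative log-likelihood, in which a per-sample log-variance scales down the squared error. It is a generic textbook loss shown for intuition only; the bi-level optimization described in the abstract is not reproduced here, and the function names are illustrative.

```python
import math


def heteroscedastic_nll(pred, target, log_var):
    """Gaussian negative log-likelihood (up to a constant) with a per-sample
    log-variance: a larger variance shrinks the squared-error term but is
    penalized by the log-variance term."""
    return 0.5 * (math.exp(-log_var) * (pred - target) ** 2 + log_var)


def pseudo_label_loss(preds, pseudo_labels, log_vars):
    """Mean uncertainty-weighted loss over a batch of pseudo-labeled samples."""
    terms = [
        heteroscedastic_nll(p, y, s)
        for p, y, s in zip(preds, pseudo_labels, log_vars)
    ]
    return sum(terms) / len(terms)


# Same prediction error, different assumed uncertainty on the pseudo-label.
confident = heteroscedastic_nll(1.0, 3.0, log_var=0.0)
uncertain = heteroscedastic_nll(1.0, 3.0, log_var=math.log(4.0))
batch = pseudo_label_loss([1.0, 1.0], [3.0, 3.0], [0.0, math.log(4.0)])
```

With identical residuals, the sample assigned the larger variance contributes less to the loss, which is the sense in which an unreliable pseudo-label has reduced influence.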


Continual Learning, Separation, Binding

Neural Information Processing Systems

However, real-world videos typically exist as continuously evolving data streams (e.g., dynamic scenes captured by wearable glasses), necessitating models to continually adapt to shifting data distributions and novel scenarios. Considering the prohibitive computational costs of fine-tuning models on new tasks, usually, a small subset of parameters is updated while the bulk of the model remains frozen. This poses new challenges to existing continual learning frameworks in the context of large multimodal foundation models, i.e., catastrophic forgetting and update conflict. While the foundation models struggle with parameter-efficient continual learning, the hippocampus in the human brain has evolved highly efficient mechanisms for memory formation and consolidation. Inspired by the rapid binding and pattern separation mechanisms in the hippocampus, in this work, we propose Bisecle for video-language continual learning, where a multi-directional supervision module is used to capture more cross-modal relationships and a contrastive prompt learning scheme is designed to isolate task-specific knowledge to facilitate efficient memory storage. Binding and separation processes further strengthen the ability of VLMs to retain complex experiences, enabling robust and efficient continual learning in video understanding tasks. We perform a thorough evaluation of the proposed Bisecle, demonstrating its ability to mitigate forgetting and enhance cross-task generalization on several VideoQA benchmarks.


307f375e35616bbc2861033966b44976-Paper-Conference.pdf

Neural Information Processing Systems

Widely adopted evaluation metrics for sparse-view CT reconstruction, such as Structural Similarity Index Measure and Peak Signal-to-Noise Ratio, prioritize pixel-wise fidelity but often fail to capture the completeness of critical anatomical structures, particularly small or thin regions that are easily missed. ... this limitation, we propose a suite of novel anatomy-aware ...
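A small numerical sketch of why a pixel-wise metric can overlook a thin structure: the example below computes Peak Signal-to-Noise Ratio from its standard definition on a hypothetical flattened image in which a 10-pixel structure is entirely absent from the reconstruction. The image size and intensities are made up for illustration and do not come from the paper.

```python
import math


def psnr(reference, reconstruction, max_value=1.0):
    """Peak Signal-to-Noise Ratio in dB for two equal-length pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical 100x100 image, flattened: background 0.0, thin structure 1.0.
reference = [0.0] * 10000
for i in range(10):
    reference[i] = 1.0  # a 10-pixel thin structure

# Reconstruction that misses the structure completely.
reconstruction = [0.0] * 10000

score = psnr(reference, reconstruction)
```

Here the score is roughly 30 dB even though the only structure in the image has vanished, because the missing pixels make up a tiny fraction of the averaged error.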


RoME: Domain-Robust Mixture of Experts for Solution Prediction across Domains

Neural Information Processing Systems

Mixed-Integer Linear Programming (MILP) is a fundamental and powerful framework for modeling complex optimization problems across diverse domains. Recently, learning-based methods have shown great promise in accelerating MILP solvers by predicting high-quality solutions. However, most existing approaches are developed and evaluated in single-domain settings, limiting their ability to generalize to unseen problem distributions. This limitation poses a major obstacle to building scalable and general-purpose learning-based solvers. To address this challenge, we introduce RoME, a domain-Robust Mixture-of-Experts framework for predicting MILP solutions across domains.