Uncertainty Quantification for Physics-Informed Neural Networks with Extended Fiducial Inference

Neural Information Processing Systems

Uncertainty quantification (UQ) in scientific machine learning is increasingly critical as neural networks are widely adopted to tackle complex problems across diverse scientific disciplines. For physics-informed neural networks (PINNs), a prominent model in scientific machine learning, uncertainty is typically quantified using Bayesian or dropout methods. However, both approaches suffer from a fundamental limitation: the prior distribution or dropout rate required to construct honest confidence sets cannot be determined without additional information. In this paper, we propose a novel method within the framework of extended fiducial inference (EFI) to provide rigorous uncertainty quantification for PINNs. The proposed method leverages a narrow-neck hyper-network to learn the parameters of the PINN and quantify their uncertainty based on imputed random errors in the observations. This approach overcomes the limitations of Bayesian and dropout methods, enabling the construction of honest confidence sets based solely on observed data. This advancement represents a significant breakthrough for PINNs, greatly enhancing their reliability, interpretability, and applicability to real-world scientific and engineering challenges. Moreover, it establishes a new theoretical framework for EFI, extending its application to large-scale models, eliminating the need for sparse hyper-networks, and significantly improving the automaticity and robustness of statistical inference.


Controlling the Flow: Stability and Convergence for Stochastic Gradient Descent with Decaying Regularization

Neural Information Processing Systems

The present article studies the minimization of convex, $L$-smooth functions defined on a separable real Hilbert space. We analyze regularized stochastic gradient descent (reg-SGD), a variant of stochastic gradient descent that uses a Tikhonov regularization with a time-dependent, vanishing regularization parameter. We prove strong convergence of reg-SGD to the minimum-norm solution of the original problem without additional boundedness assumptions. Moreover, we quantify the rate of convergence and optimize the interplay between step-sizes and regularization decay. Our analysis reveals how vanishing Tikhonov regularization controls the flow of SGD and yields stable learning dynamics, offering new insights into the design of iterative algorithms for convex problems, including those that arise in ill-posed inverse problems.
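
As a rough illustration of the idea (not the paper's algorithm or its optimized schedules), the sketch below runs SGD with a vanishing Tikhonov term on a toy convex problem whose minimizers form the line $x_1 + x_2 = 2$. The objective, the noise model, and the schedules `alpha` and `lam` are all assumptions chosen for the example. With the regularization the iterates drift toward the minimum-norm solution $(1, 1)$, while plain SGD stays near its starting point on the solution line.

```python
import random


def reg_sgd(x0, steps, lam0, seed=0, noise=0.1):
    """SGD on f(x) = 0.5 * (x[0] + x[1] - 2)**2 with a decaying Tikhonov term.

    lam0 = 0.0 recovers plain SGD. All schedules below are illustrative
    assumptions, not the choices analyzed in the paper.
    """
    rng = random.Random(seed)
    x = list(x0)
    for k in range(steps):
        alpha = 0.4 / (k + 1) ** 0.5    # step size (assumed schedule)
        lam = lam0 / (k + 1) ** 0.25    # vanishing regularization (assumed schedule)
        residual = x[0] + x[1] - 2.0    # exact gradient of f is residual * (1, 1)
        for i in range(2):
            noisy_grad = residual + rng.gauss(0.0, noise)
            # Gradient step on f(x) + (lam / 2) * ||x||^2.
            x[i] -= alpha * (noisy_grad + lam * x[i])
    return x


# Start on the solution line, far from the minimum-norm point (1, 1).
reg = reg_sgd([3.0, -1.0], steps=20000, lam0=1.0)
plain = reg_sgd([3.0, -1.0], steps=20000, lam0=0.0)
print("reg-SGD  :", reg)
print("plain SGD:", plain)
```

The regularized run carries a small bias at any finite iteration because `lam` is still positive; that bias shrinks as the regularization decays.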


VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

Neural Information Processing Systems

Recent Multimodal Large Language Models (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis placed on the role of speech in enhancing interaction. However, speech plays a crucial role in multimodal dialogue systems, and achieving high performance in both vision and speech tasks remains a challenge due to the fundamental modality differences. In this paper, we propose a carefully designed multi-stage training methodology that progressively trains the LLM to understand both visual and speech information, ultimately enabling fluent vision and speech interaction. Our approach not only preserves strong vision-language capacity, but also enables efficient speech-to-speech dialogue capabilities without separate ASR and TTS modules, significantly accelerating multimodal end-to-end response speed. By comparing against state-of-the-art counterparts across benchmarks for image, video, and speech, we demonstrate that our omni model is equipped with both strong visual and speech capabilities, enabling omni-modal understanding and interaction.


Weak-shot Keypoint Estimation via Keyness and Correspondence Transfer

Neural Information Processing Systems

Keypoint estimation is a fundamental task in computer vision, but generally requires large-scale annotated data for training. Few-shot and unsupervised keypoint estimation are prevalent economical paradigms, but the former still requires annotations for extensive novel classes while the latter supports only a single class. In this paper, we focus on the task of weak-shot keypoint estimation, where multiple novel classes are learned from unlabeled images with the help of labeled base classes. The key problem is what to transfer from base classes to novel classes, and we propose to transfer keyness and correspondence, which are essentially comparisons between entities and are thus class-agnostic and transferable across classes. Keyness compares which pixel in a local region is more key, which can guide the keypoints of novel classes to move towards the local maximum (i.e., obtaining keypoints). Correspondence compares whether two pixels belong to the same semantic part, which can activate the keypoints of novel classes by reinforcing the consistency between corresponding points on two paired images. By transferring keyness and correspondence, our framework achieves favourable performance for weak-shot keypoint estimation. Extensive experiments and analyses on the large-scale benchmark MP-100 demonstrate the effectiveness of our framework.


MALinZero: Efficient Low-Dimensional Search for Mastering Complex Multi-Agent Planning

Neural Information Processing Systems

Monte Carlo Tree Search (MCTS), which leverages Upper Confidence Bound for Trees (UCT) to balance exploration and exploitation through randomized sampling, is instrumental in solving complex planning problems. However, for multi-agent planning, MCTS is confronted with a large combinatorial action space that often grows exponentially with the number of agents. As a result, the branching factor of MCTS during tree expansion also increases exponentially, making it very difficult to efficiently explore and exploit during tree search. To this end, we propose MALinZero, a new approach to leverage low-dimensional representational structures on joint-action returns and enable efficient MCTS in complex multi-agent planning. Our solution can be viewed as projecting the joint-action returns into a low-dimensional space representable using a contextual linear bandit problem formulation. We solve the contextual linear bandit problem with convex and $\mu$-smooth loss functions -- in order to place more importance on better joint actions and mitigate potential representational limitations -- and derive a linear Upper Confidence Bound applied to trees (LinUCT) to enable novel multi-agent exploration and exploitation in the low-dimensional space. We analyze the regret of MALinZero for low-dimensional reward functions and propose a $(1-\tfrac{1}{e})$-approximation algorithm for the joint action selection by maximizing a sub-modular objective. MALinZero demonstrates state-of-the-art performance on multi-agent benchmarks such as matrix games, SMAC, and SMACv2, outperforming both model-based and model-free multi-agent reinforcement learning baselines with faster learning speed and better performance.


Fit the Distribution: Cross-Image/Prompt Adversarial Attacks on Multimodal Large Language Models

Neural Information Processing Systems

Although Multimodal Large Language Models (MLLMs) have demonstrated remarkable achievements in recent years, they remain vulnerable to adversarial examples that result in harmful responses. Existing attacks typically focus on optimizing adversarial perturbations for a certain multimodal image-prompt pair or fixed training dataset, which often leads to overfitting. Consequently, these perturbations fail to remain malicious once transferred to attack unseen image-prompt pairs, and covering the diverse multimodal inputs of complicated real-world scenarios incurs significant resource costs. To alleviate this issue, this paper proposes a novel adversarial attack on MLLMs based on distribution approximation theory, which models the potential image-prompt input distribution and adds the same distribution-fitting adversarial perturbation on multimodal input pairs to achieve effective cross-image/prompt transfer attacks. Specifically, we exploit the Laplace approximation to model the Gaussian distribution of the image and prompt inputs for the MLLM, deriving an estimate of the mean and covariance parameters. By sampling from this approximated distribution with a Monte Carlo mechanism, we efficiently optimize and fit a single input-agnostic perturbation over diverse image-prompt pairs, yielding strong universality and transferability. Extensive experiments are conducted to verify the strong adversarial capabilities of our proposed attack against prevalent MLLMs spanning a spectrum of images/prompts.


Tiled Flash Linear Attention: More Efficient Linear RNN and xLSTM Kernels

Neural Information Processing Systems

Linear RNNs with gating recently demonstrated competitive performance compared to Transformers in language modeling. Although their linear compute scaling in sequence length offers theoretical runtime advantages over Transformers, realizing these benefits in practice requires optimized custom kernels, as Transformers rely on the highly efficient Flash Attention kernels (Dao, 2024). Leveraging the chunkwise-parallel formulation of linear RNNs, Flash Linear Attention (FLA) (Yang & Zhang, 2024) shows that linear RNN kernels are faster than Flash Attention, by parallelizing over chunks of the input sequence. However, since the chunk size of FLA is limited, many intermediate states must be materialized in GPU memory. This leads to low arithmetic intensity and causes high memory consumption and IO cost, especially for long-context pre-training. In this work, we present Tiled Flash Linear Attention (TFLA), a novel kernel algorithm for linear RNNs that enables arbitrarily large chunk sizes and high arithmetic intensity by introducing an additional level of sequence parallelization within each chunk. First, we apply TFLA to the xLSTM with matrix memory, the mLSTM (Beck et al., 2024). Second, we propose an mLSTM variant with sigmoid input gate and reduced computation for even faster kernel runtimes at equal language modeling performance. In our speed benchmarks, we show that our new mLSTM kernels based on TFLA outperform highly optimized Flash Attention, Linear Attention and Mamba kernels, setting a new state of the art for efficient long-context sequence modeling primitives.
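
To make the chunkwise-parallel formulation concrete, here is a minimal pure-Python sketch for plain causal linear attention, with no gating and no normalization. It is neither the TFLA kernel nor the mLSTM; the function names and the toy sizes are assumptions for illustration. The chunkwise version stores one state per chunk boundary and handles the positions inside a chunk with direct causal attention, so a larger `chunk` means fewer materialized states but more intra-chunk work.

```python
import random


def recurrent_linear_attention(q, k, v):
    """Step-by-step form: S_t = S_{t-1} + k_t v_t^T, o_t = q_t^T S_t."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    out = []
    for q_t, k_t, v_t in zip(q, k, v):
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] += k_t[i] * v_t[j]
        out.append([sum(q_t[i] * state[i][j] for i in range(d_k))
                    for j in range(d_v)])
    return out


def chunkwise_linear_attention(q, k, v, chunk):
    """Same outputs, but the state is only updated at chunk boundaries."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # state at the last boundary
    out = []
    for start in range(0, len(q), chunk):
        end = min(start + chunk, len(q))
        for t in range(start, end):
            # Inter-chunk part: everything before this chunk, via the state.
            o_t = [sum(q[t][i] * state[i][j] for i in range(d_k))
                   for j in range(d_v)]
            # Intra-chunk part: causal attention over positions start..t.
            for s in range(start, t + 1):
                score = sum(a * b for a, b in zip(q[t], k[s]))
                for j in range(d_v):
                    o_t[j] += score * v[s][j]
            out.append(o_t)
        # Fold the whole chunk into the state once.
        for s in range(start, end):
            for i in range(d_k):
                for j in range(d_v):
                    state[i][j] += k[s][i] * v[s][j]
    return out


rng = random.Random(0)
T, D_K, D_V = 10, 3, 2  # toy sizes (assumed)
q = [[rng.uniform(-1, 1) for _ in range(D_K)] for _ in range(T)]
k = [[rng.uniform(-1, 1) for _ in range(D_K)] for _ in range(T)]
v = [[rng.uniform(-1, 1) for _ in range(D_V)] for _ in range(T)]

ref = recurrent_linear_attention(q, k, v)
chunked = chunkwise_linear_attention(q, k, v, chunk=4)
max_diff = max(abs(a - b) for r, c in zip(ref, chunked) for a, b in zip(r, c))
print("max abs difference:", max_diff)
```

The two functions agree up to floating-point rounding for any chunk size, including chunk sizes that do not divide the sequence length.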


StelLA: Subspace Learning in Low-rank Adaptation using Stiefel Manifold

Neural Information Processing Systems

Low-rank adaptation (LoRA) has been widely adopted as a parameter-efficient technique for fine-tuning large-scale pre-trained models. However, it still lags behind full fine-tuning in performance, partly due to its insufficient exploitation of the geometric structure underlying low-rank manifolds. In this paper, we propose a geometry-aware extension of LoRA that uses a three-factor decomposition $USV^\top$.
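
For orientation, a three-factor low-rank update of this kind can be written as below. The dimensions $m$, $n$, $r$ and the orthonormality constraints are assumptions inferred from the title's reference to the Stiefel manifold (the set of matrices with orthonormal columns); the abstract as given here does not state them.

$$
\Delta W = U S V^\top, \qquad U \in \mathbb{R}^{m \times r},\; S \in \mathbb{R}^{r \times r},\; V \in \mathbb{R}^{n \times r}, \qquad U^\top U = V^\top V = I_r .
$$

Under these assumptions $U$ and $V$ span the column and row subspaces of the update, while $S$ carries its scale. By comparison, the standard two-factor LoRA update $\Delta W = BA$ places no such constraint on its factors.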


Tackling Biased Evaluators in Dueling Bandits

Neural Information Processing Systems

In dueling bandits, an agent explores and exploits choices (i.e., arms) by learning from their stochastic feedback in the form of relative preferences. Prior related studies focused on unbiased feedback. In practice, however, the feedback provided by evaluators can be biased. For example, human users are likely to provide biased evaluations of large language models owing to their heterogeneous backgrounds. In this work, we aim to minimize the regret in dueling bandits under evaluators' biased feedback.


IneqSearch: Hybrid Reasoning for Olympiad Inequality Proofs

Neural Information Processing Systems

Mathematicians have long employed decomposition techniques to prove inequalities, yet automating this process remains a significant challenge in computational mathematics. We introduce IneqSearch, a hybrid reasoning system that integrates symbolic computation with large language models (LLMs) to address this challenge. IneqSearch reformulates inequality proving as a structured search problem: identifying appropriate combinations of theorems that decompose expressions into non-negative components. The system combines a symbolic solver for deductive reasoning with an LLM-based agent for constructive proof exploration, effectively implementing methodologies observed in formal mathematical practice. A key contribution of IneqSearch is its iterative learning mechanism that systematically incorporates newly proven results into its theorem database, enabling knowledge acquisition during practice that enhances its capabilities without requiring human intervention. In empirical evaluation on 437 Olympiad-level inequalities, IneqSearch successfully proves 342 problems, significantly outperforming existing methods and demonstrating the effectiveness of integrating symbolic and neural approaches for mathematical reasoning.