Continual Optimization with Symmetry Teleportation for Multi-Task Learning

Neural Information Processing Systems

Multi-task learning (MTL) is a widely explored paradigm that enables the simultaneous learning of multiple tasks using a single model. Despite numerous solutions, the key issues of optimization conflict and task imbalance remain under-addressed, limiting performance. Unlike existing optimization-based approaches that typically reweight task losses or gradients to mitigate conflicts or promote progress, we propose a novel approach based on Continual Optimization with Symmetry Teleportation (COST). During MTL optimization, when an optimization conflict arises, we seek an alternative loss-equivalent point on the loss landscape to reduce conflict. Specifically, we utilize a low-rank adapter (LoRA) to facilitate this practical teleportation by designing convergent, loss-invariant objectives. Additionally, we introduce a historical trajectory reuse strategy to continually leverage the benefits of advanced optimizers. Extensive experiments on multiple mainstream datasets demonstrate the effectiveness of our approach. COST is a plug-and-play solution that enhances a wide range of existing MTL methods. When integrated with state-of-the-art methods, COST achieves superior performance.
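The abstract does not say how an optimization conflict is detected. A common convention in the MTL literature is to flag a conflict when two task gradients have negative cosine similarity. The sketch below illustrates only that convention; the function names and the threshold are assumptions for illustration, not part of COST.

```python
import math


def cosine_similarity(g1, g2):
    """Cosine similarity between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(g1, g2))
    n1 = math.sqrt(sum(a * a for a in g1))
    n2 = math.sqrt(sum(b * b for b in g2))
    if n1 == 0.0 or n2 == 0.0:
        return 0.0  # a zero gradient cannot conflict with anything
    return dot / (n1 * n2)


def has_conflict(task_grads, threshold=0.0):
    """Return True if any pair of task gradients points in opposing directions.

    `threshold` is a hypothetical knob: 0.0 means "negative cosine similarity".
    """
    for i in range(len(task_grads)):
        for j in range(i + 1, len(task_grads)):
            if cosine_similarity(task_grads[i], task_grads[j]) < threshold:
                return True
    return False
```

Under this convention, a conflict check of this kind would be the trigger for seeking a loss-equivalent point; how the teleportation itself is carried out is described only at the level of the abstract.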


Vanish into Thin Air: Cross-prompt Universal Adversarial Attacks for SAM2

Neural Information Processing Systems

Recent studies reveal the vulnerability of the image segmentation foundation model SAM to adversarial examples. Its successor, SAM2, has attracted significant attention due to its strong generalization capability in video segmentation. However, its robustness remains unexplored, and it is unclear whether existing attacks on SAM can be directly transferred to SAM2. In this paper, we first analyze the performance gap of existing attacks between SAM and SAM2 and highlight two key challenges arising from their architectural differences: directional guidance from the prompt and semantic entanglement across consecutive frames. To address these issues, we propose UAP-SAM2, the first cross-prompt universal adversarial attack against SAM2 driven by dual semantic deviation. For cross-prompt transferability, we begin by designing a target-scanning strategy that divides each frame into k regions, each randomly assigned a prompt, to reduce prompt dependency during optimization.
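As a rough illustration of the target-scanning idea, the sketch below splits a frame into k = rows × cols grid cells and draws one random point prompt inside each cell. The grid layout, the use of point prompts, and the function name are assumptions; the abstract states only that each frame is divided into k regions, each randomly assigned a prompt.

```python
import random


def scan_prompts(height, width, rows, cols, seed=None):
    """Sample one random (x, y) point prompt per grid cell of a frame.

    Assumes height >= rows and width >= cols so every cell is non-empty.
    """
    rng = random.Random(seed)
    prompts = []
    for r in range(rows):
        y0, y1 = r * height // rows, (r + 1) * height // rows
        for c in range(cols):
            x0, x1 = c * width // cols, (c + 1) * width // cols
            # randrange excludes the upper bound, so the point stays in the cell
            prompts.append((rng.randrange(x0, x1), rng.randrange(y0, y1)))
    return prompts
```

Spreading prompts over the whole frame in this way is one plausible reading of how dependence on any single prompt location could be reduced during optimization.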


SILENCER: From Discovery to Mitigation of Self-Bias in LLM-as-Benchmark-Generator

Neural Information Processing Systems

LLM-as-Benchmark-Generator methods have been widely studied as a supplement to human annotators for scalable evaluation, while the potential biases within this paradigm remain underexplored. In this work, we systematically define and validate the phenomenon of inflated performance in models evaluated on their self-generated benchmarks, referred to as self-bias, and attribute it to sub-biases arising from question domain, language style, and wrong labels. On this basis, we propose SILENCER, a general framework that leverages the heterogeneity between multiple generators at both the sample and benchmark levels to neutralize bias and generate high-quality, self-bias-silenced benchmarks. Experimental results across various settings demonstrate that SILENCER can suppress self-bias to near zero and significantly improve the evaluation effectiveness of the generated benchmark (with an average improvement from 0.655 to 0.833 in Pearson correlation with a high-quality human-annotated benchmark), while also exhibiting strong generalizability.
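The evaluation-effectiveness figures above are Pearson correlations. A minimal sketch of that statistic follows, assuming it is computed between per-model scores on a generated benchmark and on a human-annotated one; the pairing of scores is an assumption about the setup, not a detail given in the abstract.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)
```

A value near 1 would mean the generated benchmark ranks and spaces models much as the human-annotated benchmark does.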


baf0fab890edc9dce805d7c518058712-Paper-Conference.pdf

Neural Information Processing Systems

Large Vision-Language Models (VLMs) have achieved remarkable success in understanding complex real-world scenarios and supporting data-driven decision-making processes. However, VLMs exhibit significant vulnerability to adversarial examples, whether textual or visual, which can lead to various adversarial outcomes such as jailbreaking, hijacking, and hallucination. In this work, we empirically and theoretically demonstrate that VLMs are particularly susceptible to image-based adversarial examples, where imperceptible perturbations can precisely manipulate each output token. To this end, we propose a novel attack called Vision-language model Manipulation Attack (VMA), which integrates first-order and second-order momentum optimization techniques with a differentiable transformation mechanism to effectively optimize the adversarial perturbation.


Discrete Neural Flow Samplers with Locally Equivariant Transformer

Neural Information Processing Systems

Sampling from unnormalised discrete distributions is a fundamental problem across various domains. While Markov chain Monte Carlo offers a principled approach, it often suffers from slow mixing and poor convergence. In this paper, we propose Discrete Neural Flow Samplers (DNFS), a trainable and efficient framework for discrete sampling. DNFS learns the rate matrix of a continuous-time Markov chain such that the resulting dynamics satisfy the Kolmogorov equation. As this objective involves the intractable partition function, we then employ control variates to reduce the variance of its Monte Carlo estimation, leading to a coordinate descent learning algorithm. To further facilitate computational efficiency, we propose a locally equivariant Transformer, a novel parameterisation of the rate matrix that significantly improves training efficiency while preserving powerful network expressiveness. Empirically, we demonstrate the efficacy of DNFS in a wide range of applications, including sampling from unnormalised distributions, training discrete energy-based models, and solving combinatorial optimisation problems.
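For reference, the Kolmogorov forward equation of a continuous-time Markov chain can be written as below. The notation is an assumption chosen for illustration: p_t(x) is the probability of state x at time t, and R_t(y, x) is the rate of jumping from state y to state x. The abstract does not give the paper's own notation.

```latex
\frac{\partial p_t(x)}{\partial t}
  = \sum_{y \neq x} \Big[ R_t(y, x)\, p_t(y) - R_t(x, y)\, p_t(x) \Big]
```

The first term collects probability flowing into x from other states and the second collects probability flowing out of x, so a learned rate matrix satisfies the equation when these flows reproduce the prescribed evolution of p_t.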


Analogy-based Multi-Turn Jailbreak against Large Language Models

Neural Information Processing Systems

Large language models (LLMs) are inherently designed to support multi-turn interactions, which opens up new possibilities for jailbreak attacks that unfold gradually and potentially bypass safety mechanisms more effectively than single-turn attacks. However, current multi-turn jailbreak methods are still in their early stages and suffer from two key limitations. First, they all inherently require inserting sensitive phrases into the context, which makes the dialogue appear suspicious and increases the likelihood of rejection, undermining the effectiveness of the attack. Second, even when harmful content is generated, the response often fails to align with the malicious prompt due to semantic drift, where the conversation slowly moves away from its intended goal. To address these challenges, we propose an analogy-based black-box multi-turn jailbreak framework that constructs fully benign contexts to improve attack success rate while ensuring semantic alignment with the malicious intent. The method first guides the model through safe tasks that mirror the response structure of the malicious prompt, enabling it to internalize the format without exposure to sensitive content. A controlled semantic shift is then introduced in the final turn, substituting benign elements with malicious ones while preserving structural coherence. Experiments on six commercial and open-source LLMs and two benchmark datasets show that our method significantly improves attack performance, achieving an average attack success rate of 93.3% and outperforming five competitive baselines. Our code is released at AMA. WARNING: This paper contains potentially unsafe examples.


ab6eba9a853087993addff937c8cec87-Paper-Conference.pdf

Neural Information Processing Systems

Spatiotemporal trajectory data is crucial for various traffic-related applications. However, issues such as device malfunctions and network instability often result in sparse trajectories that lose detailed movement information compared to their dense counterparts. Recovering missing points in sparse trajectories is thus essential. Despite recent progress, three challenges remain. First, the lack of large-scale dense trajectory datasets hinders the training of a trajectory recovery model. Second, the varying spatiotemporal correlations in sparse trajectories make it hard to generalize across different sampling intervals.


Token Bottleneck: One Token to Remember Dynamics

Neural Information Processing Systems

Deriving compact and temporally aware visual representations from dynamic scenes is essential for successful execution of sequential scene understanding tasks such as visual tracking and robotic manipulation. In this paper, we introduce Token Bottleneck (ToBo), a simple yet intuitive self-supervised learning pipeline that squeezes a scene into a bottleneck token and predicts the subsequent scene using minimal patches as hints.


PhysDiff: A Physically-Guided Diffusion Model for Multivariate Time Series Anomaly Detection

Neural Information Processing Systems

Unsupervised anomaly detection of multivariate time series remains challenging under complex non-stationary dynamics, owing to high false-positive rates and limited interpretability. We propose PhysDiff, combining physics-guided decomposition with diffusion-based reconstruction, to address these issues. The physics-guided signal decomposition is introduced to disentangle overlapping dynamics by isolating high-frequency oscillations and low-frequency trends, which can reduce interference and provide meaningful physical priors. The reconstruction through conditional diffusion modeling captures deviations from learned normal behavior, making anomalies more distinguishable. Notably, PhysDiff introduces an amplitude-sensitive permutation entropy criterion to adaptively determine the optimal decomposition depth and automatically extract adaptive frequency components used as explicit physics-based constraints for the diffusion process. Furthermore, the proposed conditional diffusion network employs a dual-path conditioning mechanism that integrates high-frequency and low-frequency physical priors, dynamically regulating the denoising process via a novel time-frequency energy routing mechanism. By weighting reconstruction errors across frequency bands, our method improves anomaly localization and enhances interpretability. Extensive experiments on five benchmark datasets and two NeurIPS-TS scenarios demonstrate that PhysDiff outperforms 18 state-of-the-art baselines, with average F1 score improvements on both standard and challenging datasets.
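The abstract does not define its amplitude-sensitive permutation entropy criterion. As background only, the sketch below computes the standard (amplitude-blind) normalized permutation entropy of Bandt and Pompe, which such a criterion presumably builds on; the parameter names `order` and `delay` are illustrative choices.

```python
import math
from collections import Counter


def permutation_entropy(series, order=3, delay=1):
    """Normalized permutation entropy in [0, 1] of a 1-D sequence.

    This is the standard variant, which ignores amplitudes; it is NOT the
    amplitude-sensitive criterion named in the abstract.
    """
    n_windows = len(series) - (order - 1) * delay
    if n_windows <= 0:
        raise ValueError("series too short for the given order and delay")
    counts = Counter()
    for i in range(n_windows):
        window = [series[i + j * delay] for j in range(order)]
        # Ordinal pattern: the indices that would sort the window
        pattern = tuple(sorted(range(order), key=window.__getitem__))
        counts[pattern] += 1
    probs = [c / n_windows for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    # Normalize by the maximum entropy over all order! ordinal patterns
    return entropy / math.log(math.factorial(order))
```

Low values indicate a regular, predictable component (for example a smooth trend), while values near 1 indicate noise-like behavior, which is why entropy measures of this kind are a natural stopping signal for a decomposition.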


9796170d31d42b943534df40bdee68d3-Paper-Conference.pdf

Neural Information Processing Systems

Large Vision-Language Models (LVLMs) are susceptible to hallucinations, where generated responses seem semantically plausible yet exhibit little or no relevance to the input image. Previous studies reveal that this issue primarily stems from LVLMs' over-reliance on language priors while disregarding the visual information during decoding. To alleviate this issue, we introduce a novel Conditional Pointwise Mutual Information (C-PMI) calibrated decoding strategy, which adaptively strengthens the mutual dependency between generated texts and input images to mitigate hallucinations. Unlike existing methods solely focusing on text token sampling, we propose to jointly model the contributions of visual and textual tokens to C-PMI, formulating hallucination mitigation as a bi-level optimization problem aimed at maximizing mutual information. To solve it, we design a token purification mechanism that dynamically regulates the decoding process by sampling text tokens remaining maximally relevant to the given image, while simultaneously refining image tokens most pertinent to the generated response. Extensive experiments across various benchmarks reveal that the proposed method significantly reduces hallucinations in LVLMs while preserving decoding efficiency.
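For orientation, the textbook definition of conditional pointwise mutual information is given below, with symbols chosen here for illustration: y is the generated text (or a token of it), v the input image, and x the textual context. The paper's exact C-PMI formulation may differ, since the abstract does not spell it out.

```latex
\mathrm{PMI}(y;\, v \mid x) = \log \frac{p(y \mid v, x)}{p(y \mid x)}
```

The quantity is large when conditioning on the image makes the text much more likely than the language prior alone would, which matches the stated goal of strengthening the dependency between generated text and the input image.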