Technology
Diffusion Federated Dataset
Diffusion models have demonstrated decent generation quality, yet their deployment in federated learning scenarios remains challenging. Due to data heterogeneity and a large number of parameters, conventional parameter averaging schemes often fail to achieve stable collaborative training of diffusion models.
Predicting Empirical AIResearch Outcomes with Language Models
Many promising-looking ideas in AI research fail to deliver, but their validation takes substantial human labor and compute. Predicting an idea's chance of success is thus crucial for accelerating empirical AI research, a skill that even expert researchers can only acquire through substantial experience. We build the first benchmark for this task and compare LMs with human experts. Concretely, given two research ideas (e.g., two jailbreaking methods), we aim to predict which will perform better on a set of benchmarks. We scrape ideas and experimental results from conference papers, yielding 1,444 human-verified idea pairs published after our base model's cut-off date for testing, and 6,000 pairs for training. We then develop a system that combines a fine-tuned GPT-4.1 with a paper retrieval agent, and we recruit 25 human experts to compare with. In the NLP domain, our system beats human experts by a large margin (64.4% v.s.
Partition-Then-Adapt: Combating Prediction Bias for Reliable Multi-Modal Test-Time Adaptation
Existing test-time adaptation (TTA) methods primarily focus on scenarios involving domain shifts in a single modality. However, they often prove ineffective when multiple modalities simultaneously undergo domain shifts, as they struggle to identify and utilize reliable samples within testing batches amid severe prediction bias. To address this problem, we propose Partition-Then-Adapt (PTA), a novel approach combating prediction bias for TTA with multi-modal domain shifts. PTA comprises two key components: Partition and Debiased Reweighting (PDR) and multi-modal Attention-Guided Alignment (AGA). Specifically, PDR evaluates each sample's predicted label frequency relative to the batch average, partitioning the batch into potential reliable and unreliable subsets.
What We Miss Matters: Learning from the Overlooked in Point Cloud Transformers
Point Cloud Transformers have become a cornerstone in 3D representation for their ability to model long-range dependencies via self-attention. However, these models tend to overemphasize salient regions while neglecting other informative regions, which limits feature diversity and compromises robustness. To address this challenge, we introduce BlindFormer, a novel contrastive attention learning framework that redefines saliency by explicitly incorporating features typically neglected by the model. The proposed Attentional Blindspot Mining (ABM) suppresses highly attended regions during training, thereby guiding the model to explore its own blind spots. This redirection of attention expands the model's perceptual field and uncovers richer geometric cues.
Learning to Specialize: Joint Gating-Expert Training for Adaptive MoEs in Decentralized Settings
Mixture-of-Experts (MoEs) achieve scalability by dynamically activating subsets of their components. Yet, understanding how expertise emerges through joint training of gating mechanisms and experts remains incomplete, especially in scenarios without clear task partitions. Motivated by inference costs and data heterogeneity, we study how joint training of gating functions and experts can dynamically allocate domain-specific expertise across multiple underlying data distributions. As an outcome of our framework, we develop an instance tailored specifically to decentralized training scenarios, introducing Dynamically Decentralized Orchestration of MoEs or DDOME. DDOME leverages heterogeneity emerging from distributional shifts across decentralized data sources to specialize experts dynamically. By integrating a pretrained common expert to inform a gating function, DDOMEachieves personalized expert subset selection on-the-fly, facilitating just-in-time personalization.
FlexAC: Towards Flexible Control of Associative Reasoning in Multimodal Large Language Models
Multimodal large language models (MLLMs) face an inherent trade-off between faithfulness and creativity, as different tasks require varying degrees of associative reasoning. However, existing methods lack the flexibility to modulate this reasoning strength, limiting MLLMs' adaptability across factual and creative scenarios. To bridge this gap, we propose equipping MLLMs with mechanisms that enable flexible control over associative reasoning. We begin by investigating the internal mechanisms underlying associative behavior in MLLMs and find that: (1) middle layers play a pivotal role in shaping model's associative tendencies, (2) modifying representations in these layers effectively regulates associative reasoning strength, and (3) hallucinations can be exploited to derive steering vectors that guide this modulation. Building on these findings, we introduce Flexible Association Control (FlexAC), a lightweight and training-free framework for modulating associative behavior in MLLMs.
TRACE: Grounding Time Series in Context for Multimodal Embedding and Retrieval
The ubiquity of dynamic data in domains such as weather, healthcare, and energy underscores a growing need for effective interpretation and retrieval of time-series data. These data are inherently tied to domain-specific contexts, such as clinical notes or weather narratives, making cross-modal retrieval essential not only for downstream tasks but also for developing robust time-series foundation models by retrieval-augmented generation (RAG). Despite the increasing demand, time-series retrieval remains largely underexplored. Existing methods often lack semantic grounding, struggle to align heterogeneous modalities, and have limited capacity for handling multi-channel signals. To address this gap, we propose TRACE, a generic multimodal retriever that grounds time-series embeddings in aligned textual context. TRACEenables fine-grained channel-level alignment and employs hard negative mining to facilitate semantically meaningful retrieval.
Degradation-aware Dynamic Schrödinger Bridge for Unpaired Image Restoration
Image restoration is a fundamental task in computer vision and machine learning, which learns a mapping between the clear images and the degraded images under various conditions (e.g., blur, low-light, haze). Yet, most existing image restoration methods are highly restricted by the requirement of degraded and clear image pairs, which limits the generalization and feasibility to enormous real-world scenarios without paired images. To address this bottleneck, we propose a Degradation-aware Dynamic Schrödinger Bridge (DDSB) for unpaired image restoration. Its general idea is to learn a Schrödinger Bridge between clear and degraded image distribution, while at the same time emphasizing the physical degradation priors to reduce the accumulation of errors during the restoration process. ADegradation-aware Optimal Transport (DOT) learning scheme is accordingly devised. Training a degradation model to learn the inverse restoration process is particularly challenging, as it must be applicable across different stages of the iterative restoration process. A Dynamic Transport with Consistency (DTC) learning objective is further proposed to reduce the loss of image details in the early iterations and therefore refine the degradation model. Extensive experiments on multiple image degradation tasks show its state-of-the-art performance over the prior arts.