Technology
Supplementary Materials for MUVR: AMulti-Modal Untrimmed Video Retrieval Benchmark with Multi-Level Visual Correspondence Anonymous Author(s) Affiliation Address email
In this supplementary material, we elaborate on the MLLMs prompting details in Section 1. We1 further illustrate the annotation instructions in Section 2. Then, some visualization examples are2 provided in Section 3. Limitations and social impact are introduced in Section 4.3 The evaluation prompts for MLLMs are listed in Table 1 and 2. Although we attempted to maintain5 consistency across models, slight variations were necessary due to differing prompting requirements.6 We take the relationship annotation of9 the News partition as an example, while other partitions have different visual correspondences.10 3 Visualization11 Figure 1, 2, 3, 4 and 5 provide several relevant examples of different partitions from MUVR, with a12 text description of the query video and the tag of each video.13 MUVR relies on human annotators to annotate videos with rich semantics.
MUVR: AMulti-Modal Untrimmed Video Retrieval Benchmark with Multi-Level Visual Correspondence
We propose the Multi-modal Untrimmed Video Retrieval task, along with a new benchmark (MUVR) to advance video retrieval for long-video platforms. MUVR aims to retrieve untrimmed videos containing relevant segments using multi-modal queries. It has the following features: 1) Practical retrieval paradigm: MUVR supports video-centric multi-modal queries, expressing fine-grained retrieval needs through long text descriptions, video tag prompts, and mask prompts. It adopts a one-to-many retrieval paradigm and focuses on untrimmed videos, tailored for long-video platform applications.
Hyperbolic Fine-Tuning for Large Language Models
Large language models (LLMs) have demonstrated remarkable performance across various tasks. However, it remains an open question whether the default Euclidean space is the most suitable choice for LLMs. In this study, we investigate the geometric characteristics of LLMs, focusing specifically on tokens and their embeddings. Our findings reveal that token frequency follows a power-law distribution, where high-frequency tokens (e.g., "the," "that") constitute the minority, while low-frequency tokens (e.g., "apple," "dog") constitute the majority. Furthermore, high-frequency tokens cluster near the origin, whereas low-frequency tokens are positioned farther away in the embedding space.
Rethinking Gradient Step Denoiser: Towards Truly Pseudo-Contractive Operator
Learning pseudo-contractive denoisers is a fundamental challenge in the theoretical analysis of Plug-and-Play (PnP) methods and the Regularization by Denoising (RED) framework. While spectral methods attempt to address this challenge using the power iteration method, they fail to guarantee the truly pseudo-contractive property and suffer from high computational complexity. In this work, we rethink gradient step (GS) denoisers and establish a theoretical connection between GS denoisers and pseudo-contractive operators. We show that GS denoisers, with the gradients of convex potential functions parameterized by input convex neural networks (ICNNs), can achieve truly pseudo-contractive properties. Furthermore, we integrate the learned truly pseudo-contractive denoiser into the RED-PRO (RED via fixed-point projection) model, definitely ensuring convergence in terms of both iterative sequences and objective functions. Extensive numerical experiments confirm that the learned GS denoiser satisfies the truly pseudo-contractive property and, when integrated into RED-PRO, provides a favorable trade-off between interpretability and empirical performance on inverse problems.
Fire360: ABenchmark for Robust Perception and Episodic Memory in Degraded 360 Firefighting Video
Modern AI systems struggle most in environments where reliability is criticalscenes with smoke, poor visibility, and structural deformation. Each year, tens of thousands of firefighters are injured on duty, often due to breakdowns in situational perception [35]. We introduce Fire360, a benchmark for evaluating perception and reasoning in safety-critical firefighting scenarios. The dataset includes 228 360 videos from professional training sessions under diverse conditions (e.g., low light, thermal distortion), annotated with action segments, object locations, and degradation metadata. Fire360 supports five tasks: Visual Question Answering, Temporal Action Captioning, Object Localization, Safety-Critical Reasoning, and Transformed Object Retrieval (TOR). TOR tests whether models can match pristine exemplars to fire-damaged counterparts in unpaired scenes, evaluating episodic memory under irreversible visual transformations. While human experts achieve 83.5% on TOR, models like GPT-4o lag significantly, exposing failures in reasoning under degradation. By releasing Fire360 and its evaluation suite, we aim to advance models that not only see, but also remember, reason, and act under uncertainty.
MetaKoopman: Bayesian Meta-Learning of Koopman Operators for Modeling Structured Dynamics under Distribution Shifts
Modeling and forecasting nonlinear dynamics under distribution shifts is essential for robust decision-making in real-world systems. In this work, we propose MetaKoopman, a Bayesian meta-learning framework for modeling nonlinear dynamics through linear latent representations. MetaKoopman learns a Matrix Normal-Inverse Wishart (MNIW) prior over the Koopman operator, enabling closed-form Bayesian updates conditioned on recent trajectory segments. Moreover, it provides a closed-form posterior predictive distribution over future state trajectories, capturing both epistemic and aleatoric uncertainty in the learned dynamics. We evaluate MetaKoopman on a full-scale autonomous truck and trailer system across a wide range of adverse winter scenarios--including snow, ice, and mixed-friction conditions--as well as in simulated control tasks with diverse distribution shifts. MetaKoopman consistently outperforms prior approaches in multi-step prediction accuracy, uncertainty calibration and robustness to distributional shifts. Field experiments further demonstrate its effectiveness in dynamically feasible motion planning, particularly during evasive maneuvers and operation at the limits of traction.
Memorization in Graph Neural Networks
Deep neural networks (DNNs) have been shown to memorize their training data, but similar analyses for graph neural networks (GNNs) remain under-explored. We introduce NCMemo(Node Classification Memorization), the first framework to quantify label memorization in semi-supervised node classification. We establish an inverse relationship between memorization and graph homophily, i.e., the tendency of connected nodes to share labels or features. Lower homophily significantly increases memorization, indicating that GNNs rely on label memorization when learning less homophilic graphs. We then analyze GNN training dynamics and find that increased memorization in low-homophily graphs is tightly coupled to GNNs' implicit bias toward using graph structure.
FairImagen: Post-Processing for Bias Mitigation in Text-to-Image Models
Text-to-image diffusion models, such as Stable Diffusion, have demonstrated remarkable capabilities in generating high-quality and diverse images from natural language prompts. However, recent studies reveal that these models often replicate and amplify societal biases, particularly along demographic attributes like gender and race. In this paper, we introduce FairImagen1, a post-hoc debiasing framework that operates on prompt embeddings to mitigate such biases without retraining or modifying the underlying diffusion model. Our method integrates Fair Principal Component Analysis to project CLIP-based input embeddings into a subspace that minimizes group-specific information while preserving semantic content. We further enhance debiasing effectiveness through empirical noise injection and propose a unified cross-demographic projection method that enables simultaneous debiasing across multiple demographic attributes. Extensive experiments across gender, race, and intersectional settings demonstrate that FairImagen significantly improves fairness with a moderate trade-off in image quality and prompt fidelity. Our framework outperforms existing post-hoc methods and offers a simple, scalable, and model-agnostic solution for equitable text-to-image generation.
The White House Is Ratcheting Up Its War Against Anthropic
This is how America loses the AI race. In theory, Donald Trump has a consistent position on AI. On the first full day of his second term, the president declared that he would use his full authority to speed the AI industry along and, in particular, to beat China in the AI race: "We have an emergency," he said. "We have to get this stuff built." If AI is poised to become the most important technology ever made, the thinking goes, whichever country commands the most powerful bots will dominate the rest of the century and beyond. The government, it seemed, would just get out of Silicon Valley's way.
Structure Aware Fusion with Progressive Injection for Molecular Representation Learning
Multimodal molecular models often suffer from 3D conformer unreliability and modality collapse, limiting their robustness and generalization. We propose MuMo, a structured multimodal fusion framework that addresses these challenges in molecular representation through two key strategies. To reduce the instability of conformer-dependent fusion, we design a Structured Fusion Pipeline (SFP) that combines 2D topology and 3D geometry into a unified and stable structural prior. To mitigate modality collapse caused by naive fusion, we introduce a Progressive Injection (PI) mechanism that asymmetrically integrates this prior into the sequence stream, preserving modality-specific modeling while enabling cross-modal enrichment. Built on a state space backbone, MuMo supports long-range dependency modeling and robust information propagation. Across 29 benchmark tasks from Therapeutics Data Commons (TDC) and MoleculeNet, MuMo achieves an average improvement of 2.7% over the best-performing baseline on each task, ranking first on 22 of them, including a 27% improvement on the LD50 task.