Taming Hyperparameter Sensitivity in Data Attribution: Practical Selection Without Costly Retraining
Data attribution methods, which quantify the influence of individual training data points on a machine learning model, have gained increasing popularity in data-centric applications in modern AI. Despite a recent surge of new methods developed in this space, the impact of hyperparameter tuning in these methods remains under-explored. In this work, we present the first large-scale empirical study to understand the hyperparameter sensitivity of common data attribution methods. Our results show that most methods are indeed sensitive to certain key hyperparameters. However, unlike typical machine learning algorithms, whose hyperparameters can be tuned using computationally cheap validation metrics, evaluating data attribution performance often requires retraining models on subsets of training data, making such metrics prohibitively costly for hyperparameter tuning.
Promptable 3-D Object Localization with Latent Diffusion Models
Accurate identification and localization of objects in 3-D scenes are essential for advancing comprehensive 3-D scene understanding. Although diffusion models have demonstrated impressive capabilities across a broad spectrum of computer vision tasks, their potential in both 2-D and 3-D object detection remains underexplored. Existing approaches typically formulate detection as a "noise-to-box" process, but they rely heavily on direct coordinate regression, which limits adaptability for more advanced tasks such as grounding-based object detection. To overcome these challenges, we propose a promptable 3-D object recognition framework, which introduces a diffusion-based paradigm for flexible and conditionally guided 3-D object detection. Our approach encodes bounding boxes into latent representations and employs latent diffusion models to realize a "promptable noise-to-box" transformation. This formulation enables the refinement of standard 3-D object detection using textual prompts, such as class labels. Moreover, it naturally extends to grounding object detection through conditioning on natural language descriptions, and generalizes effectively to few-shot learning by incorporating annotated exemplars as visual prompts. We conduct thorough evaluations on three key 3-D object recognition tasks: general 3-D object detection, few-shot detection, and grounding-based detection. Experimental results demonstrate that our framework achieves competitive performance relative to state-of-the-art methods, validating its effectiveness, versatility, and broad applicability in 3-D computer vision.
DGSolver: Diffusion Generalist Solver with Universal Posterior Sampling for Image Restoration
Diffusion models have achieved remarkable progress in universal image restoration. However, existing methods perform naive inference in the reverse process, which leads to cumulative errors under limited sampling steps and large step intervals. Moreover, they struggle to balance the commonality of degradation representations with restoration quality, often depending on complex compensation mechanisms that enhance fidelity at the expense of efficiency. To address these challenges, we introduce DGSolver, a diffusion generalist solver with universal posterior sampling. We first derive the exact ordinary differential equations for generalist diffusion models to unify degradation representations and design tailored high-order solvers with a queue-based accelerated sampling strategy to improve both accuracy and efficiency. We then integrate universal posterior sampling to better approximate manifold-constrained gradients, yielding a more accurate noise estimation and correcting errors in inverse inference. Extensive experiments demonstrate that DGSolver outperforms state-of-the-art methods in restoration accuracy, stability, and scalability, both qualitatively and quantitatively.
Improving Regret Approximation for Unsupervised Dynamic Environment Generation
Unsupervised Environment Design (UED) seeks to automatically generate training curricula for reinforcement learning (RL) agents, with the goal of improving generalisation and zero-shot performance. However, designing effective curricula remains a difficult problem, particularly in settings where small subsets of environment parameterisations result in significant increases in the complexity of the required policy. Current methods struggle with a difficult credit assignment problem and rely on regret approximations that fail to identify challenging levels, both of which are compounded as the size of the environment grows. We propose Dynamic Environment Generation for UED (DEGen) to enable a denser level generator reward signal, reducing the difficulty of credit assignment and allowing UED to scale to larger environment sizes. We also introduce a new regret approximation, Maximised Negative Advantage (MNA), a significantly improved optimisation target that better identifies challenging levels. We show empirically that MNA outperforms current regret approximations and, when combined with DEGen, consistently outperforms existing methods, especially as the size of the environment grows.
STAR: Efficient Preference-based Reinforcement Learning via Dual Regularization
Because feedback is costly to obtain, preference-based reinforcement learning (PbRL) typically relies on a limited set of preference-labeled samples. This data scarcity introduces two key inefficiencies: (1) the reward model overfits to the limited feedback, leading to poor generalization to unseen samples, and (2) the agent exploits the learned reward model, exacerbating overestimation of action values in temporal difference (TD) learning. To address these issues, we propose STAR, an efficient PbRL method that integrates preference margin regularization and policy regularization.
To Think or Not To Think: A Study of Thinking in Rule-Based Visual Reinforcement Fine-Tuning
This paper investigates the role of the explicit thinking process in rule-based reinforcement fine-tuning (RFT) for multi-modal large language models (MLLMs). We first extend Thinking-RFT to the image classification task, using verifiable rewards for fine-tuning (FT). Experiments show that Thinking-RFT significantly outperforms supervised FT and yields a cross-dataset generalization effect. We then question whether explicit thinking in RFT is always necessary and beneficial. Challenging the convention that explicit thinking is crucial for the success of RFT, we introduce No-Thinking-RFT, which explores RFT without thinking by using a simple equality accuracy reward. We evaluate No-Thinking-RFT on six diverse tasks across different model sizes and types. Experimental results reveal four key findings: (1) Visual perception tasks do not require thinking during RFT, as No-Thinking-RFT consistently outperforms or matches Thinking-RFT across model sizes and types.
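As a rough illustration of what a "simple equality accuracy reward" could look like, the sketch below returns 1 when the model's output equals the ground-truth label and 0 otherwise. The function name `equality_reward` and the whitespace and case normalization are assumptions made for this example; the abstract does not specify how outputs are compared.

```python
def equality_reward(model_output: str, label: str) -> float:
    """Return 1.0 if the output equals the label, else 0.0.

    Assumption: outputs are compared after stripping surrounding
    whitespace and lowercasing. The source does not state this.
    """
    return 1.0 if model_output.strip().lower() == label.strip().lower() else 0.0


# An output with extra text around the label gets no reward, so the
# model is pushed to emit the answer directly, without a thinking trace.
rewards = [
    equality_reward(" Cat ", "cat"),
    equality_reward("The answer is cat", "cat"),
    equality_reward("dog", "cat"),
]
```

Under this assumed definition, any reasoning text in the output breaks the equality check, which is one plausible way such a reward discourages explicit thinking.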
On the Stability and Generalization of Meta-Learning: the Impact of Inner-Levels
Meta-learning has achieved significant advancements, with generalization emerging as a key metric for evaluating meta-learning algorithms. While recent studies have mainly focused on training strategies, data-split methods, and tightening generalization bounds, they often ignore the impact of inner-levels on generalization. To bridge this gap, this paper focuses on several prominent meta-learning algorithms and establishes two generalization analytical frameworks for them based on their inner-processes: the Gradient Descent Framework (GDF) and the Proximal Descent Framework (PDF). Within these frameworks, we introduce two novel algorithmic stability definitions and derive the corresponding generalization bounds. Our findings reveal a trade-off of inner-levels under GDF, whereas PDF exhibits a beneficial relationship. Moreover, we highlight the critical role of the meta-objective function in minimizing generalization error. Inspired by this, we propose a new, simplified meta-objective function definition to enhance generalization performance. Many real-world experiments support our findings and show the improvement of the new meta-objective function.
Environment Inference for Learning Generalizable Dynamical System
Data-driven methods offer efficient and robust solutions for analyzing complex dynamical systems but rely on the assumption of I.I.D. data, driving the development of generalization techniques for handling environmental differences. These techniques, however, are limited by their dependence on environment labels, which are often unavailable during training due to data acquisition challenges, privacy concerns, and environmental variability, particularly in large public datasets and privacy-sensitive domains. In response, we propose DynaInfer, a novel method that infers environment specifications by analyzing prediction errors from fixed neural networks within each training round, enabling environment assignments directly from data. We prove our algorithm effectively solves the alternating optimization problem in unlabeled scenarios and validate it through extensive experiments across diverse dynamical systems. Results show that DynaInfer outperforms existing environment assignment techniques, converges rapidly to true labels, and even achieves superior performance when environment labels are available.
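The assignment step described above, inferring environments from the prediction errors of fixed models, can be sketched as follows. This is a minimal illustration and not the DynaInfer implementation: the two linear "models", the mean squared error, and the names `prediction_errors` and `assign_environments` are all assumptions standing in for the neural networks and loss the authors use.

```python
import numpy as np


def prediction_errors(models, xs, ys):
    """errors[i, k] = mean squared error of fixed model k on sample i."""
    return np.stack(
        [np.mean((model(xs) - ys) ** 2, axis=-1) for model in models], axis=1
    )


def assign_environments(models, xs, ys):
    """Assign each sample to the model (environment) with the lowest error."""
    return np.argmin(prediction_errors(models, xs, ys), axis=1)


# Hypothetical toy setup: two environments with linear dynamics y = a * x.
models = [lambda x: 0.5 * x, lambda x: -1.0 * x]
xs = np.array([[1.0, 2.0], [1.0, 2.0], [3.0, -1.0]])
ys = np.array([[0.5, 1.0], [-1.0, -2.0], [-3.0, 1.0]])

labels = assign_environments(models, xs, ys)
```

In a full alternating scheme, this assignment would be followed by retraining each model on the samples currently assigned to it, and the two steps would repeat each round.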
Data Mixing Can Induce Phase Transitions in Knowledge Acquisition
Large Language Models (LLMs) are typically trained on data mixtures: most data come from web scrapes, while a small portion is curated from high-quality sources with dense domain-specific knowledge. In this paper, we show that when training LLMs on such data mixtures, knowledge acquisition from knowledge-dense datasets, unlike training exclusively on knowledge-dense data, does not always follow a smooth scaling law but can exhibit phase transitions with respect to the mixing ratio and model size. Through controlled experiments on a synthetic biography dataset mixed with web-scraped data, we demonstrate that: (1) as we increase the model size to a critical value, the model suddenly transitions from memorizing very few to most of the biographies; (2) below a critical mixing ratio, the model memorizes almost nothing even with extensive training, but beyond this threshold, it rapidly memorizes more biographies. We attribute these phase transitions to a capacity allocation phenomenon: a model with bounded capacity must act like a knapsack problem solver to minimize the overall test loss, and the optimal allocation across datasets can change discontinuously as the model size or mixing ratio varies. We formalize this intuition in an information-theoretic framework and reveal that these phase transitions are predictable, with the critical mixing ratio following a power-law relationship with the model size. Our findings highlight a concrete case where a good mixing recipe for large models may not be optimal for small models, and vice versa.
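The knapsack intuition can be made concrete with a toy calculation. The sketch below is not the paper's information-theoretic framework: it assumes a hypothetical model whose capacity is spent either on web data with diminishing returns or on a single all-or-nothing knowledge-dense "item". The function `best_allocation` and every number in it are invented for illustration.

```python
def best_allocation(capacity, mix_ratio, web_gains, kd_cost, kd_gain):
    """Decide whether to spend capacity on the knowledge-dense item.

    web_gains: marginal loss reductions per unit of capacity spent on
    web data, sorted in decreasing order (diminishing returns).
    kd_cost / kd_gain: capacity needed for, and loss reduction from,
    memorizing the knowledge-dense data (treated as all-or-nothing).
    Returns (take_knowledge_item, total_weighted_loss_reduction).
    """
    def web_value(units):
        return (1 - mix_ratio) * sum(web_gains[:units])

    skip = web_value(capacity)
    if kd_cost > capacity:
        return False, skip
    take = mix_ratio * kd_gain + web_value(capacity - kd_cost)
    return take > skip, max(take, skip)


# Hypothetical numbers: harmonic diminishing returns on web data.
web_gains = [1.0 / (i + 1) for i in range(20)]

low_ratio = best_allocation(10, 0.05, web_gains, kd_cost=6, kd_gain=10.0)
high_ratio = best_allocation(10, 0.10, web_gains, kd_cost=6, kd_gain=10.0)
small_model = best_allocation(5, 0.10, web_gains, kd_cost=6, kd_gain=10.0)
```

In this toy setting the optimal choice flips from skipping to taking the knowledge-dense item as the mixing ratio rises from 0.05 to 0.10, and a model with capacity below `kd_cost` never takes it, mirroring the two thresholds the abstract describes.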
VMDT: Decoding the Trustworthiness of Video Foundation Models
As foundation models become more sophisticated, ensuring their trustworthiness becomes increasingly critical; yet, unlike text and image, the video modality still lacks comprehensive trustworthiness benchmarks. We introduce VMDT (Video-Modal DecodingTrust), the first unified platform for evaluating text-to-video (T2V) and video-to-text (V2T) models across five key trustworthiness dimensions: safety, hallucination, fairness, privacy, and adversarial robustness. Through our extensive evaluation of 7 T2V models and 19 V2T models using VMDT, we uncover several significant insights. For instance, all open-source T2V models evaluated fail to recognize harmful queries and often generate harmful videos, while exhibiting higher levels of unfairness compared to image modality models. In V2T models, unfairness and privacy risks rise with scale, whereas hallucination and adversarial robustness improve, though overall performance remains low. Uniquely, safety shows no correlation with model size, implying that factors other than scale govern current safety levels. Our findings highlight the urgent need for developing more robust and trustworthy video foundation models, and VMDT provides a systematic framework for measuring and tracking progress toward this goal.