alignment
Improving Deep Learning for Accelerated MRI With Data Filtering
Deep neural networks achieve state-of-the-art results for accelerated MRI reconstruction. Most research on deep learning based imaging focuses on improving neural network architectures trained and evaluated on fixed and homogeneous training and evaluation data. In this work, we investigate data curation strategies for improving MRI reconstruction. We assemble a large dataset of raw k-space data from 18 public sources consisting of 1.1M images and construct a diverse evaluation set comprising 48 test sets, capturing variations in anatomy, contrast, number of coils, and other key factors. We propose and study different data filtering strategies to enhance performance of current state-of-the-art neural networks for accelerated MRI reconstruction. Our experiments show that filtering the training data leads to consistent, albeit modest, performance gains. These performance gains are robust across different training set sizes and accelerations, and we find that filtering is particularly beneficial when the proportion of in-distribution data in the unfiltered training set is low.
Extracting task-relevant preserved dynamics from contrastive aligned neural recordings
Recent work indicates that low-dimensional dynamics of neural and behavioral data are often preserved across days and subjects. However, extracting these preserved dynamics remains challenging: high-dimensional neural population activity and the recorded neuron populations vary across recording sessions. While existing modeling tools can improve alignment between neural and behavioral data, they often operate on a per-subject basis or discretize behavior into categories, disrupting its natural continuity and failing to capture the underlying dynamics. We introduce Contrastive Aligned Neural DYnamics (CANDY), an end-to-end framework that aligns neural and behavioral data using rank-based contrastive learning, adapted for continuous behavioral variables, to project neural activity from different sessions onto a shared low-dimensional embedding space. CANDY fits a shared linear dynamical system to the aligned embeddings, enabling an interpretable model of the conserved temporal structure in the latent space.
Aligning Text to Image in Diffusion Models is Easier Than You Think
While recent advancements in generative modeling have significantly improved text-image alignment, some residual misalignment between text and image representations still remains. Some approaches address this issue by fine-tuning models in terms of preference optimization, etc., which require tailored datasets. Orthogonal to these methods, we revisit the challenge from the perspective of representation alignment--an approach that has gained popularity with the success of REPresentation Alignment (REPA) [46]. We first argue that conventional text-to-image (T2I) diffusion models, typically trained on paired image and text data (i.e., positive pairs) by minimizing score matching or flow matching losses, is suboptimal from the standpoint of representation alignment.
ReAgent-V: AReward-Driven Multi-Agent Framework for Video Understanding
Video understanding is fundamental to tasks such as action recognition, video reasoning, and robotic control. Early video understanding methods based on large vision-language models (LVLMs) typically adopt a single-pass reasoning paradigm without dynamic feedback, limiting the model's capacity to self-correct and adapt in complex scenarios. Recent efforts have attempted to address this limitation by incorporating reward models and reinforcement learning to enhance reasoning, or by employing tool-agent frameworks. However, these approaches face several challenges, including high annotation costs, reward signals that fail to capture real-time reasoning states, and low inference efficiency. To overcome these issues, we propose ReAgent-V, a novel agentic video understanding framework that integrates efficient frame selection with real-time reward generation during inference. These reward signals not only guide iterative answer refinement through a multi-perspective reflection mechanism--adjusting predictions from conservative, neutral, and aggressive viewpoints--but also enable automatic filtering of high-quality data for supervised fine-tuning (SFT), direct preference optimization (DPO), and group relative policy optimization (GRPO). ReAgent-V is lightweight, modular, and extensible, supporting flexible tool integration tailored to diverse tasks. Extensive experiments on 12 datasets across three core applications--video understanding, video reasoning enhancement, and vision-language-action model alignment--demonstrate significant gains in generalization and reasoning, with improvements of up to 6.9%, 2.1%, and 9.8%, respectively, highlighting the effectiveness and versatility of the proposed framework.
Directed-Tokens: ARobust Multi-Modality Alignment Approach to Large Language-Vision Models
Large multimodal models (LMMs) have gained impressive performance due to their outstanding capability in various understanding tasks. However, these models still suffer from some fundamental limitations related to robustness and generalization due to the alignment and correlation between visual and textual features. In this paper, we introduce a simple but efficient learning mechanism for improving the robust alignment between visual and textual modalities by solving shuffling problems. In particular, the proposed approach can improve reasoning capability, visual understanding, and cross-modality alignment by introducing two new tasks: reconstructing the image order and the text order into the LMM's pretraining and fine-tuning phases. In addition, we propose a new directed-token approach to capture visual and textual knowledge, enabling the capability to reconstruct the correct order of visual inputs. Then, we introduce a new Image-toResponse Guided loss to further improve the visual understanding of the LMM in its responses. The proposed approach consistently achieves state-of-the-art (SoTA) performance compared with prior LMMs on academic task-oriented and instruction-following LMM benchmarks.
SAFEVID: Toward Safety Aligned Video Large Multimodal Models
As Video Large Multimodal Models (VLMMs) rapidly advance, their inherent complexity introduces significant safety challenges, particularly the issue of mismatched generalization where static safety alignments fail to transfer to dynamic video contexts. We introduce SAFEVID, a framework designed to instill video-specific safety principles in VLMMs. SAFEVID uniquely transfers robust textual safety alignment capabilities to the video domain by employing detailed textual video descriptions as an interpretive bridge, facilitating LLM-based rule-driven safety reasoning. This is achieved through a closed-loop system comprising: 1) generation of SafeVid-350K, a novel 350,000-pair video-specific safety preference dataset; 2) targeted alignment of VLMMs using Direct Preference Optimization (DPO); and 3) comprehensive evaluation via our new SafeVidBench benchmark. Alignment with SafeVid-350K significantly enhances VLMM safety, with models like LLaVA-NeXT-Video demonstrating substantial improvements (e.g., up to 42.39%) on SafeVidBench. SAFEVID provides critical resources and a structured approach, demonstrating that leveraging textual descriptions as a conduit for safety reasoning markedly improves the safety alignment of VLMMs.
1 Appendix 2 AMore Details
Score 0 4 (normal) is most common across cohorts, while score 3 (severe) is rare--especially in PD-GaM 5 and 3DGait, highlighting class imbalance challenges. BMCLab offers a balanced ON/OFF medication split, 7 while E-LC is skewed toward ON-medication. DNE includes healthy, Parkinsonian, and other disease 8 groups for broader contrastive training. Figure A.3 shows label distributions for FoG-related cohorts. This artifact likely stems from the unusual top-down perspective--different from the front15 facing or side views seen in WHAM's training data [1]. While motion encoder-based models may be 16 robust to such distortions, feature-based gait classifiers rely on precise kinematic measurements and 17 thus require carefully corrected input data. To correct this slope artifact, we perform a frame-wise 18 rigid alignment of the reconstructed SMPL skeleton using the Kabsch algorithm [2]. The goal is to 19 rotate each frame so that anatomical directions align with canonical coordinate axes (up, forward), 20 while preserving natural gait structure. This motion 28 vector is then projected onto the ground plane (xz-plane) and used as the walking axis. In frames where the sacrum displacement is less than 30 4mm--indicating near-stationary posture--we fall back on a proxy direction: the cross product of 31 the hip vector (left hip to right hip) and the vertical vector.
From Likelihood to Fitness: Improving Variant Effect Prediction in Protein and Genome Language Models
Generative models trained on natural sequences are increasingly used to predict the effects of genetic variation, enabling progress in therapeutic design, disease risk prediction, and synthetic biology. In the zero-shot setting, variant impact is estimated by comparing the likelihoods of sequences, under the assumption that likelihood serves as a proxy for fitness. However, this assumption often breaks down in practice: sequence likelihood reflects not only evolutionary fitness constraints, but also phylogenetic structure and sampling biases, especially as model capacity increases. We introduce Likelihood-Fitness Bridging (LFB), a simple and general strategy that improves variant effect prediction by averaging model scores across sequences subject to similar selective pressures. Assuming an Ornstein-Uhlenbeck model of evolution, LFB can be viewed as a way to marginalize the effects of genetic drift, although its benefits appear to extend more broadly. LFB applies to existing protein and genomic language models without requiring retraining, and incurs only modest computational overhead. Evaluated on largescale deep mutational scans and clinical benchmarks, LFB consistently improves predictive performance across model families and sizes. Notably, it reverses the performance plateau observed in larger protein language models, making the largest models the most accurate when combined with LFB. These results suggest that accounting for phylogenetic and sampling biases is essential to realizing the full potential of large sequence models in variant effect prediction.
Generative RLHF-V: Learning Principles from Multi-modal Human Preference
Training multi-modal large language models (MLLMs) that align with human intentions is a long-term challenge. Traditional score-only reward models for alignment suffer from low accuracy, weak generalization, and poor interpretability, blocking the progress of alignment methods, e.g., reinforcement learning from human feedback (RLHF). Generative reward models (GRMs) leverage MLLMs' intrinsic reasoning capabilities to discriminate pair-wise responses, but their pair-wise paradigm makes it hard to generalize to learnable rewards. We introduce Generative RLHF-V, a novel alignment framework that integrates GRMs with multi-modal RLHF. We propose a two-stage pipeline: multi-modal generative reward modeling from RL, where RL guides GRMs to actively capture human intention, then predict the correct pair-wise scores; and RL optimization from grouped comparison, which enhances multi-modal RL scoring precision by grouped responses comparison. Experimental results demonstrate that, besides out-of-distribution generalization of RM discrimination, our framework improves 4 MLLMs' performance across 7 benchmarks by 18.1%, while the baseline RLHF is only 5.3%. We further validate that Generative RLHF-V achieves a near-linear improvement with an increasing number of candidate responses.