Goto

Collaborating Authors

 Pan, Jiazhen


MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning

arXiv.org Artificial Intelligence

Reasoning is a critical frontier for advancing medical image analysis, where transparency and trustworthiness play a central role in both clinician trust and regulatory approval. Although Medical Visual Language Models (VLMs) show promise for radiological tasks, most existing VLMs merely produce final answers without revealing the underlying reasoning. To address this gap, we introduce MedVLM-R1, a medical VLM that explicitly generates natural language reasoning to enhance transparency and trustworthiness. Instead of relying on supervised fine-tuning (SFT), which often suffers from overfitting to training distributions and fails to foster genuine reasoning, MedVLM-R1 employs a reinforcement learning framework that incentivizes the model to discover human-interpretable reasoning paths without using any reasoning references. Despite limited training data (600 visual question answering samples) and model parameters (2B), MedVLM-R1 boosts accuracy from 55.11% to 78.22% across MRI, CT, and X-ray benchmarks, outperforming larger models trained on over a million samples. It also demonstrates robust domain generalization under out-of-distribution tasks. By unifying medical image analysis with explicit reasoning, MedVLM-R1 marks a pivotal step toward trustworthy and interpretable AI in clinical practice. Inference model is available at: https://huggingface.co/JZPeterPan/ MedVLM-R1.


Whole Heart 3D+T Representation Learning Through Sparse 2D Cardiac MR Images

arXiv.org Artificial Intelligence

Cardiac Magnetic Resonance (CMR) imaging serves as the gold-standard for evaluating cardiac morphology and function. Typically, a multi-view CMR stack, covering short-axis (SA) and 2/3/4-chamber long-axis (LA) views, is acquired for a thorough cardiac assessment. However, efficiently streamlining the complex, high-dimensional 3D+T CMR data and distilling compact, coherent representation remains a challenge. In this work, we introduce a whole-heart self-supervised learning framework that utilizes masked imaging modeling to automatically uncover the correlations between spatial and temporal patches throughout the cardiac stacks. This process facilitates the generation of meaningful and well-clustered heart representations without relying on the traditionally required, and often costly, labeled data. The learned heart representation can be directly used for various downstream tasks. Furthermore, our method demonstrates remarkable robustness, ensuring consistent representations even when certain CMR planes are missing/flawed. We train our model on 14,000 unlabeled CMR data from UK BioBank and evaluate it on 1,000 annotated data. The proposed method demonstrates superior performance to baselines in tasks that demand comprehensive 3D+T cardiac information, e.g.


Direct Cardiac Segmentation from Undersampled K-space Using Transformers

arXiv.org Artificial Intelligence

The prevailing deep learning-based methods of predicting cardiac segmentation involve reconstructed magnetic resonance (MR) images. The heavy dependency of segmentation approaches on image quality significantly limits the acceleration rate in fast MR reconstruction. Moreover, the practice of treating reconstruction and segmentation as separate sequential processes leads to artifact generation and information loss in the intermediate stage. These issues pose a great risk to achieving high-quality outcomes. To leverage the redundant k-space information overlooked in this dual-step pipeline, we introduce a novel approach to directly deriving segmentations from sparse k-space samples using a transformer (DiSK). DiSK operates by globally extracting latent features from 2D+time k-space data with attention blocks and subsequently predicting the segmentation label of query points. We evaluate our model under various acceleration factors (ranging from 4 to 64) and compare against two image-based segmentation baselines. Our model consistently outperforms the baselines in Dice and Hausdorff distances across foreground classes for all presented sampling rates.


Attention-aware non-rigid image registration for accelerated MR imaging

arXiv.org Artificial Intelligence

Accurate motion estimation at high acceleration factors enables rapid motion-compensated reconstruction in Magnetic Resonance Imaging (MRI) without compromising the diagnostic image quality. In this work, we introduce an attention-aware deep learning-based framework that can perform non-rigid pairwise registration for fully sampled and accelerated MRI. We extract local visual representations to build similarity maps between the registered image pairs at multiple resolution levels and additionally leverage long-range contextual information using a transformer-based module to alleviate ambiguities in the presence of artifacts caused by undersampling. We combine local and global dependencies to perform simultaneous coarse and fine motion estimation. The proposed method was evaluated on in-house acquired fully sampled and accelerated data of 101 patients and 62 healthy subjects undergoing cardiac and thoracic MRI. The impact of motion estimation accuracy on the downstream task of motion-compensated reconstruction was analyzed. We demonstrate that our model derives reliable and consistent motion fields across different sampling trajectories (Cartesian and radial) and acceleration factors of up to 16x for cardiac motion and 30x for respiratory motion and achieves superior image quality in motion-compensated reconstruction qualitatively and quantitatively compared to conventional and recent deep learning-based approaches. The code is publicly available at https://github.com/lab-midas/GMARAFT.


Neural Implicit k-Space for Binning-free Non-Cartesian Cardiac MR Imaging

arXiv.org Artificial Intelligence

In this work, we propose a novel image reconstruction framework that directly learns a neural implicit representation in k-space for ECG-triggered non-Cartesian Cardiac Magnetic Resonance Imaging (CMR). While existing methods bin acquired data from neighboring time points to reconstruct one phase of the cardiac motion, our framework allows for a continuous, binning-free, and subject-specific k-space representation. We assign a unique coordinate that consists of time, coil index, and frequency domain location to each sampled k-space point. We then learn the subject-specific mapping from these unique coordinates to k-space intensities using a multi-layer perceptron with frequency domain regularization. During inference, we obtain a complete k-space for Cartesian coordinates and an arbitrary temporal resolution. A simple inverse Fourier transform recovers the image, eliminating the need for density compensation and costly non-uniform Fourier transforms for non-Cartesian data. This novel imaging framework was tested on 42 radially sampled datasets from 6 subjects. The proposed method outperforms other techniques qualitatively and quantitatively using data from four and one heartbeat(s) and 30 cardiac phases. Our results for one heartbeat reconstruction of 50 cardiac phases show improved artifact removal and spatio-temporal resolution, leveraging the potential for real-time CMR.


Reconstruction-driven motion estimation for motion-compensated MR CINE imaging

arXiv.org Artificial Intelligence

In cardiac CINE, motion-compensated MR reconstruction (MCMR) is an effective approach to address highly undersampled acquisitions by incorporating motion information between frames. In this work, we propose a deep learning-based framework to address the MCMR problem efficiently. Contrary to state-of-the-art (SOTA) MCMR methods which break the original problem into two sub-optimization problems, i.e. motion estimation and reconstruction, we formulate this problem as a single entity with one single optimization. We discard the canonical motion-warping loss (similarity measurement between motion-warped images and target images) to estimate the motion, but drive the motion estimation process directly by the final reconstruction performance. The higher reconstruction quality is achieved without using any smoothness loss terms and without iterative processing between motion estimation and reconstruction. Therefore, we avoid non-trivial loss weighting factors tuning and time-consuming iterative processing. Experiments on 43 in-house acquired 2D CINE datasets indicate that the proposed MCMR framework can deliver artifact-free motion estimation and high-quality MR images even for imaging accelerations up to 20x. The proposed framework is compared to SOTA non-MCMR and MCMR methods and outperforms these methods qualitatively and quantitatively in all applied metrics across all experiments with different acceleration rates.


LAPNet: Non-rigid Registration derived in k-space for Magnetic Resonance Imaging

arXiv.org Artificial Intelligence

Physiological motion, such as cardiac and respiratory motion, during Magnetic Resonance (MR) image acquisition can cause image artifacts. Motion correction techniques have been proposed to compensate for these types of motion during thoracic scans, relying on accurate motion estimation from undersampled motion-resolved reconstruction. A particular interest and challenge lie in the derivation of reliable non-rigid motion fields from the undersampled motion-resolved data. Motion estimation is usually formulated in image space via diffusion, parametric-spline, or optical flow methods. However, image-based registration can be impaired by remaining aliasing artifacts due to the undersampled motion-resolved reconstruction. In this work, we describe a formalism to perform non-rigid registration directly in the sampled Fourier space, i.e. k-space. We propose a deep-learning based approach to perform fast and accurate non-rigid registration from the undersampled k-space data. The basic working principle originates from the Local All-Pass (LAP) technique, a recently introduced optical flow-based registration. The proposed LAPNet is compared against traditional and deep learning image-based registrations and tested on fully-sampled and highly-accelerated (with two undersampling strategies) 3D respiratory motion-resolved MR images in a cohort of 40 patients with suspected liver or lung metastases and 25 healthy subjects. The proposed LAPNet provided consistent and superior performance to image-based approaches throughout different sampling trajectories and acceleration factors.