WaveFormer: A 3D Transformer with Wavelet-Driven Feature Representation for Efficient Medical Image Segmentation

Hasan, Md Mahfuz Al; Zaman, Mahdi; Jawad, Abdul; Santamaria-Pang, Alberto; Lee, Ho Hin; Tarapov, Ivan; See, Kyle; Imran, Md Shah; Roy, Antika; Fallah, Yaser Pourmohammadi; Asadizanjani, Navid; Forghani, Reza

arXiv.org Artificial Intelligence 

Transformer-based architectures have advanced medical image analysis by effectively modeling long-range dependencies, yet they often struggle in 3D settings due to substantial memory overhead and insufficient capture of fine-grained local features. We address these limitations with WaveFormer, a novel 3D transformer that: i) leverages the fundamental frequency-domain properties of features for contextual representation, and ii) is inspired by the top-down mechanism of the human visual recognition system, making it a biologically motivated architecture. By employing discrete wavelet transforms (DWT) at multiple scales, WaveFormer preserves both global context and high-frequency details while replacing heavy upsampling layers with efficient wavelet-based summarization and reconstruction. This significantly reduces the number of parameters, which is critical for real-world deployment where computational resources and training times are constrained. Furthermore, the model is generic and easily adaptable to diverse applications. Evaluations on BraTS2023, FLARE2021, and KiTS2023 demonstrate performance on par with state-of-the-art methods while offering substantially lower computational complexity.

Keywords: Transformer Model, Multi-level Attention, Discrete Wavelet Transform

1 Introduction

Medical image segmentation is fundamental to clinical applications such as tumor delineation, organ localization, and surgical planning. Deep learning-based approaches, particularly convolutional neural networks (CNNs), have demonstrated significant success by hierarchically extracting features. However, their limited receptive fields hinder the capture of long-range dependencies, a critical shortcoming in 3D applications where spatial context across distant slices

arXiv:2503.23764v2
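
The wavelet-based summarization and reconstruction mentioned in the abstract can be illustrated with a minimal sketch of a single-level 3D DWT and its inverse. This is not the authors' implementation: the Haar wavelet, the NumPy-only formulation, and the helper names (`haar_split`, `haar_merge`, `dwt3d`, `idwt3d`) are assumptions made purely for illustration. The sketch shows that the low-frequency sub-band (`LLL`) is a half-resolution summary of the volume, and that the eight sub-bands together reconstruct the input exactly without any learned upsampling parameters.

```python
import numpy as np


def haar_split(x, axis):
    """One Haar analysis step along `axis`; returns (low, high) half-length bands."""
    even = np.take(x, np.arange(0, x.shape[axis], 2), axis=axis)
    odd = np.take(x, np.arange(1, x.shape[axis], 2), axis=axis)
    return (even + odd) / np.sqrt(2.0), (even - odd) / np.sqrt(2.0)


def haar_merge(low, high, axis):
    """Inverse of haar_split: interleave the recovered even and odd samples."""
    even = (low + high) / np.sqrt(2.0)
    odd = (low - high) / np.sqrt(2.0)
    shape = list(low.shape)
    shape[axis] *= 2
    out = np.empty(shape, dtype=even.dtype)
    even_idx = [slice(None)] * out.ndim
    odd_idx = [slice(None)] * out.ndim
    even_idx[axis] = slice(0, None, 2)
    odd_idx[axis] = slice(1, None, 2)
    out[tuple(even_idx)] = even
    out[tuple(odd_idx)] = odd
    return out


def dwt3d(volume):
    """Single-level 3D Haar DWT (hypothetical helper).

    Returns a dict of 8 sub-bands keyed 'LLL' ... 'HHH'; the i-th letter
    says whether axis i was low-pass (L) or high-pass (H) filtered.
    Assumes every axis has even length.
    """
    bands = {"": np.asarray(volume, dtype=np.float64)}
    for axis in range(3):
        split = {}
        for key, band in bands.items():
            low, high = haar_split(band, axis)
            split[key + "L"] = low
            split[key + "H"] = high
        bands = split
    return bands


def idwt3d(bands):
    """Inverse of dwt3d: merge sub-bands axis by axis, last axis first."""
    for axis in (2, 1, 0):
        merged = {}
        for key in sorted({k[:-1] for k in bands}):
            merged[key] = haar_merge(bands[key + "L"], bands[key + "H"], axis)
        bands = merged
    return bands[""]


# Usage on a synthetic volume (random data, not a medical image).
rng = np.random.default_rng(0)
volume = rng.standard_normal((8, 8, 8))
bands = dwt3d(volume)
recon = idwt3d(bands)
print(bands["LLL"].shape)          # half resolution along each axis
print(np.allclose(recon, volume))  # exact reconstruction up to float error
```

A multi-scale decomposition, in the sense the abstract describes, could be obtained under the same assumptions by applying `dwt3d` again to the `LLL` sub-band. How WaveFormer combines such sub-bands with attention is not specified in this section, so the sketch deliberately stops at the transform itself.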