maxvit
Brain Stroke Detection and Classification Using CT Imaging with Transformer Models and Explainable AI
Qari, Shomukh, Thafar, Maha A.
Stroke is one of the leading causes of death globally, making early and accurate diagnosis essential for improving patient outcomes, particularly in emergency settings where timely intervention is critical. CT scans are the key imaging modality because of their speed, accessibility, and cost-effectiveness. This study proposed an artificial intelligence framework for multiclass stroke classification (ischemic, hemorrhagic, and no stroke) using CT scan images from a dataset provided by the Republic of Turkey's Ministry of Health. The proposed method adopted MaxViT, a state-of-the-art Vision Transformer, as the primary deep learning model for image-based stroke classification, with additional transformer variants (vision transformer, transformer-in-transformer, and ConvNext). To enhance model generalization and address class imbalance, we applied data augmentation techniques, including synthetic image generation. The MaxViT model trained with augmentation achieved the best performance, reaching an accuracy and F1-score of 98.00%, outperforming all other evaluated models and the baseline methods. The primary goal of this study was to distinguish between stroke types with high accuracy while addressing crucial issues of transparency and trust in artificial intelligence models. To achieve this, Explainable Artificial Intelligence (XAI) was integrated into the framework, particularly Grad-CAM++. It provides visual explanations of the model's decisions by highlighting relevant stroke regions in the CT scans and establishing an accurate, interpretable, and clinically applicable solution for early stroke detection. This research contributed to the development of a trustworthy AI-assisted diagnostic tool for stroke, facilitating its integration into clinical practice and enhancing access to timely and optimal stroke diagnosis in emergency departments, thereby saving more lives.
- Asia > Middle East > Republic of Türkiye (0.24)
- Europe > Belgium > Flanders (0.04)
- South America > Brazil > São Paulo (0.04)
- Asia > Middle East > Saudi Arabia (0.04)
- Health & Medicine > Therapeutic Area > Neurology (1.00)
- Health & Medicine > Therapeutic Area > Hematology (1.00)
- Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)
MaxGlaViT: A novel lightweight vision transformer-based approach for early diagnosis of glaucoma stages from fundus images
Yurdakul, Mustafa, Uyar, Kubra, Tasdemir, Sakir
Glaucoma is a prevalent eye disease that progresses silently without symptoms. If not detected and treated early, it can cause permanent vision loss. Computer-assisted diagnosis systems play a crucial role in timely and efficient identification. This study introduces MaxGlaViT, a lightweight model based on the restructured Multi-Axis Vision Transformer (MaxViT) for early glaucoma detection. First, MaxViT was scaled to optimize block and channel numbers, resulting in a lighter architecture. Second, the stem was enhanced by adding attention mechanisms (CBAM, ECA, SE) after convolution layers to improve feature learning. Third, MBConv structures in MaxViT blocks were replaced by advanced DL blocks (ConvNeXt, ConvNeXtV2, InceptionNeXt). The model was evaluated using the HDV1 dataset, containing fundus images of different glaucoma stages. Additionally, 40 CNN and 40 ViT models were tested on HDV1 to validate MaxGlaViT's efficiency. Among CNN models, EfficientB6 achieved the highest accuracy (84.91%), while among ViT models, MaxViT-Tiny performed best (86.42%). The scaled MaxViT reached 87.93% accuracy. Adding ECA to the stem block increased accuracy to 89.01%. Replacing MBConv with ConvNeXtV2 further improved it to 89.87%. Finally, integrating ECA in the stem and ConvNeXtV2 in MaxViT blocks resulted in 92.03% accuracy. Testing 80 DL models for glaucoma stage classification, this study presents a comprehensive and comparative analysis. MaxGlaViT outperforms experimental and state-of-the-art models, achieving 92.03% accuracy, 92.33% precision, 92.03% recall, 92.13% f1-score, and 87.12% Cohen's kappa score.
- Asia > Middle East > Republic of Türkiye > Antalya Province > Antalya (0.04)
- Asia > Middle East > Republic of Türkiye > Konya Province > Konya (0.04)
- Asia > Indonesia > Bali (0.04)
DFCon: Attention-Driven Supervised Contrastive Learning for Robust Deepfake Detection
Shanto, MD Sadik Hossain, Dihan, Mahir Labib, Ghosh, Souvik, Anonto, Riad Ahmed, Chowdhury, Hafijul Hoque, Muhtasim, Abir, Ahsan, Rakib, Hassan, MD Tanvir, Sojib, MD Roqunuzzaman, Hakim, Sheikh Azizul, Rahman, M. Saifur
This report presents our approach for the IEEE SP Cup 2025: Deepfake Face Detection in the Wild (DFWild-Cup), focusing on detecting deepfakes across diverse datasets. Our methodology employs advanced backbone models, including MaxViT, CoAtNet, and EVA-02, fine-tuned using supervised contrastive loss to enhance feature separation. These models were specifically chosen for their complementary strengths. Integration of convolution layers and strided attention in MaxViT is well-suited for detecting local features. In contrast, hybrid use of convolution and attention mechanisms in CoAtNet effectively captures multi-scale features. Robust pretraining with masked image modeling of EVA-02 excels at capturing global features. After training, we freeze the parameters of these models and train the classification heads. Finally, a majority voting ensemble is employed to combine the predictions from these models, improving robustness and generalization to unseen scenarios. The proposed system addresses the challenges of detecting deepfakes in real-world conditions and achieves a commendable accuracy of 95.83% on the validation dataset.
Transforming Precision: A Comparative Analysis of Vision Transformers, CNNs, and Traditional ML for Knee Osteoarthritis Severity Diagnosis
Apon, Tasnim Sakib, Fahim-Ul-Islam, Md., Rafin, Nafiz Imtiaz, Akter, Joya, Alam, Md. Golam Rabiul
Knee osteoarthritis(KO) is a degenerative joint disease that can cause severe pain and impairment. With increased prevalence, precise diagnosis by medical imaging analytics is crucial for appropriate illness management. This research investigates a comparative analysis between traditional machine learning techniques and new deep learning models for diagnosing KO severity from X-ray pictures. This study does not introduce new architectural innovations but rather illuminates the robust applicability and comparative effectiveness of pre-existing ViT models in a medical imaging context, specifically for KO severity diagnosis. The insights garnered from this comparative analysis advocate for the integration of advanced ViT models in clinical diagnostic workflows, potentially revolutionizing the precision and reliability of KO assessments. This study does not introduce new architectural innovations but rather illuminates the robust applicability and comparative effectiveness of pre-existing ViT models in a medical imaging context, specifically for KO severity diagnosis. The insights garnered from this comparative analysis advocate for the integration of advanced ViT models in clinical diagnostic workflows, potentially revolutionizing the precision & reliability of KO assessments. The study utilizes an osteoarthritis dataset from the Osteoarthritis Initiative (OAI) comprising images with 5 severity categories and uneven class distribution. While classic machine learning models like GaussianNB and KNN struggle in feature extraction, Convolutional Neural Networks such as Inception-V3, VGG-19 achieve better accuracy between 55-65% by learning hierarchical visual patterns. However, Vision Transformer architectures like Da-VIT, GCViT and MaxViT emerge as indisputable champions, displaying 66.14% accuracy, 0.703 precision, 0.614 recall, AUC exceeding 0.835 thanks to self-attention processes.
- Asia > Bangladesh > Dhaka Division > Dhaka District > Dhaka (0.05)
- North America > United States > Maine (0.04)
- North America > United States > District of Columbia > Washington (0.04)
- (5 more...)
- Health & Medicine > Therapeutic Area > Rheumatology (1.00)
- Health & Medicine > Therapeutic Area > Musculoskeletal (1.00)
- Health & Medicine > Therapeutic Area > Immunology (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)
MaxVit -- Multi Axis Vision Transformer
Over the past few years, there has been a lot of competition of iterative improvements based on Convolution Nets and the relatively recent Transformer in terms of being the best architecture on the standard Image vision tasks. In the paper published in ECCV 2022, Researchers in Google Research and UT Austin introduce MaxVit. MaxVit -- Multi Axis Vision Transformer aims to combine the best features of both Convolution and Transformer by solving the issue related to global attention in transformers. We will first discuss the set of Vision Task for which these methods are applied .A typical vision task involved taking the input of 2D image and taking the RGB matrix format to your Neural network architecture. Image Classification is the problem of assigning labels to images from a fixed set of categories.
MaxViT: Multi-Axis Vision Transformer
Tu, Zhengzhong, Talebi, Hossein, Zhang, Han, Yang, Feng, Milanfar, Peyman, Bovik, Alan, Li, Yinxiao
Transformers have recently gained significant attention in the computer vision community. However, the lack of scalability of self-attention mechanisms with respect to image size has limited their wide adoption in state-of-the-art vision backbones. In this paper we introduce an efficient and scalable attention model we call multi-axis attention, which consists of two aspects: blocked local and dilated global attention. These design choices allow global-local spatial interactions on arbitrary input resolutions with only linear complexity. We also present a new architectural element by effectively blending our proposed attention model with convolutions, and accordingly propose a simple hierarchical vision backbone, dubbed MaxViT, by simply repeating the basic building block over multiple stages. Notably, MaxViT is able to ''see'' globally throughout the entire network, even in earlier, high-resolution stages. We demonstrate the effectiveness of our model on a broad spectrum of vision tasks. On image classification, MaxViT achieves state-of-the-art performance under various settings: without extra data, MaxViT attains 86.5% ImageNet-1K top-1 accuracy; with ImageNet-21K pre-training, our model achieves 88.7% top-1 accuracy. For downstream tasks, MaxViT as a backbone delivers favorable performance on object detection as well as visual aesthetic assessment. We also show that our proposed model expresses strong generative modeling capability on ImageNet, demonstrating the superior potential of MaxViT blocks as a universal vision module. The source code and trained models will be available at https://github.com/google-research/maxvit.
- North America > United States > Texas > Travis County > Austin (0.04)
- North America > United States > Florida > Broward County > Fort Lauderdale (0.04)