KAN-Mixers: a new deep learning architecture for image classification

Jorge Luiz dos Santos Canuto, Linnyer Beatrys Ruiz Aylon, Rodrigo Clemente Thom de Souza

arXiv.org Artificial Intelligence 

Computer vision is a field of artificial intelligence that encompasses methods and techniques enabling machines to learn from image data. The area includes the software, hardware, and imaging techniques required for such methods [1]. Many computer vision tasks can be solved by machines and find applications across society, including engine fault diagnosis [2], astronomy [3], human-computer interfaces [4], object detection [5, 6], and facial recognition [7]. A variety of deep learning models have been proposed to solve these tasks. With architectures based on convolutional layers, Convolutional Neural Networks (CNNs) [8] dominated computer vision tasks for several years. More recently, Transformer-based architectures, notably the Vision Transformer (ViT) [9] and the Swin Transformer [10], emerged as an alternative built on self-attention layers, a mechanism that learns relationships between different image patches. Transformers have demonstrated strong performance, often outperforming CNNs, especially on large datasets [11, 12, 13]. In 2021, Google proposed MLP-Mixer [11], a more concise visual architecture with higher inference speed than ViT. Despite its simple structure, which relies only on Multilayer Perceptrons (MLPs), MLP-Mixer achieves highly competitive results, as demonstrated by Tolstikhin et al. (2021).
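To make the "MLP-only" structure concrete, the following is a minimal NumPy sketch of one Mixer-style block: a token-mixing MLP applied across image patches, followed by a channel-mixing MLP applied across features, each with a residual connection. This is an illustrative simplification, not the paper's implementation: the layer normalization used in the original MLP-Mixer is omitted for brevity, and all function and parameter names (`mlp`, `mixer_block`, `token_params`, `channel_params`) are hypothetical.

```python
import numpy as np

def mlp(x, w1, b1, w2, b2):
    # Two-layer perceptron with a GELU activation (tanh approximation),
    # applied along the last axis of x.
    h = x @ w1 + b1
    h = 0.5 * h * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (h + 0.044715 * h**3)))
    return h @ w2 + b2

def mixer_block(x, token_params, channel_params):
    # x has shape (num_patches, channels).
    # Token mixing: transpose so the MLP mixes information across patches.
    x = x + mlp(x.T, *token_params).T
    # Channel mixing: the MLP mixes information across channels per patch.
    x = x + mlp(x, *channel_params)
    return x
```

A block like this preserves the input shape, so blocks can be stacked; the only learned components are dense layers, which is what gives the architecture its simplicity relative to self-attention.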
