AITopics | mlp-mixer

mlp-mixer

Vision Hopfield Memory Networks

Wang, Jianfeng, M'Charrak, Amine, Koska, Luk, Wang, Xiangtao, Petriceanu, Daniel, Smyrnov, Mykyta, Wang, Ruizhi, Bumbar, Michael, Pinchetti, Luca, Lukasiewicz, Thomas

arXiv.org Machine Learning | Mar-27-2026

Recent vision and multimodal foundation backbones, such as Transformer families and state-space models like Mamba, have achieved remarkable progress, enabling unified modeling across images, text, and beyond. Despite their empirical success, these architectures remain far from the computational principles of the human brain, often demanding enormous amounts of training data while offering limited interpretability. In this work, we propose the Vision Hopfield Memory Network (V-HMN), a brain-inspired foundation backbone that integrates hierarchical memory mechanisms with iterative refinement updates. Specifically, V-HMN incorporates local Hopfield modules that provide associative memory dynamics at the image patch level, global Hopfield modules that function as episodic memory for contextual modulation, and a predictive-coding-inspired refinement rule for iterative error correction. By organizing these memory-based modules hierarchically, V-HMN captures both local and global dynamics in a unified framework. Memory retrieval exposes the relationship between inputs and stored patterns, making decisions more interpretable, while the reuse of stored patterns improves data efficiency. This brain-inspired design therefore enhances interpretability and data efficiency beyond existing self-attention- or state-space-based approaches. We conducted extensive experiments on public computer vision benchmarks, and V-HMN achieved competitive results against widely adopted backbone architectures, while offering better interpretability, higher data efficiency, and stronger biological plausibility. These findings highlight the potential of V-HMN to serve as a next-generation vision foundation model, while also providing a generalizable blueprint for multimodal backbones in domains such as text and audio, thereby bridging brain-inspired computation with large-scale machine learning.
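The abstract above does not give V-HMN's update equations, so the following is only a minimal sketch of associative memory retrieval in general. It assumes the softmax-based update of modern continuous Hopfield networks; the names `hopfield_retrieve`, `beta`, and `steps` are hypothetical and are not taken from the paper. The returned weights show which stored pattern a query was matched to, which is the kind of input-to-memory relationship the abstract describes as interpretable.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hopfield_retrieve(patterns, query, beta=4.0, steps=3):
    """Iteratively pull `query` toward the stored patterns.

    Assumed update (modern continuous Hopfield rule, NOT taken from the paper):
        state <- sum_i softmax(beta * <pattern_i, state>)_i * pattern_i
    Returns the refined state and the retrieval weights of the last step.
    """
    state = list(query)
    weights = []
    for _ in range(steps):
        # Similarity of the current state to every stored pattern.
        scores = [beta * sum(p * s for p, s in zip(pattern, state))
                  for pattern in patterns]
        weights = softmax(scores)
        # New state is the similarity-weighted average of the stored patterns.
        state = [sum(w * pattern[d] for w, pattern in zip(weights, patterns))
                 for d in range(len(state))]
    return state, weights


# Three stored patterns and a noisy version of the first one.
stored = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
noisy = [0.9, 0.2, 0.1]
retrieved, weights = hopfield_retrieve(stored, noisy)
best = max(range(len(weights)), key=weights.__getitem__)
```

Here `best` is the index of the stored pattern that dominates the retrieval, and `weights` can be inspected directly to see how strongly each memory contributed.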

artificial intelligence, deep learning, machine learning, (18 more...)

arXiv.org Machine Learning

2603.25157

Country:

North America > Canada > Ontario > Toronto (0.14)
Europe > United Kingdom (0.04)
Europe > Austria > Vienna (0.04)

Genre: Research Report (0.82)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.46)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

MLP-Mixer: An all-MLP Architecture for Vision

Neural Information Processing Systems | Dec-24-2025, 22:11:50 GMT

Convolutional Neural Networks (CNNs) are the go-to model for computer vision. Recently, attention-based networks, such as the Vision Transformer, have also become popular. In this paper we show that while convolutions and attention are both sufficient for good performance, neither of them is necessary. We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs). MLP-Mixer contains two types of layers: one with MLPs applied independently to image patches (i.e.

all-mlp architecture, mlp-mixer, name change, (4 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Perceptrons (0.61)

MLP-Mixer: An all-MLP Architecture for Vision

Neural Information Processing Systems | May-27-2025, 04:13:52 GMT

all-mlp architecture, image understanding, machine learning, (3 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Vision > Image Understanding (0.91)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Perceptrons (0.65)

KAN-Mixers: a new deep learning architecture for image classification

Canuto, Jorge Luiz dos Santos, Aylon, Linnyer Beatrys Ruiz, de Souza, Rodrigo Clemente Thom

arXiv.org Artificial Intelligence | Mar-11-2025

Computer vision is a field of artificial intelligence that encompasses methods and techniques that provide machines with the ability to learn from image data. This area of computer science includes software, hardware, and imaging techniques required for such methods [1]. In this context, several computer vision tasks can be solved by machines and find applications in various areas of society, including engine fault diagnosis [2], astronomy [3], human-computer interfaces [4], object detection [5, 6], and facial recognition [7]. Several deep learning models have been proposed to solve such tasks. With their architecture based on convolutional layers, Convolutional Neural Networks (CNNs) [8] dominated computer vision tasks for a few years. Recently, Transformer-based architectures, specifically Vision Transformer (ViT) [9] and Swin Transformer [10], emerged as an alternative based on self-attention layers, a mechanism that learns relationships between different image patches. Transformers have since demonstrated attractive performance, often outperforming CNNs, especially on large datasets [11, 12, 13]. In 2021, Google proposed MLP-Mixer [11], a more concise visual architecture with higher inference speed than ViT. Despite its simple structure, which relies only on Multilayer Perceptrons (MLPs), MLP-Mixer achieves extremely competitive results, as demonstrated in Tolstikhin (2021).

architecture, dataset, kan-mixer model, (13 more...)

arXiv.org Artificial Intelligence

2503.08939

Country: South America > Brazil > Paraná (0.05)

Genre: Research Report > Experimental Study (0.50)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

MLP-Mixer: An all-MLP Architecture for Vision

Neural Information Processing Systems | Jan-19-2025, 05:02:38 GMT

all-mlp architecture, mlp-mixer, transformer

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Vision > Image Understanding (0.91)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Perceptrons (0.65)

MixMAS: A Framework for Sampling-Based Mixer Architecture Search for Multimodal Fusion and Learning

Chergui, Abdelmadjid, Bezirganyan, Grigor, Sellami, Sana, Berti-Équille, Laure, Fournier, Sébastien

arXiv.org Artificial Intelligence | Dec-24-2024

Choosing a suitable deep learning architecture for multimodal data fusion is a challenging task, as it requires the effective integration and processing of diverse data types, each with distinct structures and characteristics. In this paper, we introduce MixMAS, a novel framework for sampling-based mixer architecture search tailored to multimodal learning. Our approach automatically selects the optimal MLP-based architecture for a given multimodal machine learning (MML) task. Specifically, MixMAS utilizes a sampling-based micro-benchmarking strategy to explore various combinations of modality-specific encoders, fusion functions, and fusion networks, systematically identifying the architecture that best meets the task's performance metrics.

architecture, artificial intelligence, machine learning, (13 more...)

arXiv.org Artificial Intelligence

2412.18437

Country:

Europe > France (0.31)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Interpolated-MLPs: Controllable Inductive Bias

Wu, Sean, Hong, Jordan, Bai, Keyu, Bachmann, Gregor

arXiv.org Machine Learning | Oct-12-2024

Due to their weak inductive bias, Multi-Layer Perceptrons (MLPs) have subpar performance at low-compute levels compared to standard architectures such as convolution-based networks (CNNs). Recent work, however, has shown that the performance gap drastically reduces as the amount of compute is increased without changing the amount of inductive bias. In this work, we study the converse: in the low-compute regime, how does the incremental increase of inductive bias affect performance? To quantify inductive bias, we propose a "soft MLP" approach, which we coin Interpolated MLP (I-MLP). We control the amount of inductive bias in the standard MLP by introducing a novel algorithm based on interpolation between fixed weights from a prior model with high inductive bias. We showcase our method using various prior models, including CNNs and the MLP-Mixer architecture. This interpolation scheme allows fractional control of inductive bias, which may be attractive when full inductive bias is not desired (e.g. in the mid-compute regime). We find experimentally that for vision tasks in the low-compute regime, there is a continuous and two-sided logarithmic relationship between inductive bias and performance when using CNN and MLP-Mixer prior models.
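The abstract does not spell out the interpolation algorithm, so the sketch below only illustrates the general idea of fractional control under a simplifying assumption: a plain element-wise linear blend between a structured "prior" weight matrix and an unconstrained one. The function `interpolate_weights`, the coefficient `alpha`, and both example matrices are hypothetical and are not taken from the paper.

```python
def interpolate_weights(prior, free, alpha):
    """Blend two same-shaped weight matrices element-wise.

    alpha = 1.0 returns the prior (high inductive bias) weights,
    alpha = 0.0 returns the free (plain MLP) weights.
    This linear blend is an illustrative assumption, not the paper's algorithm.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [[alpha * p + (1.0 - alpha) * f for p, f in zip(prior_row, free_row)]
            for prior_row, free_row in zip(prior, free)]


# A banded matrix stands in for a convolution-like prior (local connectivity);
# the dense matrix stands in for unconstrained MLP weights.
prior = [[1.0, 0.5, 0.0],
         [0.5, 1.0, 0.5],
         [0.0, 0.5, 1.0]]
free = [[0.2, -0.4, 0.6],
        [0.1, 0.3, -0.2],
        [-0.5, 0.7, 0.0]]

half = interpolate_weights(prior, free, 0.5)
```

Sweeping `alpha` from 0 to 1 moves the layer from fully unconstrained to fully prior-shaped, which is one simple way to picture "fractional" inductive bias.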

i-mlp, inductive bias, interpolation, (16 more...)

arXiv.org Machine Learning

2410.09655

Country:

North America > Canada > Ontario > Toronto (0.14)
Europe > Switzerland > Zürich > Zürich (0.14)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Perceptrons (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Demonstrating the Efficacy of Kolmogorov-Arnold Networks in Vision Tasks

Cheon, Minjong

arXiv.org Artificial Intelligence | Jun-21-2024

In the realm of deep learning, the Kolmogorov-Arnold Network (KAN) has emerged as a potential alternative to multilayer perceptrons (MLPs). However, its applicability to vision tasks has not been extensively validated. In our study, we demonstrated the effectiveness of KAN for vision tasks through multiple trials on the MNIST, CIFAR10, and CIFAR100 datasets, using a training batch size of 32. Our results showed that while KAN outperformed the original MLP-Mixer on CIFAR10 and CIFAR100, it performed slightly worse than the state-of-the-art ResNet-18. These findings suggest that KAN holds significant promise for vision tasks, and further modifications could enhance its performance in future evaluations. Our contributions are threefold: first, we showcase the efficiency of KAN-based algorithms for visual tasks; second, we provide extensive empirical assessments across various vision benchmarks, comparing KAN's performance with MLP-Mixer, CNNs, and Vision Transformers (ViT); and third, we pioneer the use of natural KAN layers in visual tasks, addressing a gap in previous research. This paper lays the foundation for future studies on KANs, highlighting their potential as a reliable alternative for image classification tasks.

dataset, kan, kolmogorov-arnold network, (13 more...)

arXiv.org Artificial Intelligence

2406.14916

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.92)

Hierarchical Associative Memory, Parallelized MLP-Mixer, and Symmetry Breaking

Karakida, Ryo, Ota, Toshihiro, Taki, Masato

arXiv.org Machine Learning | Jun-17-2024

Transformers have established themselves as the leading neural network model in natural language processing and are increasingly foundational in various domains. In vision, the MLP-Mixer model has demonstrated competitive performance, suggesting that attention mechanisms might not be indispensable. Inspired by this, recent research has explored replacing attention modules with other mechanisms, including those described by MetaFormers. However, the theoretical framework for these models remains underdeveloped. This paper proposes a novel perspective by integrating Krotov's hierarchical associative memory with MetaFormers, enabling a comprehensive representation of the entire Transformer block, encompassing token-/channel-mixing modules, layer normalization, and skip connections, as a single Hopfield network. This approach yields a parallelized MLP-Mixer derived from a three-layer Hopfield network, which naturally incorporates symmetric token-/channel-mixing modules and layer normalization. Empirical studies reveal that symmetric interaction matrices in the model hinder performance in image recognition tasks. Introducing symmetry-breaking effects transitions the performance of the symmetric parallelized MLP-Mixer to that of the vanilla MLP-Mixer. This indicates that during standard training, weight matrices of the vanilla MLP-Mixer spontaneously acquire a symmetry-breaking configuration, enhancing their effectiveness. These findings offer insights into the intrinsic properties of Transformers and MLP-Mixers and their theoretical underpinnings, providing a robust framework for future model design and optimization.
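For orientation, the components named above (token-/channel-mixing modules, layer normalization, and skip connections) can be sketched as one vanilla, sequential Mixer block. This is a simplified illustration in plain Python, not the parallelized or symmetric variant the paper derives: biases and the learnable layer-norm scale and shift are omitted, and all function names, sizes, and weights are assumptions made for the example.

```python
import math
import random


def layer_norm(row, eps=1e-6):
    """Normalize one row to zero mean and unit variance (no learnable scale/shift)."""
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    return [(v - mean) / math.sqrt(var + eps) for v in row]


def gelu(v):
    """Exact GELU activation."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def mlp(x, w1, w2):
    """Two-layer MLP applied to every row of x (biases omitted for brevity)."""
    hidden = [[gelu(v) for v in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)


def mixer_block(x, token_w1, token_w2, chan_w1, chan_w2):
    """One Mixer block on x of shape (patches, channels)."""
    # Token mixing: transpose so each row holds one channel across all patches.
    normed = [layer_norm(row) for row in x]
    x = add(x, transpose(mlp(transpose(normed), token_w1, token_w2)))  # skip connection
    # Channel mixing: each row holds all channels of one patch.
    normed = [layer_norm(row) for row in x]
    return add(x, mlp(normed, chan_w1, chan_w2))  # skip connection


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
patches, channels = 4, 3
x = random_matrix(patches, channels, rng)
out = mixer_block(
    x,
    random_matrix(patches, 8, rng), random_matrix(8, patches, rng),    # token-mixing MLP
    random_matrix(channels, 6, rng), random_matrix(6, channels, rng),  # channel-mixing MLP
)
```

The output keeps the input shape of `(patches, channels)`. Because both sub-layers sit on skip connections, setting all weights to zero makes the block an identity map.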

associative memory model, hopfield network, mlp-mixer, (13 more...)

arXiv.org Machine Learning

2406.1222

Country: Europe > Netherlands > North Holland > Amsterdam (0.04)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

NiNformer: A Network in Network Transformer with Token Mixing Generated Gating Function

Abdullah, Abdullah Nazhat, Aydin, Tarkan

arXiv.org Artificial Intelligence | Jun-13-2024

The attention mechanism is the main component of the transformer architecture, and since its introduction, it has led to significant advancements in deep learning that span many domains and multiple tasks. The attention mechanism was utilized in computer vision as the Vision Transformer (ViT), and its usage has expanded into many tasks in the vision domain, such as classification, segmentation, object detection, and image generation. While this mechanism is very expressive and capable, it comes with the drawback of being computationally expensive and requiring datasets of considerable size for effective optimization. To address these shortcomings, many designs have been proposed in the literature to reduce the computational burden and alleviate the data size requirements. Examples of such attempts in the vision domain are the MLP-Mixer, the Conv-Mixer, the Perceiver IO, and many more. This paper introduces a new computational block as an alternative to the standard ViT block. It reduces the compute burden by replacing the normal attention layers with a Network in Network structure that enhances the static approach of the MLP-Mixer with a dynamic system that learns an element-wise gating function through a token mixing process. Extensive experimentation shows that the proposed design provides better performance than the baseline architectures on multiple datasets applied in the image classification task of the vision domain.

architecture, nonresidual path, transformer, (14 more...)

arXiv.org Artificial Intelligence

2403.02411

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Asia > Middle East > Republic of Türkiye (0.04)
North America > Canada > Ontario > Toronto (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
