Vision Hopfield Memory Networks
Jianfeng Wang, Amine M'Charrak, Luk Koska, Xiangtao Wang, Daniel Petriceanu, Mykyta Smyrnov, Ruizhi Wang, Michael Bumbar, Luca Pinchetti, Thomas Lukasiewicz
Recent vision and multimodal foundation backbones, such as Transformer families and state-space models like Mamba, have achieved remarkable progress, enabling unified modeling across images, text, and beyond. Despite their empirical success, these architectures remain far from the computational principles of the human brain, often demanding enormous amounts of training data while offering limited interpretability. In this work, we propose the Vision Hopfield Memory Network (V-HMN), a brain-inspired foundation backbone that integrates hierarchical memory mechanisms with iterative refinement updates. Specifically, V-HMN incorporates local Hopfield modules that provide associative memory dynamics at the image patch level, global Hopfield modules that function as episodic memory for contextual modulation, and a predictive-coding-inspired refinement rule for iterative error correction. By organizing these memory-based modules hierarchically, V-HMN captures both local and global dynamics in a unified framework. Memory retrieval exposes the relationship between inputs and stored patterns, making decisions more interpretable, while the reuse of stored patterns improves data efficiency. This brain-inspired design therefore enhances interpretability and data efficiency beyond existing self-attention- or state-space-based approaches. We conducted extensive experiments on public computer vision benchmarks, and V-HMN achieved competitive results against widely adopted backbone architectures, while offering better interpretability, higher data efficiency, and stronger biological plausibility. These findings highlight the potential of V-HMN to serve as a next-generation vision foundation model, while also providing a generalizable blueprint for multimodal backbones in domains such as text and audio, thereby bridging brain-inspired computation with large-scale machine learning.
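To make the memory dynamics concrete, below is a minimal sketch (not the authors' implementation) of the softmax-based retrieval update of modern continuous Hopfield networks (Ramsauer et al., 2021), on which associative-memory modules of this kind are typically built, together with an illustrative predictive-coding-style error-correction step. All function names, tensor shapes, and the step size are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def hopfield_retrieve(queries, memory, beta=1.0, n_steps=1):
    """Continuous modern Hopfield retrieval (Ramsauer et al., 2021).

    queries: (B, N, D) state patterns to refine (e.g., patch embeddings)
    memory:  (B, M, D) stored patterns to retrieve from
    Each step applies xi <- softmax(beta * xi memory^T) memory, pulling each
    state toward the nearest stored pattern (or a metastable mixture).
    """
    xi = queries
    for _ in range(n_steps):
        attn = F.softmax(beta * xi @ memory.transpose(-2, -1), dim=-1)  # (B, N, M)
        xi = attn @ memory                                              # (B, N, D)
    return xi

def pc_refine(x, memory, beta=8.0, n_steps=3, step_size=0.5):
    """Illustrative predictive-coding-style refinement: repeatedly move the
    state toward the memory's retrieval by correcting the prediction error."""
    for _ in range(n_steps):
        pred = hopfield_retrieve(x, memory, beta=beta, n_steps=1)
        x = x + step_size * (pred - x)  # iterative error-correction update
    return x

# Illustrative use: a local module can retrieve over patches of the same
# image, while a global module might retrieve over learned episodic slots.
patches = torch.randn(2, 196, 64)   # B=2 images, 14x14 patches, D=64
refined = pc_refine(patches, patches)
```

A hierarchical stack in this spirit would alternate local retrieval (memory = the image's own patches) with global retrieval (memory = shared episodic slots), which is what makes the input-to-pattern attention maps inspectable.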
A Experimental setup
In this section, we detail the model architectures examined in our experiments and list all hyperparameters used. Both architectures consist of five stages. In the VGG networks, each stage combines convolutional layers with ReLU activations and a max-pooling layer, with base channel widths of 64, 128, 256, 512, and 512 in consecutive stages. In the ResNets, the stages following the initial convolutional stage are composed of residual blocks, and we report results for the 'conv2' layers.
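As a concrete reading of the VGG configuration above, the following is a minimal sketch. The text fixes only the channel widths (64, 128, 256, 512, 512), so the number of convolutions per stage, the kernel size, and the function names here are assumptions for illustration.

```python
import torch.nn as nn

# Channel widths of the five consecutive VGG stages, as listed above.
VGG_STAGE_CHANNELS = [64, 128, 256, 512, 512]

def make_vgg_stages(in_channels=3, stage_channels=VGG_STAGE_CHANNELS):
    """Five stages, each a conv + ReLU block followed by 2x2 max pooling.
    One convolution per stage is assumed; the text fixes only the widths."""
    stages = []
    c_in = in_channels
    for c_out in stage_channels:
        stages.append(nn.Sequential(
            nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
        ))
        c_in = c_out
    return nn.Sequential(*stages)

# With five 2x2 poolings, a 32x32 CIFAR input is reduced to 1x1x512 features.
backbone = make_vgg_stages()
```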