
Neural Information Processing Systems

Throughout our lives, we learn a huge number of associations between concepts: the taste of a particular food, the meaning of a gesture, or to stop when we see a red light.


Self-Evidencing Through Hierarchical Gradient Decomposition: A Dissipative System That Maintains Non-Equilibrium Steady-State by Minimizing Variational Free Energy

McCulloch, Michael James

arXiv.org Artificial Intelligence

The Free Energy Principle (FEP) states that self-organizing systems must minimize variational free energy to persist (Friston, 2010, 2019), but the path from principle to implementable algorithm has remained unclear. We present a constructive proof that the FEP can be realized through exact local credit assignment. The system decomposes gradient computation hierarchically: spatial credit via feedback alignment, temporal credit via eligibility traces, and structural credit via a Trophic Field Map (TFM) that estimates expected gradient magnitude for each connection block. We prove these mechanisms are exact at their respective levels and validate the central claim empirically: the TFM achieves 0.9693 Pearson correlation with oracle gradients. This exactness produces emergent capabilities including 98.6% retention after task interference, autonomous recovery from 75% structural damage, self-organized criticality (spectral radius ρ ≈ 1.0), and sample-efficient reinforcement learning on continuous control tasks without replay buffers. The architecture unifies Prigogine's dissipative structures (Prigogine, 1977), Friston's free energy minimization (Friston, 2010), and Hopfield's attractor dynamics (Hopfield, 1982; Amit et al., 1985a,b), demonstrating that exact hierarchical inference over network topology can be implemented with local, biologically plausible rules.
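The abstract's "spatial credit via feedback alignment" can be sketched in a few lines: error is propagated through a fixed random matrix rather than the transpose of the forward weights, so the update rule stays local. The network sizes, quadratic loss, and single-sample training loop below are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 4, 8, 2

W1 = rng.normal(0, 0.5, (n_hid, n_in))   # forward weights, layer 1
W2 = rng.normal(0, 0.5, (n_out, n_hid))  # forward weights, layer 2
B = rng.normal(0, 0.5, (n_hid, n_out))   # fixed random feedback (replaces W2.T)

def train_step(x, y, lr=0.05):
    global W1, W2
    h = np.tanh(W1 @ x)                  # forward pass
    y_hat = W2 @ h
    e = y_hat - y                        # output error
    # Feedback alignment: error travels through the fixed matrix B,
    # not through W2.T as exact backpropagation would require.
    delta_h = (B @ e) * (1 - h**2)
    W2 -= lr * np.outer(e, h)
    W1 -= lr * np.outer(delta_h, x)
    return 0.5 * float(e @ e)

x = rng.normal(size=n_in)
y = np.array([1.0, -1.0])
losses = [train_step(x, y) for _ in range(200)]
print(losses[-1] < losses[0])            # loss decreases despite fixed random feedback
```

The forward weights gradually align with the random feedback matrix, which is why learning still succeeds without weight transport.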



In-Context Algorithm Emulation in Fixed-Weight Transformers

Hu, Jerry Yao-Chieh, Liu, Hude, Zhang, Jennifer Yuntong, Liu, Han

arXiv.org Machine Learning

We prove that a minimal Transformer architecture with frozen weights is capable of emulating a broad class of algorithms by in-context prompting. In particular, for any algorithm implementable by a fixed-weight attention head (e.g. one-step gradient descent or linear/ridge regression), there exists a prompt that drives a two-layer softmax attention module to reproduce the algorithm's output with arbitrary precision. This guarantee extends even to a single-head attention layer (using longer prompts if necessary), achieving architectural minimality. Our key idea is to construct prompts that encode an algorithm's parameters into token representations, creating sharp dot-product gaps that force the softmax attention to follow the intended computation. This construction requires no feed-forward layers and no parameter updates. All adaptation happens through the prompt alone. These findings forge a direct link between in-context learning and algorithmic emulation, and offer a simple mechanism for large Transformers to serve as prompt-programmable libraries of algorithms. They illuminate how GPT-style foundation models may swap algorithms via prompts alone, establishing a form of algorithmic universality in modern Transformer models.
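The "sharp dot-product gap" mechanism can be illustrated with a toy attention head: scaling the query–key match by a large inverse temperature makes softmax attention act as near-hard selection of the prompt token that encodes the desired computation. The prompt layout, the scalar "algorithm outputs" in the value vectors, and the beta values are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Three "algorithm" tokens in the prompt; values carry their outputs.
keys = np.eye(3)
values = np.array([[1.0], [2.0], [5.0]])

def attend(query, beta):
    logits = beta * (keys @ query)       # the dot-product gap sharpens as beta grows
    return softmax(logits) @ values

q = np.array([0.0, 1.0, 0.0])            # query matches the second prompt token
print(float(attend(q, beta=1.0)[0]))     # soft mixture of all three outputs
print(float(attend(q, beta=50.0)[0]))    # approaches 2.0: near-exact selection
```

With a large enough gap, the softmax weights on the non-matching tokens are exponentially suppressed, which is the sense in which the prompt alone "programs" the frozen attention head.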



Effects of Feature Correlations on Associative Memory Capacity

Bielmeier, Stefan, Friedland, Gerald

arXiv.org Machine Learning

We investigate how feature correlations influence the capacity of Dense Associative Memory (DAM), a Transformer attention-like model. Practical machine learning scenarios involve feature-correlated data and learn representations in the input space, but current capacity analyses do not account for this. We develop an empirical framework to analyze the effects of data structure on capacity dynamics. Specifically, we systematically construct datasets that vary in feature correlation and pattern separation using Hamming distance from information theory, and compute the model's corresponding storage capacity using a simple binary search algorithm. Our experiments confirm that memory capacity scales exponentially with increasing separation in the input space. Feature correlations do not alter this relationship fundamentally, but reduce capacity slightly at constant separation. This effect is amplified at higher polynomial degrees in the energy function, suggesting that Associative Memory is more limited in depicting higher-order interactions between features than patterns. Our findings bridge theoretical work and practical settings for DAM, and might inspire more data-centric methods.
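The capacity probe described in the abstract can be sketched as follows: store random ±1 patterns in a Dense Associative Memory with polynomial energy f(z) = zⁿ, check that each stored pattern is a fixed point of the one-step update, and binary-search the largest pattern count that still retrieves perfectly. The network size, degree n, search bounds, and the use of freshly drawn patterns at each probe are rough assumptions; the paper's actual protocol additionally controls feature correlation and Hamming separation.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 30          # neurons
n = 3           # polynomial degree of the DAM energy (odd, so sign() is meaningful)

def all_fixed_points(patterns):
    X = np.array(patterns)                        # (K, N) stored patterns
    for s in X:
        for i in range(N):
            sp, sm = s.copy(), s.copy()
            sp[i], sm[i] = 1, -1
            # Energy difference between flipping bit i up vs. down
            drive = ((X @ sp) ** n - (X @ sm) ** n).sum()
            if np.sign(drive) != s[i]:            # bit i would flip: not a fixed point
                return False
    return True

def capacity(lo=1, hi=200):
    # Binary search on the pattern count, assuming capacity < hi.
    while lo < hi:
        mid = (lo + hi + 1) // 2
        pats = rng.choice([-1, 1], size=(mid, N))
        if all_fixed_points(pats):
            lo = mid
        else:
            hi = mid - 1
    return lo

c = capacity()
print(c)
```

Because each probe redraws patterns, the result is a noisy estimate; averaging over draws at each candidate count would give a cleaner capacity curve.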


In-context denoising with one-layer transformers: connections between attention and associative memory retrieval

Smart, Matthew, Bietti, Alberto, Sengupta, Anirvan M.

arXiv.org Artificial Intelligence

We introduce in-context denoising, a task that refines the connection between attention-based architectures and dense associative memory (DAM) networks, also known as modern Hopfield networks. Using a Bayesian framework, we show theoretically and empirically that certain restricted denoising problems can be solved optimally even by a single-layer transformer. We demonstrate that a trained attention layer processes each denoising prompt by performing a single gradient descent update on a context-aware DAM energy landscape, where context tokens serve as associative memories and the query token acts as an initial state. This one-step update yields better solutions than exact retrieval of either a context token or a spurious local minimum, providing a concrete example of DAM networks extending beyond the standard retrieval paradigm. Overall, this work solidifies the link between associative memory and attention mechanisms first identified by Ramsauer et al., and demonstrates the relevance of associative memory models in the study of in-context learning.
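The correspondence the abstract describes can be checked numerically: one gradient-descent step with unit step size on the modern-Hopfield energy E(q) = −(1/β)·logsumexp(β·K·q) + ½‖q‖² reproduces softmax attention over the context exactly. The sizes and β below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
beta = 2.0
K = rng.normal(size=(5, 3))    # context tokens as associative memories
q = rng.normal(size=3)         # query token as the initial state

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Attention output over the context
attn = softmax(beta * K @ q) @ K

# Gradient of E at q is q - softmax(beta*K@q) @ K; take one unit step
grad = q - softmax(beta * K @ q) @ K
one_step = q - 1.0 * grad

print(np.allclose(attn, one_step))  # True: one energy-descent step = attention
```

The identity is exact because the logsumexp gradient is precisely the softmax-weighted sum of memories, so the quadratic term cancels the current state.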