 Energy







Distributed Machine Learning with Sparse Heterogeneous Data

Neural Information Processing Systems

The growing number of data sources has led to increasingly high-dimensional applications. To be both statistically and computationally efficient in this setting, it is important to develop approaches that exploit the structure within the data.


Free Energy Mixer

arXiv.org Machine Learning

Standard attention stores keys/values losslessly but reads them via a per-head convex average, blocking channel-wise selection. We propose the Free Energy Mixer (FEM): a free-energy (log-sum-exp) read that applies a value-driven, per-channel log-linear tilt to a fast prior (e.g., from queries/keys in standard attention) over indices. Unlike methods that attempt to improve and enrich the $(q,k)$ scoring distribution, FEM treats it as a prior and yields a value-aware posterior read at unchanged complexity, smoothly moving from averaging to per-channel selection as the learnable inverse temperature increases, while still preserving parallelism and the original asymptotic complexity ($O(T^2)$ for softmax; $O(T)$ for linearizable variants). We instantiate a two-level gated FEM that is plug-and-play with standard and linear attention, linear RNNs and SSMs. It consistently outperforms strong baselines on NLP, vision, and time-series at matched parameter budgets.
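As a rough illustration of the log-sum-exp read described in the abstract, the sketch below computes, per channel $c$, the quantity $\frac{1}{\beta}\log\sum_i p_i \exp(\beta v_{i,c})$, where $p$ is a prior over indices (here a softmax over scores) and $\beta$ is an inverse temperature. This specific formula, the function names `softmax` and `free_energy_read`, and the use of a single scalar $\beta$ are assumptions made for illustration; the abstract does not spell out the exact parameterization, and the paper's gated two-level variant is not reproduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def free_energy_read(prior, values, beta):
    """Hypothetical free-energy read (an assumed form, not the paper's exact layer).

    prior:  (T,) strictly positive probabilities over indices
    values: (T, C) stored values, one row per index
    beta:   positive scalar inverse temperature
    Returns a (C,) vector: (1/beta) * log sum_i prior[i] * exp(beta * values[i, c]).
    """
    z = np.log(prior)[:, None] + beta * values   # (T, C) tilted log-weights
    m = z.max(axis=0)                            # per-channel shift for stability
    return (m + np.log(np.exp(z - m).sum(axis=0))) / beta

# Example usage with random data (T = 5 indices, C = 3 channels).
rng = np.random.default_rng(0)
prior = softmax(rng.normal(size=5))
values = rng.normal(size=(5, 3))

low = free_energy_read(prior, values, beta=1e-4)   # close to the convex average
high = free_energy_read(prior, values, beta=1e4)   # close to the per-channel max
```

Under this assumed form, a small $\beta$ recovers the usual convex average `prior @ values`, while a large $\beta$ approaches the per-channel maximum over indices with nonzero prior mass, which matches the abstract's description of moving from averaging to per-channel selection.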




Formalizing the Generalization-Forgetting Trade-Off in Continual Learning

Neural Information Processing Systems

In continual learning (CL), we incrementally adapt a model to learn tasks (defined according to the problem at hand) observed sequentially. CL has two main objectives: maintain long-term memory (remember previous tasks) and navigate new experiences continually (quickly adapt to new tasks).