K-FAC
Kronecker-Factored Approximate Curvature for Modern Neural Network Architectures
The core components of many modern neural network architectures, such as transformers, convolutional networks, and graph neural networks, can be expressed as linear layers with *weight-sharing*. Kronecker-Factored Approximate Curvature (K-FAC), a second-order optimisation method, has shown promise to speed up neural network training and thereby reduce computational costs. However, there is currently no framework to apply it to generic architectures, specifically ones with linear weight-sharing layers. In this work, we identify two different settings of linear weight-sharing layers which motivate two flavours of K-FAC -- *expand* and *reduce*. We show that they are exact for deep linear networks with weight-sharing in their respective setting. Notably, K-FAC-reduce is generally faster than K-FAC-expand, which we leverage to speed up automatic hyperparameter selection via optimising the marginal likelihood for a Wide ResNet. Finally, we observe little difference between these two K-FAC variations when using them to train both a graph neural network and a vision transformer. However, both variations are able to reach a fixed validation metric target in $50$-$75$\% of the number of steps of a first-order reference run, which translates into a comparable improvement in wall-clock time. This highlights the potential of applying K-FAC to modern neural network architectures.
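To make the distinction concrete, here is a minimal numpy sketch (not the paper's implementation) of how the two flavours could build the Kronecker factors for a linear layer whose input carries a weight-sharing dimension $R$ (e.g. sequence positions); the aggregation choices below (mean over inputs, sum over output gradients) are common conventions assumed for illustration:

```python
import numpy as np

N, R, D_in, D_out = 32, 10, 64, 64       # batch, shared dim (e.g. sequence), features
a = np.random.randn(N, R, D_in)          # layer inputs (activations) -- placeholder data
g = np.random.randn(N, R, D_out)         # layer output gradients     -- placeholder data

# K-FAC-expand: fold the weight-sharing dimension R into the batch,
# i.e. treat every shared position as an independent data point.
a_e = a.reshape(N * R, D_in)
g_e = g.reshape(N * R, D_out)
A_expand = a_e.T @ a_e / (N * R)         # input factor
B_expand = g_e.T @ g_e / (N * R)         # gradient factor

# K-FAC-reduce: aggregate over the shared dimension first, then take
# outer products per example (one illustrative aggregation choice).
a_r = a.mean(axis=1)                     # (N, D_in)
g_r = g.sum(axis=1)                      # shared-weight gradients sum over positions
A_reduce = a_r.T @ a_r / N
B_reduce = g_r.T @ g_r / N

# Either pair approximates the layer's Fisher block as kron(A, B);
# reduce touches R-times fewer rows per example, hence the speed advantage.
```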
From Memorization to Reasoning in the Spectrum of Loss Curvature
Merullo, Jack, Vatsavaya, Srihita, Bushnaq, Lucius, Lewis, Owen
We characterize how memorization is represented in transformer models and show that it can be disentangled in the weights of both language models (LMs) and vision transformers (ViTs) using a decomposition based on the loss landscape curvature. This insight is based on prior theoretical and empirical work showing that the curvature for memorized training points is much sharper than for non-memorized ones, meaning that ordering weight components from high to low curvature can reveal a distinction without explicit labels. This motivates a weight-editing procedure that suppresses recitation of untargeted memorized data far more effectively than a recent unlearning method (BalancedSubnet), while maintaining lower perplexity. Since the curvature basis has a natural interpretation in terms of shared structure in model weights, we extensively analyze the effect of the editing procedure on downstream tasks in LMs, and find that fact retrieval and arithmetic are specifically and consistently negatively affected, even though open-book fact retrieval and general logical reasoning are preserved. We posit these tasks rely heavily on specialized directions in weight space rather than general-purpose mechanisms, regardless of whether those individual datapoints are memorized. We support this by showing a correspondence between the strength with which task data activates the low-curvature components we edit out and the drop in task performance after the edit. Our work enhances the understanding of memorization in neural networks with practical applications towards removing it, and provides evidence for idiosyncratic, narrowly-used structures involved in solving tasks like math and fact retrieval.
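As a hedged sketch of the underlying mechanics (illustrative only; the paper's actual decomposition may differ), one can express a layer's weights in the eigenbasis of a Kronecker-factored curvature approximation, rank components by curvature, and zero out the low-curvature ones:

```python
import numpy as np

def curvature_edit(W, A, B, keep_fraction=0.9):
    """Project a weight matrix onto its high-curvature components.

    W: (d_out, d_in) weight matrix.
    A: (d_in, d_in)  input-covariance Kronecker factor (curvature proxy).
    B: (d_out, d_out) gradient-covariance Kronecker factor.
    Keeps the top `keep_fraction` of components by curvature and zeroes
    the rest. Illustrative sketch only -- not the paper's exact method.
    """
    eva_A, evec_A = np.linalg.eigh(A)    # curvature along input directions
    eva_B, evec_B = np.linalg.eigh(B)    # curvature along output directions

    # Coefficients of W in the Kronecker eigenbasis.
    C = evec_B.T @ W @ evec_A            # (d_out, d_in)
    curv = np.outer(eva_B, eva_A)        # per-component curvature ~ lambda_B * lambda_A

    # Zero out the lowest-curvature coefficients.
    k = int(C.size * keep_fraction)
    thresh = np.sort(curv.ravel())[::-1][k - 1]
    C_edit = np.where(curv >= thresh, C, 0.0)

    return evec_B @ C_edit @ evec_A.T    # back to the original basis
```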
We thank the reviewers for their detailed reviews and constructive feedback; we address the main concerns below (due to space constraints, we focus on the main concerns). It is not known how tight any of these bounds are; we will clarify this point in the final version. (Rebuttal figure: red lines are GD, blue lines are NGD (Hessian-free); solid lines are training curves, dashed lines are testing curves.)
We would like to thank the reviewers for their comments, and take the opportunity to answer their questions below.
- We thank the reviewer for the relevant [Amari et al., 2000] reference, which we will cite and discuss. Similarly, [Amari et al., 2000] considers single-layer networks …
- Further, we examined the method's accuracy relative to recent techniques, and extended it to …
- We are open to changing the term "WoodFisher", which we used as a mnemonic …
- Please see Appendix S5 for ablation studies. For simplicity, we consider the scaling constant as 1 here.
- Thanks for the suggestions; we will correct the font sizes and the broken references.
Review for NeurIPS paper: WoodFisher: Efficient Second-Order Approximation for Neural Network Compression
Weaknesses:
--- Missing details about lambda ---
While mentioned on line 138, the damping parameter lambda does not appear in the experimental section of the main body, and I only found a value of 1e-5 in the appendix (l799). How do you select its value? I expect your final algorithm to be very sensitive to lambda, since \delta_L as defined in eq. 4 will select directions with the smallest curvature. Another comment about lambda: if you set it to a very large value k, it becomes dominant compared to the eigenvalues of F, and your technique basically amounts to magnitude pruning. In that regard, MP is just a special case of your technique when using a large damping value.
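The reviewer's limiting argument is easy to verify numerically: with the OBS-style statistic $\delta_L = w_i^2 / (2\,[(F + \lambda I)^{-1}]_{ii})$, taking $\lambda$ far above the eigenvalues of $F$ gives $(F + \lambda I)^{-1} \approx \lambda^{-1} I$, so the pruning order collapses to $|w_i|$. A small self-contained check with a random stand-in Fisher:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
G = rng.standard_normal((200, d))
F = G.T @ G / 200                        # stand-in empirical Fisher (not real data)
w = rng.standard_normal(d)

def obs_ranking(lam):
    Finv_diag = np.diag(np.linalg.inv(F + lam * np.eye(d)))
    saliency = w**2 / (2 * Finv_diag)    # OBS pruning statistic
    return np.argsort(saliency)

magnitude_ranking = np.argsort(np.abs(w))
for lam in [1e-5, 1e-1, 1e1, 1e4]:
    agree = np.mean(obs_ranking(lam) == magnitude_ranking)
    print(f"lambda={lam:g}: rank agreement with magnitude pruning = {agree:.2f}")
# As lambda grows, (F + lambda*I)^{-1} -> I/lambda and the saliency
# ordering converges to plain magnitude pruning.
```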
Scalable Thermodynamic Second-order Optimization
Donatella, Kaelan, Duffield, Samuel, Melanson, Denis, Aifer, Maxwell, Klett, Phoebe, Salegame, Rajath, Belateche, Zach, Crooks, Gavin, Martinez, Antonio J., Coles, Patrick J.
Many hardware proposals have aimed to accelerate inference in AI workloads. Less attention has been paid to hardware acceleration of training, despite the enormous societal impact of rapid training of AI models. Physics-based computers, such as thermodynamic computers, offer an efficient means to solve key primitives in AI training algorithms. Optimizers that normally would be computationally out-of-reach (e.g., due to expensive matrix inversions) on digital hardware could be unlocked with physics-based hardware. In this work, we propose a scalable algorithm for employing thermodynamic computers to accelerate a popular second-order optimizer called Kronecker-factored approximate curvature (K-FAC). Our asymptotic complexity analysis predicts increasing advantage with our algorithm as $n$, the number of neurons per layer, increases. Numerical experiments show that even under significant quantization noise, the benefits of second-order optimization can be preserved. Finally, we predict substantial speedups for large-scale vision and graph problems based on realistic hardware characteristics.
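For reference, the digital primitive such hardware would replace is the damped Kronecker-factored solve in each K-FAC layer update; a hedged numpy sketch of that primitive (conventions assumed for illustration, not taken from the paper):

```python
import numpy as np

def kfac_precondition(grad_W, A, B, damping=1e-3):
    """Apply the Kronecker-factored inverse to a layer gradient.

    Solves against (A + damping*I) and (B + damping*I) instead of forming
    explicit inverses. On digital hardware this costs O(n^3) per layer;
    it is this solve that a thermodynamic computer would replace with a
    physical relaxation. Conventions here are illustrative.
    """
    d_in, d_out = A.shape[0], B.shape[0]
    A_d = A + damping * np.eye(d_in)
    B_d = B + damping * np.eye(d_out)
    # (A (x) B)^{-1} vec(grad)  ==  vec(B^{-1} grad A^{-1})
    return np.linalg.solve(B_d, np.linalg.solve(A_d.T, grad_W.T).T)
```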
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > Canada > Ontario > Toronto (0.04)
Reviews: Fast Convergence of Natural Gradient Descent for Over-Parameterized Neural Networks
After rebuttal: I have carefully read the comments from the other reviewers and the feedback from the authors. My main concern was the generalization ability of NGD, but the experiments in the feedback are a bit confusing to me, because GD doesn't seem to achieve zero training loss while NGD converges to 0 very quickly in MNIST regression. I would suggest the authors provide more details about that experiment's setting, e.g., how the hyperparameters were selected. Thus, I would like to keep my score unchanged. The proof framework follows the recent line of work on over-parametrization (e.g., the papers by Du et al., Li and Liang, and Allen-Zhu et al.), the core of which is the Gram matrix.
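For context on the proof technique the reviewer mentions: these analyses track the Gram (NTK-style) matrix of per-example gradients, whose smallest eigenvalue governs the convergence rate. A toy sketch in the style of this line of work (a two-layer ReLU network with fixed output weights; all details here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 20, 5, 1000                    # samples, input dim, hidden width
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
W = rng.standard_normal((m, d))          # hidden weights (the trained parameters)
a = rng.choice([-1.0, 1.0], size=m)      # fixed output weights

# Per-example gradient of f(x) = a^T relu(Wx) / sqrt(m) w.r.t. row W_r is
# a_r * 1{W_r x > 0} * x / sqrt(m); Gram entries contract both factors.
pre = X @ W.T                            # (n, m) pre-activations
J = (pre > 0) * a / np.sqrt(m)           # (n, m) per-row scalings
G = (J @ J.T) * (X @ X.T)                # (n, n) Gram / NTK matrix

print("smallest Gram eigenvalue:", np.linalg.eigvalsh(G).min())
# The convergence analyses require this eigenvalue to stay bounded
# below throughout training.
```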
Influence Functions for Scalable Data Attribution in Diffusion Models
Mlodozeniec, Bruno, Eschenhagen, Runa, Bae, Juhan, Immer, Alexander, Krueger, David, Turner, Richard
Diffusion models have led to significant advancements in generative modelling. Yet their widespread adoption poses challenges regarding data attribution and interpretability. In this paper, we aim to help address such challenges in diffusion models by developing an influence functions framework. Influence function-based data attribution methods approximate how a model's output would have changed if some training data were removed. In supervised learning, this is usually used for predicting how the loss on a particular example would change. For diffusion models, we focus on predicting the change in the probability of generating a particular example via several proxy measurements. We show how to formulate influence functions for such quantities and how previously proposed methods can be interpreted as particular design choices in our framework. To ensure scalability of the Hessian computations in influence functions, we systematically develop K-FAC approximations based on generalised Gauss-Newton matrices specifically tailored to diffusion models. We show that our recommended method outperforms previous data attribution approaches on common evaluations, such as the Linear Data-modelling Score (LDS) or retraining without top influences, without the need for method-specific hyperparameter tuning.
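Schematically (an illustrative sketch under assumed conventions, not the paper's implementation), an influence-function estimate pairs a query-measurement gradient with a K-FAC-approximated, damped GGN inverse applied to a training-example gradient:

```python
import numpy as np

def kfac_influence(grad_query, grad_train, A, B, damping=1e-3):
    """Influence of one training example on a query measurement.

    influence ~= - grad_query^T H^{-1} grad_train, with the layer's GGN
    approximated as kron(A, B) and damping applied factor-wise for
    simplicity (one of several possible choices). Gradients are
    layer-shaped, (d_out, d_in). Illustrative conventions only.
    """
    d_in, d_out = A.shape[0], B.shape[0]
    A_d = A + damping * np.eye(d_in)
    B_d = B + damping * np.eye(d_out)
    # H^{-1} grad_train == B^{-1} grad_train A^{-1} under the Kronecker model.
    precond = np.linalg.solve(B_d, np.linalg.solve(A_d.T, grad_train.T).T)
    return -float(np.sum(grad_query * precond))
```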