log 1
The Implicit Bias of Adam and Muon on Smooth Homogeneous Neural Networks
We study the implicit bias of momentum-based optimizers on homogeneous models. We first extend existing results on the implicit bias of steepest descent in homogeneous models to normalized steepest descent with an optional learning rate schedule. We then show that for smooth homogeneous models, momentum steepest descent algorithms like Muon (spectral norm), MomentumGD ($\ell_2$ norm), and Signum ($\ell_\infty$ norm) are approximate steepest descent trajectories under a decaying learning rate schedule, proving that these algorithms too have a bias towards KKT points of the corresponding margin maximization problem. We extend the analysis to Adam (without the stability constant), which maximizes the $\ell_\infty$ margin, and to Muon-Signum and Muon-Adam, which maximize a hybrid norm. Our experiments corroborate the theory and show that the identity of the margin maximized depends on the choice of optimizer. Overall, our results extend earlier lines of work on steepest descent in homogeneous models and momentum-based optimizers in linear models.
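To make the notion of norm-dependent steepest descent concrete, here is a minimal illustrative sketch (not the paper's experiments or algorithms): on a toy separable linear problem with exponential loss, the normalized steepest descent direction w.r.t. the $\ell_2$ norm is $-g/\|g\|_2$ (normalized GD), while w.r.t. the $\ell_\infty$ norm it is $-\mathrm{sign}(g)$ (sign descent, i.e. Signum without momentum). Each trajectory is biased toward a large margin normalized by its own norm. All names and hyperparameters below are assumptions for illustration.

```python
import numpy as np

def loss_grad(w, X, y):
    """Gradient of the mean exponential loss exp(-y * Xw) for a linear model."""
    margins = y * (X @ w)
    coeff = -y * np.exp(-margins)
    return (X * coeff[:, None]).mean(axis=0)

# Toy separable data (hypothetical setup, not from the paper).
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 8))
w_star = rng.normal(size=8)
y = np.sign(X @ w_star)

w_l2 = np.zeros(8)
w_linf = np.zeros(8)
lr = 0.1
for _ in range(200):
    g2 = loss_grad(w_l2, X, y)
    w_l2 -= lr * g2 / (np.linalg.norm(g2) + 1e-12)  # l2 steepest descent
    gi = loss_grad(w_linf, X, y)
    w_linf -= lr * np.sign(gi)                      # l_infty steepest descent (sign descent)

# Margins normalized by the norm each method descends in:
# min_i y_i x_i . w / ||w||_2 for l2, and / ||w||_inf for l_infty.
m_l2 = (y * (X @ w_l2)).min() / np.linalg.norm(w_l2)
m_linf = (y * (X @ w_linf)).min() / np.linalg.norm(w_linf, np.inf)
print(m_l2, m_linf)
```

On separable data both normalized margins become positive; which margin ends up larger depends on the geometry the optimizer descends in, matching the abstract's point that the identity of the maximized margin depends on the optimizer.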
- Asia > Middle East > Jordan (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- Asia > Middle East > Israel (0.04)
- Asia > Middle East > Jordan (0.05)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- North America > United States > Georgia > Fulton County > Atlanta (0.05)
- North America > United States > District of Columbia > Washington (0.05)
- Asia > Middle East > Israel > Haifa District > Haifa (0.05)
- Asia > China (0.04)
- Europe > Slovakia > Bratislava > Bratislava (0.04)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.67)
Entropy testing and its application to testing Bayesian networks
This paper studies the problem of entropy identity testing: given sample access to a distribution p and a fully described distribution q (both discrete distributions over a domain of size k), and the promise that either p = q or |H(p) − H(q)| ≥ ε, where H(·) denotes the Shannon entropy, a tester needs to distinguish between the two cases with high probability.
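The promise problem above can be sketched with a simple plug-in tester: estimate H(p) from samples via the empirical distribution and compare it to the exactly computable H(q). This is an illustrative sketch only — the function names and the acceptance threshold of ε/2 are assumptions, not the paper's algorithm (whose point is to do this sample-efficiently).

```python
import numpy as np

def shannon_entropy(probs):
    """Shannon entropy in bits, ignoring zero-probability outcomes."""
    probs = np.asarray(probs, dtype=float)
    nz = probs[probs > 0]
    return float(-(nz * np.log2(nz)).sum())

def entropy_identity_test(samples, q, eps, k):
    """Accept 'p = q' iff the plug-in entropy estimate is within eps/2 of H(q).
    Under the promise |H(p) - H(q)| >= eps, a larger gap indicates p != q."""
    counts = np.bincount(samples, minlength=k)
    p_hat = counts / counts.sum()
    return abs(shannon_entropy(p_hat) - shannon_entropy(q)) < eps / 2

# Hypothetical demo: q uniform over k = 8 outcomes, so H(q) = 3 bits.
rng = np.random.default_rng(1)
k = 8
q = np.full(k, 1.0 / k)
same = rng.integers(0, k, size=20000)   # samples from p = q
point = np.zeros(20000, dtype=int)      # p is a point mass: H(p) = 0
print(entropy_identity_test(same, q, eps=1.0, k=k))   # expect True
print(entropy_identity_test(point, q, eps=1.0, k=k))  # expect False
```

The naive plug-in estimator needs many samples to be reliable for large k; the interesting question the paper addresses is the optimal sample complexity of this distinguishing task.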
- Asia > Middle East > Jordan (0.14)
- North America > United States > California > Alameda County > Berkeley (0.04)
- Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
- Information Technology > Security & Privacy (1.00)
- Information Technology > Communications > Social Media (0.68)
- Information Technology > Data Science > Data Mining (0.67)
- North America > United States > Massachusetts (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > Colorado > Denver County > Denver (0.04)
- North America > Canada > Ontario > Hamilton (0.04)
- Asia > Middle East > Jordan (0.04)