Supplementary Material 7 Elements of Group and Representation Theory

Neural Information Processing Systems

In this section, we provide a brief introduction to the concepts from group theory that we need in our derivations. A group is a pair $(G, \cdot)$ containing a set $G$ and a binary operation $\cdot : G \times G \to G$, $(h, g) \mapsto h \cdot g$, which satisfies the group axioms:

Associativity: $\forall a, b, c \in G \quad a \cdot (b \cdot c) = (a \cdot b) \cdot c$
Identity: $\exists e \in G : \forall g \in G \quad g \cdot e = e \cdot g = g$
Inverse: $\forall g \in G \;\; \exists g^{-1} \in G : g \cdot g^{-1} = g^{-1} \cdot g = e$

The operation $\cdot$ is the group law of $G$. The inverse element $g^{-1}$ of an element $g$ and the identity element $e$ are unique. If the group law is also commutative, the group $G$ is an abelian group. To simplify the notation, we commonly write $ab$ instead of $a \cdot b$. It is also common to denote the group $(G, \cdot)$ just with the name of its underlying set $G$. The order of a group $G$ is the cardinality of its set and is indicated by $|G|$. A group $G$ is finite when $|G| \in \mathbb{N}$, i.e., when it has a finite number of elements. A compact group is a group that is also a compact topological space with a continuous group operation.

Given a group $G$, its action on a set $X$ is a map $G \times X \to X$, $(g, x) \mapsto g.x$. A simple example of a group action is the group law itself, $\cdot : G \times G \to G$, which defines an action of $G$ on its own elements ($X = G$). Another important action is the one defined on signals over the group $G$. Given a signal $x : G \to \mathbb{R}$, the action of an element $g \in G$ maps $x \mapsto g.x$, with $[g.x](h) := x(g^{-1}h)$.
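As a concrete illustration of the definitions above, the following minimal sketch checks the group axioms by brute force on a small finite set and applies the action on signals, $[g.x](h) = x(g^{-1}h)$. The choice of the cyclic group of integers modulo 6 under addition, and the helper names `is_group`, `add_mod`, `inverse` and `act_on_signal`, are assumptions made for this example only and do not come from the source.

```python
from itertools import product


def is_group(elements, op):
    """Brute-force check of closure and the three group axioms on a finite set."""
    elements = list(elements)
    elem_set = set(elements)
    # Closure: the operation must map G x G into G.
    if any(op(a, b) not in elem_set for a, b in product(elements, repeat=2)):
        return False
    # Associativity: a.(b.c) == (a.b).c for all a, b, c.
    if any(op(a, op(b, c)) != op(op(a, b), c)
           for a, b, c in product(elements, repeat=3)):
        return False
    # Identity: some e with g.e == e.g == g for all g.
    identities = [e for e in elements
                  if all(op(g, e) == g and op(e, g) == g for g in elements)]
    if not identities:
        return False
    e = identities[0]
    # Inverse: every g has some h with g.h == h.g == e.
    return all(any(op(g, h) == e and op(h, g) == e for h in elements)
               for g in elements)


# Illustrative group (an assumption of this sketch): integers mod N under addition.
N = 6
Z_N = range(N)


def add_mod(a, b):
    """Group law of the cyclic group Z_N."""
    return (a + b) % N


def inverse(g):
    """Inverse element in Z_N."""
    return (-g) % N


def act_on_signal(g, x):
    """Return g.x, where [g.x](h) = x(g^{-1} h); x is a list indexed by h."""
    return [x[add_mod(inverse(g), h)] for h in Z_N]


signal = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
shifted = act_on_signal(2, signal)  # cyclic shift of the signal by two positions
```

For this cyclic group the action on signals is a circular shift, and acting with $g_1$ after $g_2$ gives the same result as acting once with $g_1 g_2$, which is the compatibility one expects from a group action.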



APPENDIX A Overview of group representations

Neural Information Processing Systems

In this section we briefly introduce the representation theory of the three groups we used in this work.

Planar rotations group SO(2). The standard representation of $r_\theta \in SO(2)$ is as a $2 \times 2$ rotation matrix $\rho(r_\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$. The complex irreducible representations are often used and correspond to the circular harmonics.

Planar rotations and reflections group O(2). The standard representation of O(2) is as a $2 \times 2$ orthogonal matrix $\rho(r_\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$ and $\rho(r_\theta f) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}$. Apart from the trivial representation $\rho_{0,0}(h) = 1 \;\; \forall h \in O(2)$ and the sign-flip representation, with $\rho_{1,0}(r_\theta) = 1$ and $\rho_{1,0}(f) = -1$, all other irreps are 2-dimensional.

For SO(3), the irreducible representations $\rho_l$ are isomorphic to the Wigner D-matrices. In particular, $\rho_0$ is the trivial representation and $\rho_1$ is isomorphic to the standard representation of SO(3) as $3 \times 3$ rotation matrices. An element $g = (m, r) \in O(3)$ is a pair of a mirroring $m \in \{e, m_z\}$ and a rotation $r \in SO(3)$. In general, if $G$ is a group, we denote with $\widehat{G}$ the set of its irreducible representations.

Recall the generative process for cryo-EM images (Eq. 12), in which each image $o_i$ is generated from $g_i^{-1}$ with $g_i \in SO(3)$. Let $R_z = SO(2) < SO(3)$ be the subgroup of SO(3) containing rotations around the Z axis and $H = O(2) < SO(3)$ the subgroup containing also the rotation $r_y$ by $\pi$ around the Y axis.
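The standard representations of SO(2) and O(2) described above can be checked numerically. The sketch below is a minimal pure-Python illustration, not the authors' code: the helper names (`rho_rotation`, `rho_o2`, `matmul`, `det`, `close`, `sign_flip_irrep`) are hypothetical, and the reflection $f$ is assumed to flip the second coordinate, i.e. to be represented by $\mathrm{diag}(1, -1)$.

```python
import math


def rho_rotation(theta):
    """Standard 2x2 representation of the rotation r_theta in SO(2)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


# Assumed convention for this sketch: the reflection f flips the second coordinate.
RHO_FLIP = [[1.0, 0.0], [0.0, -1.0]]


def matmul(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def rho_o2(theta, flip):
    """Standard representation of r_theta (flip=False) or r_theta f (flip=True)."""
    r = rho_rotation(theta)
    return matmul(r, RHO_FLIP) if flip else r


def det(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def close(a, b, tol=1e-12):
    """Entry-wise comparison of two 2x2 matrices up to a tolerance."""
    return all(abs(a[i][j] - b[i][j]) <= tol
               for i in range(2) for j in range(2))


def sign_flip_irrep(flip):
    """One-dimensional sign-flip irrep: +1 on rotations, -1 on reflections."""
    return -1.0 if flip else 1.0


# Homomorphism property on rotations: rho(r_a) rho(r_b) = rho(r_{a+b}).
composition_ok = close(matmul(rho_rotation(0.3), rho_rotation(0.5)),
                       rho_rotation(0.8))
# Every reflection r_theta f squares to the identity.
reflection = rho_o2(0.7, True)
involution_ok = close(matmul(reflection, reflection), [[1.0, 0.0], [0.0, 1.0]])
```

Under this convention the sign-flip irrep coincides with the determinant of the standard representation: it is $+1$ on rotations and $-1$ on elements of the form $r_\theta f$.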








Provable Scaling Laws of Feature Emergence from Learning Dynamics of Grokking

arXiv.org Artificial Intelligence

While the phenomenon of grokking, i.e., delayed generalization, has been studied extensively, it remains an open problem whether there is a mathematical framework, closely tied to the gradient dynamics of training, that characterizes what kind of features will emerge, and how and under which conditions this happens, for complex structured inputs. We propose a novel framework, named $\mathbf{Li}_2$, that captures three key stages of the grokking behavior of 2-layer nonlinear networks: (I) lazy learning, (II) independent feature learning and (III) interactive feature learning. At the lazy learning stage, the top layer overfits to the random hidden representation and the model appears to memorize; at the same time, the backpropagated gradient $G_F$ from the top layer carries information about the target label, with a specific structure that enables each hidden node to learn its representation independently. Interestingly, the independent dynamics follows exactly the gradient ascent of an energy function $E$, and its local maxima are precisely the emerging features. We study whether these local-optima-induced features are generalizable, their representation power, and how they change with sample size, in group arithmetic tasks. When hidden nodes start to interact in the later stage of learning, we provably show how $G_F$ changes to focus on missing features that need to be learned. Our study sheds light on the roles played by key hyperparameters such as weight decay, learning rate and sample size in grokking, leads to provable scaling laws of feature emergence, memorization and generalization, and reveals why recent optimizers such as Muon can be effective, from the first principles of gradient dynamics. Our analysis can be extended to multi-layer architectures. The code is available at https://github.com/yuandong-tian/understanding/tree/main/ssl/real-dataset/cogo.