Transformers Provably Learn Directed Acyclic Graphs via Kernel-Guided Mutual Information
