Initialization is Critical to Whether Transformers Fit Composite Functions by Inference or Memorizing
