Multi-Layer Transformers Gradient Can be Approximated in Almost Linear Time
