How transformers learn structured data: insights from hierarchical filtering
