Linear attention is (maybe) all you need (to understand transformer optimization)