Why can neural language models solve next-word prediction? A mathematical perspective