Optimizing LLM test-time compute involves solving a meta-RL problem
