Optimizing LLM test-time compute involves solving a meta-RL problem