Debiasing Meta-Gradient Reinforcement Learning by Learning the Outer Value Function