Vanishing Gradients in Reinforcement Finetuning of Language Models