On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning