SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models