Learning a Diffusion Model Policy from Rewards via Q-Score Matching