Counterfactual Reward Model Training for Bias Mitigation in Multimodal Reinforcement Learning