Uncertainty-Penalized Reinforcement Learning from Human Feedback with Diverse Reward LoRA Ensembles