On Polynomial Time PAC Reinforcement Learning with Rich Observations