Continuous-time Risk-sensitive Reinforcement Learning via Quadratic Variation Penalty