Intentionally-underestimated Value Function at Terminal State for Temporal-difference Learning with Mis-designed Reward