Soft Policy Gradient Method for Maximum Entropy Deep Reinforcement Learning