Reinforcement Learning
Appendix
In this section, we provide additional discussion of applying decision-focused learning to MDP problems. Specifically, the assumption of a smooth policy is similar to the idea of soft Q-learning [12] and soft actor-critic [13] proposed by Haarnoja et al. The randomly initialized neural network uses ReLU layers as nonlinearities, followed by a linear layer at the end.

Training parameters. Across all three examples, we consider the discounted setting where the discount factor is γ = 0.95. To relax the optimal policy given by the RL solver, we relax the Bellman equation used to run value iteration by replacing all the argmax and max operators in the Bellman equation with a softmax with temperature 0.1, i.e., we use SOFTMAX(0.1 · Q-values) to replace all the argmax over Q-values.
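The relaxed value iteration described above can be illustrated with a minimal sketch. It is not the implementation used in the experiments: the function name `soft_value_iteration`, the tabular layout of the transition tensor `P` and reward matrix `R`, and the iteration count are hypothetical. In particular, the sketch assumes that the relaxed max is the expectation of the Q-values under the softmax policy, which is one common reading of replacing max with softmax. Only γ = 0.95 and the factor 0.1 inside the softmax are taken from the text.

```python
import numpy as np

GAMMA = 0.95  # discount factor stated in the text
BETA = 0.1    # factor inside SOFTMAX(0.1 * Q-values), as stated in the text


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def soft_value_iteration(P, R, gamma=GAMMA, beta=BETA, n_iters=500):
    """Value iteration with argmax/max relaxed to a softmax.

    P: (S, A, S) array of transition probabilities (hypothetical layout).
    R: (S, A) array of rewards (hypothetical layout).
    Returns the relaxed policy (S, A), Q-values (S, A), and values (S,).
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)
    for _ in range(n_iters):
        Q = R + gamma * (P @ V)          # one-step Bellman backup, shape (S, A)
        pi = softmax(beta * Q, axis=1)   # smooth policy replacing the argmax
        V = (pi * Q).sum(axis=1)         # assumption: softmax-weighted average replaces the max
    Q = R + gamma * (P @ V)
    pi = softmax(beta * Q, axis=1)
    return pi, Q, V
```

Because every step is a composition of differentiable operations, the resulting policy varies smoothly with the rewards and transition probabilities, which is the property the smooth-policy assumption relies on. The same loop could be written in an automatic differentiation framework if gradients through the solver are needed.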