Appendix
Neural Information Processing Systems
In this section, we provide additional discussion of applying decision-focused learning to MDP problems. Specifically, the smooth-policy assumption is similar in spirit to soft Q-learning [12] and soft actor-critic [13], proposed by Haarnoja et al. The randomly initialized neural network uses ReLU layers as the nonlinearity, followed by a linear layer at the end.

Training parameters. Across all three examples, we consider the discounted setting with discount factor γ = 0.95. To relax the optimal policy given by the RL solver, we relax the Bellman equation used to run value iteration by replacing all the argmax and max operators with a softmax of temperature 0.1, i.e., we use SOFTMAX(0.1 · Q-values) in place of all the argmax operators over Q-values.
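The softmax relaxation of value iteration described above can be sketched as follows. This is a minimal illustrative implementation, not the authors' code: the function names, the tabular MDP representation (transition tensor `P` and reward matrix `R`), and the choice to realize the relaxed max as the expectation of Q under the softmax policy are all assumptions; the softmax policy follows the text's SOFTMAX(0.1 · Q-values) form.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along `axis`.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def soft_value_iteration(P, R, gamma=0.95, beta=0.1, iters=500):
    """Value iteration with argmax/max relaxed via softmax.

    P: (S, A, S) transition probabilities; R: (S, A) rewards.
    beta mirrors the text's SOFTMAX(0.1 * Q-values) relaxation.
    Returns the Q-values (S, A) and the smooth policy (S, A).
    (Illustrative sketch; variable names are assumptions.)
    """
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        pi = softmax(beta * Q)        # relaxed argmax over actions
        V = (pi * Q).sum(axis=1)      # relaxed max: expectation of Q under pi
        Q = R + gamma * (P @ V)       # Bellman backup
    return Q, softmax(beta * Q)
```

Because the softmax policy is differentiable in the Q-values, the resulting (smooth) optimal policy can be differentiated through for decision-focused learning, which is the purpose of the relaxation.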