Direct Policy Gradients: Direct Optimization of Policies in Discrete Action Spaces

Neural Information Processing Systems 

Many problems in machine learning reduce to learning a probability distribution (or policy) over sequences of discrete actions so as to maximize a downstream utility function. Examples include generating text sequences to maximize a task-specific metric like BLEU and generating action sequences in reinforcement learning (RL) to maximize expected return.
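To make the problem setup concrete, here is a minimal sketch of maximizing expected utility over sequences of discrete actions with a policy gradient. This uses a standard score-function (REINFORCE) estimator on a toy matching task, not the paper's direct-optimization method; the sequence length, action vocabulary, target sequence, and utility function are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

SEQ_LEN = 4            # actions per sequence (assumed toy setting)
N_ACTIONS = 3          # size of the discrete action vocabulary
TARGET = [2, 0, 1, 2]  # hypothetical target sequence defining the utility

def utility(seq):
    # Downstream utility: number of positions matching the target.
    return sum(int(a == t) for a, t in zip(seq, TARGET))

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

# Policy parameters: independent per-position logits over actions.
theta = np.zeros((SEQ_LEN, N_ACTIONS))

lr, batch = 0.5, 64
for step in range(300):
    probs = softmax(theta)
    grad = np.zeros_like(theta)
    for _ in range(batch):
        # Sample a sequence of discrete actions from the current policy.
        seq = [rng.choice(N_ACTIONS, p=probs[t]) for t in range(SEQ_LEN)]
        u = utility(seq)
        # Score-function estimator: U(seq) * grad log pi(seq | theta),
        # where grad log pi at position t is one_hot(a_t) - probs[t].
        for t, a in enumerate(seq):
            g = -probs[t].copy()
            g[a] += 1.0
            grad[t] += u * g
    theta += lr * grad / batch  # gradient ascent on expected utility

learned = [int(np.argmax(p)) for p in softmax(theta)]
print(learned)
```

After training, the policy's mode recovers the target sequence; the same estimator applies whenever the utility (e.g. BLEU, or an RL return) can only be evaluated on complete sampled sequences.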
