Direct Policy Gradients: Direct Optimization of Policies in Discrete Action Spaces
Neural Information Processing Systems
Many problems in machine learning reduce to learning a probability distribution (or policy) over sequences of discrete actions so as to maximize a downstream utility function. Examples include generating text sequences to maximize a task-specific metric like BLEU and generating action sequences in reinforcement learning (RL) to maximize expected return.
Aug-16-2025, 14:22:19 GMT
- Country:
  - Asia > Middle East
    - Israel (0.04)
  - Europe > Spain
    - Canary Islands (0.04)
  - North America
    - Canada (0.04)
    - United States > Maryland (0.04)
- Genre:
  - Workflow (0.66)
- Technology: