RL without TD learning
In this post, I'll introduce a reinforcement learning (RL) algorithm based on an "alternative" paradigm: divide and conquer. Instead of relying on temporal difference (TD) learning, it does RL by divide and conquer. There are two classes of algorithms in RL: on-policy RL and off-policy RL. On-policy RL means we can only use fresh data collected by the current policy. In other words, we have to throw away old data each time we update the policy. Algorithms like PPO and GRPO (and policy gradient methods in general) belong to this category.
Dec-23-2025, 14:00:00 GMT