Much of the recent progress has been achieved by taking advantage of continuous relaxations of the system, which are not always available or even possible.
Policy optimization, i.e. algorithms that learn to make sequential decisions by local search on the agent's policy directly, is a widely used class of algorithms in reinforcement learning [40, 44, 45].