DPO: Differential reinforcement learning with application to optimal configuration search