How to Train your Antivirus: RL-based Hardening through the Problem-Space