Optimizing the non-Clifford-count in unitary synthesis using Reinforcement Learning