References
[1] Qiskit: An open-source framework for quantum computing, 2019.

Neural Information Processing Systems 

If the threshold ξ is never reached during an entire episode of placing L gates, a reward of −5 is issued. The extreme reward values ±5 are crucial for the performance of the agent. Given this figure of merit, a circuit with a smaller number of gates yields a higher discounted sum of rewards. This could be achieved, e.g., by using automated postprocessing methods to optimize the circuits (e.g., the Qiskit Terra transpiler [1]). For instance, in all cases we analyzed, the vast majority of rotation gates used by the agent are RY gates.
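The link between gate count and the discounted sum of rewards can be illustrated with a minimal sketch. It is not the training setup used here: the discount factor `gamma = 0.9`, the zero intermediate rewards, and the two episode lengths are assumptions chosen purely for illustration. Only the terminal reward of 5 follows the text.

```python
def discounted_return(rewards, gamma=0.9):
    """Return sum_t gamma**t * r_t for one episode.

    gamma = 0.9 is an assumed discount factor, not a value from the text.
    """
    return sum(gamma**t * r for t, r in enumerate(rewards))


# Hypothetical episodes: intermediate rewards are set to 0 for simplicity,
# and the terminal reward of 5 is issued when the threshold is reached.
short_episode = [0.0] * 2 + [5.0]  # threshold reached after 3 gates
long_episode = [0.0] * 9 + [5.0]   # threshold reached after 10 gates

g_short = discounted_return(short_episode)
g_long = discounted_return(long_episode)

print(f"3 gates:  {g_short:.3f}")
print(f"10 gates: {g_long:.3f}")
```

Under these assumptions the shorter episode collects the same terminal reward with less discounting, so its return is larger, which is the sense in which fewer gates are favored.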