Bilevel reinforcement learning via the development of hyper-gradient without lower-level convexity