Online learning in episodic Markovian decision processes by relative entropy policy search