Hedging using reinforcement learning: Contextual $k$-Armed Bandit versus $Q$-learning