Learning from an Exploring Demonstrator: Optimal Reward Estimation for Bandits