Addressing Pitfalls in the Evaluation of Uncertainty Estimation Methods for Natural Language Generation