$\beta$-calibration of Language Model Confidence Scores for Generative QA