How Does Beam Search Improve Span-Level Confidence Estimation in Generative Sequence Labeling?