Benchmarking and Confidence Evaluation of LALMs for Temporal Reasoning