Navigating Tomorrow: Reliably Assessing Large Language Models Performance on Future Event Prediction