Evaluating Large Language Models for Evidence-Based Clinical Question Answering