Designing an Evaluation Framework for Large Language Models in Astronomy Research