Evaluating the performance and fragility of large language models on the self-assessment for neurological surgeons