Walking a Tightrope -- Evaluating Large Language Models in High-Risk Domains