Structured Prompting Enables More Robust Evaluation of Language Models