Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation