Adaptively evaluating models with task elicitation