Which Prompts Make The Difference? Data Prioritization For Efficient Human LLM Evaluation
Meriem Boubdir, Edward Kim, Beyza Ermis, Marzieh Fadaee, Sara Hooker
Large language models (LLMs) have produced notable breakthroughs in downstream performance [61; 11; 19; 62; 91; 49; 8; 78], but have also introduced new challenges in model evaluation. The success of LLMs has initiated a fundamental paradigm shift away from small specialized models designed for single tasks toward universal models expected to perform well across a wide range of tasks. This shift has also posed an existential challenge for evaluation, requiring a move away from sole reliance on task-specific automatic metrics and toward greater reliance on human evaluation. While automatic metrics offer a degree of objectivity and reproducibility, alongside the benefits of speed and cost-effectiveness, they often fall short of fully capturing the complexities and nuances of natural language [48; 68]. Moreover, automatic metrics often rely on auxiliary models, which introduce potential points of failure and unexpected challenges over time [58]. For example, reference-based metrics such as BLEU [54] and ROUGE [45] are usually poor indicators of human judgment, as they emphasize lexical overlap and struggle to account for the diverse expressions inherent in semantic representation [34; 84; 9].
arXiv.org Artificial Intelligence
Oct-22-2023
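
As a rough illustration of the lexical-overlap limitation noted in the abstract above, the sketch below is not from the paper: the `ngram_precision` helper and the example sentences are hypothetical. It computes a clipped n-gram precision, the core ingredient of BLEU-style metrics, and shows how a paraphrase that a human rater would judge equivalent to the reference still receives a low score simply because it uses different words.

```python
from collections import Counter

def ngram_precision(hypothesis: str, reference: str, n: int = 1) -> float:
    """Clipped n-gram precision: the core ingredient of BLEU-style metrics."""
    hyp_tokens = hypothesis.lower().split()
    ref_tokens = reference.lower().split()
    hyp_ngrams = Counter(tuple(hyp_tokens[i:i + n]) for i in range(len(hyp_tokens) - n + 1))
    ref_ngrams = Counter(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
    if not hyp_ngrams:
        return 0.0
    # Clip each hypothesis n-gram count by its count in the reference,
    # so repeated words cannot inflate the score.
    overlap = sum(min(count, ref_ngrams[gram]) for gram, count in hyp_ngrams.items())
    return overlap / sum(hyp_ngrams.values())

reference  = "the cat is sleeping on the sofa"
paraphrase = "a feline naps on the couch"   # same meaning, little word overlap
verbatim   = "the cat is sleeping on the sofa"

print(ngram_precision(paraphrase, reference))  # ~0.33: penalized despite equivalent meaning
print(ngram_precision(verbatim, reference))    # 1.0: rewarded for exact wording
```

Full BLEU and ROUGE add brevity penalties, higher-order n-grams, and recall terms, but they share this reliance on surface overlap, which is why semantically faithful but lexically divergent outputs can be scored poorly relative to human judgment.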