Which Prompts Make The Difference? Data Prioritization For Efficient Human LLM Evaluation
Meriem Boubdir, Edward Kim, Beyza Ermis, Marzieh Fadaee, Sara Hooker
Large language models (LLMs) have produced notable breakthroughs in downstream performance [61; 11; 19; 62; 91; 49; 8; 78], but have also introduced new challenges in model evaluation. The success of LLMs has initiated a fundamental paradigm shift away from small specialized models designed for single tasks toward universal models expected to perform well across a wide range of tasks. This shift has also posed an existential challenge for evaluation, requiring a move away from sole reliance on task-specific automatic metrics and toward greater reliance on human evaluation. While automatic metrics offer a degree of objectivity and reproducibility, alongside the benefits of speed and cost-effectiveness, they often fall short of fully capturing the complexities and nuances of natural language [48; 68]. Moreover, automatic metrics often rely on auxiliary models, which introduce potential points of failure and unexpected challenges over time [58]. For example, reference-based metrics such as BLEU [54] and ROUGE [45] are usually poor indicators of human judgment, as they emphasize lexical overlap and struggle to account for the diverse expressions inherent in semantic representation [34; 84; 9].
arXiv.org Artificial Intelligence
Oct-22-2023
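
As a rough illustration of the lexical-overlap limitation noted in the abstract above, the sketch below is not from the paper: the `ngram_precision` helper and the example sentences are hypothetical. It computes a clipped n-gram precision, the core ingredient of BLEU-style metrics, and shows how a paraphrase that a human rater would judge equivalent to the reference still receives a low score simply because it uses different words.

```python
from collections import Counter

def ngram_precision(hypothesis: str, reference: str, n: int = 1) -> float:
    """Clipped n-gram precision: the core ingredient of BLEU-style metrics."""
    hyp_tokens = hypothesis.lower().split()
    ref_tokens = reference.lower().split()
    hyp_ngrams = Counter(tuple(hyp_tokens[i:i + n]) for i in range(len(hyp_tokens) - n + 1))
    ref_ngrams = Counter(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
    if not hyp_ngrams:
        return 0.0
    # Clip each hypothesis n-gram count by its count in the reference,
    # so repeated words cannot inflate the score.
    overlap = sum(min(count, ref_ngrams[gram]) for gram, count in hyp_ngrams.items())
    return overlap / sum(hyp_ngrams.values())

reference  = "the cat is sleeping on the sofa"
paraphrase = "a feline naps on the couch"   # same meaning, little word overlap
verbatim   = "the cat is sleeping on the sofa"

print(ngram_precision(paraphrase, reference))  # ~0.33: penalized despite equivalent meaning
print(ngram_precision(verbatim, reference))    # 1.0: rewarded for exact wording
```

Full BLEU and ROUGE add brevity penalties, higher-order n-grams, and recall terms, but they share this reliance on surface overlap, which is why semantically faithful but lexically divergent outputs can be scored poorly relative to human judgment.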