Reliable, Reproducible, and Really Fast Leaderboards with Evalica
arXiv.org Artificial Intelligence
The rapid advancement of natural language processing (NLP) technologies, such as instruction-tuned large language models (LLMs), urges the development of modern evaluation protocols with human and machine feedback. We introduce Evalica, an open-source toolkit that facilitates the creation of reliable and reproducible model leaderboards. This paper presents its design, evaluates its performance, and demonstrates its usability through ...

Figure 1: Evalica facilitates the highlighted aspects of leaderboard-making that involve aggregation of judgements, scoring the models with bootstrapped confidence intervals (CIs), and getting the final model ranks.
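The three steps named in the Figure 1 caption (aggregating judgements, scoring models with bootstrapped confidence intervals, and ranking) can be illustrated with a minimal sketch. This is plain standard-library Python and not Evalica's API. The function names, the toy judgements, and the choice of average win rate as the scoring rule are all assumptions made for illustration. The abstract does not say which scoring methods the toolkit implements.

```python
import random
from collections import defaultdict


def win_rates(judgements):
    """Aggregate pairwise judgements into a per-model win rate.

    Each judgement is (left_model, right_model, winner), where winner is
    "left", "right", or "tie". A tie counts as half a win for both sides.
    Win rate is an illustrative scoring rule, not the toolkit's method.
    """
    wins = defaultdict(float)
    games = defaultdict(int)
    for left, right, winner in judgements:
        games[left] += 1
        games[right] += 1
        if winner == "left":
            wins[left] += 1.0
        elif winner == "right":
            wins[right] += 1.0
        else:  # tie
            wins[left] += 0.5
            wins[right] += 0.5
    return {model: wins[model] / games[model] for model in games}


def bootstrap_cis(judgements, rounds=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CIs: resample judgements with replacement and rescore."""
    rng = random.Random(seed)
    samples = defaultdict(list)
    for _ in range(rounds):
        resample = rng.choices(judgements, k=len(judgements))
        for model, score in win_rates(resample).items():
            samples[model].append(score)
    cis = {}
    for model, values in samples.items():
        values.sort()
        lower = values[int((alpha / 2) * (len(values) - 1))]
        upper = values[int((1 - alpha / 2) * (len(values) - 1))]
        cis[model] = (lower, upper)
    return cis


def dense_ranks(scores):
    """Turn scores into dense ranks; rank 1 is the best, equal scores share a rank."""
    ordered = sorted(set(scores.values()), reverse=True)
    return {model: ordered.index(score) + 1 for model, score in scores.items()}


# Toy judgements between three hypothetical models A, B, and C.
judgements = [
    ("A", "B", "left"),
    ("A", "C", "left"),
    ("B", "C", "left"),
    ("B", "A", "tie"),
    ("C", "A", "right"),
    ("C", "B", "left"),
]

scores = win_rates(judgements)
cis = bootstrap_cis(judgements)
ranks = dense_ranks(scores)

for model in sorted(scores, key=scores.get, reverse=True):
    lower, upper = cis[model]
    print(f"{ranks[model]}. {model}: {scores[model]:.3f} [{lower:.3f}, {upper:.3f}]")
```

Fixing the random seed keeps the bootstrapped intervals identical across runs, which is one simple way to make a leaderboard reproducible.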
Dec-15-2024
- Country:
- Europe > Serbia
- Central Serbia > Belgrade (0.04)
- North America > United States
- New York (0.04)
- Genre:
- Research Report > New Finding (0.47)