Evaluating Large Language Models with fmeval
Schwöbel, Pola; Franceschi, Luca; Zafar, Muhammad Bilal; Vasist, Keerthan; Malhotra, Aman; Shenhar, Tomer; Tailor, Pinal; Yilmaz, Pinar; Diamond, Michael; Donini, Michele
arXiv.org Artificial Intelligence
fmeval is an open-source library for evaluating large language models (LLMs) on a range of tasks. It helps practitioners evaluate their models for task performance and along multiple responsible AI dimensions. This paper presents the library and lays out its underlying design principles: simplicity, coverage, extensibility, and performance. We then describe how these principles were implemented in the scientific and engineering choices made when developing fmeval. A case study demonstrates a typical use case for the library: picking a suitable model for a question answering task. We close by discussing limitations and further work in the development of the library. fmeval can be found at https://github.com/aws/fmeval.
Jul-15-2024
- Country:
- Asia (0.93)
- Europe > Germany
- Berlin (0.14)
- North America > United States (0.68)
- Genre:
- Research Report (0.40)
- Industry:
- Government (0.46)
- Technology: