Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Neural Information Processing Systems 

We argue that this discrepancy primarily arises due to existing evaluation that only measures LLMs'
