Eye of Judgement: Dissecting the Evaluation of Russian-speaking LLMs with POLLUX

Martynov, Nikita, Mordasheva, Anastasia, Gorbetskiy, Dmitriy, Astafurov, Danil, Isaeva, Ulyana, Basyrova, Elina, Skachkov, Sergey, Berestova, Victoria, Ivanov, Nikolay, Zanina, Valeriia, Fenogenova, Alena

arXiv.org Artificial Intelligence 

The full statistics of all criteria, grouped by panel assignment, are presented in Table 7. Tables 8 and A.1 present the statistics of the generated scores and rationales for the criteria annotation. The distributions of criterion-based scores for most criteria are largely comparable between the expert-written and synthetic datasets, even though the underlying instruction-answer pairs are entirely distinct and non-overlapping. This is particularly evident in the mean, standard deviation, and mode of the scores, which align closely across a wide range of criterion types, suggesting that criterion-level assessment remains consistent across both data sources. Tables 8 and A.1 also show that synthetically generated texts (both instructions and rationales) are longer yet less original than those written by the experts, and that DeepSeek-R1 tends to assign the intermediate score of 1 rather than extreme values. Despite these statistical and stylistic differences in the commentary, the synthetic dataset remains a viable resource for training the LLM-as-a-Judge Family, especially given the overall similarity in criterion-based scores. Thus, while the expert-written feedback exhibits greater brevity and contextual appropriateness, the synthetic commentary maintains an adequate level of informativeness and coherence.