Eye of Judgement: Dissecting the Evaluation of Russian-speaking LLMs with POLLUX

Martynov, Nikita, Mordasheva, Anastasia, Gorbetskiy, Dmitriy, Astafurov, Danil, Isaeva, Ulyana, Basyrova, Elina, Skachkov, Sergey, Berestova, Victoria, Ivanov, Nikolay, Zanina, Valeriia, Fenogenova, Alena

arXiv.org Artificial Intelligence 

The full statistics of all criteria, grouped by panel assignment, are presented in Table 7. Tables 8 and A.1 present the statistics of the generated scores and rationales for the criteria annotation. The distributions of criterion-based scores for most criteria are largely comparable between the expert-written and synthetic datasets, even though the underlying instruction-answer pairs are entirely distinct and non-overlapping. This is particularly evident in the mean, standard deviation, and mode of the scores, which align closely across a wide range of criterion types, suggesting that criterion-level assessment remains consistent across both data sources. Tables 8 and A.1 also show that synthetically generated texts (both instructions and rationales) are longer yet less original than those written by the experts, and that DeepSeek-R1 tends to assign the intermediate score of 1 rather than extreme values. Despite these statistical and stylistic differences in the commentary, the synthetic dataset remains a viable resource for training the LLM-as-a-Judge Family, especially given the overall similarity in criterion-based scores. Thus, while the expert-written feedback exhibits greater brevity and contextual appropriateness, the synthetic commentary maintains an adequate level of informativeness and coherence.