AITopics | evaluation criteria

evaluation criteria

Knowledge Editing Benchmark

Neural Information Processing Systems | Jun-18-2026, 05:26:59 GMT

Model editing aims to efficiently revise incorrect or outdated knowledge within LLMs without incurring the high cost of full retraining and risking catastrophic forgetting. Currently, most LLM editing datasets are confined to narrow knowledge domains and cover a limited range of editing evaluation. They often overlook the broad scope of editing demands and the diversity of ripple effects resulting from edits. In this context, we introduce UNIEDIT, a unified benchmark for LLM editing grounded in open-domain knowledge. First, we construct editing samples by selecting entities from 25 common domains across five major categories, utilizing the extensive triple knowledge available in open-domain knowledge graphs to ensure comprehensive coverage of the knowledge domains. To address the issues of generality and locality in editing, we design a Neighborhood Multi-hop Chain Sampling (NMCS) algorithm to sample subgraphs based on a given knowledge piece to entail comprehensive ripple effects to evaluate. Finally, we employ proprietary LLMs to convert the sampled knowledge subgraphs into natural language text, guaranteeing grammatical accuracy and syntactical diversity. Extensive statistical analysis confirms the scale, comprehensiveness, and diversity of our UNIEDIT benchmark. We conduct comprehensive experiments across multiple LLMs and editors, analyzing their performance to highlight strengths and weaknesses in editing across open knowledge domains and various evaluation criteria, thereby offering valuable insights for future research endeavors.

large language model, machine learning, natural language, (18 more...)

Country:

North America > United States (1.00)
Asia > China (0.68)
Africa (0.67)
Asia > Middle East > UAE (0.28)

Genre: Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (0.67)
Health & Medicine > Therapeutic Area > Immunology (0.46)
Education > Curriculum (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Bridging Human and LLM Judgments: Understanding and Narrowing the Gap

Neural Information Processing Systems | Jun-15-2026, 02:32:37 GMT

Large language models are increasingly used as judges (LLM-as-a-judge) to evaluate model outputs at scale, but their assessments often diverge systematically from human judgments.

density 0, large language model, machine learning, (19 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.30)

86bcae6da75c72e32f30a5553f094c06-Paper-Conference.pdf

Neural Information Processing Systems | Feb-15-2026, 16:19:02 GMT

data mining, dirichlet abstraction, machine learning, (18 more...)

Country:

North America > Canada > Ontario > Toronto (0.14)
Asia > Singapore (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.68)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine (0.93)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
(5 more...)

AI might not be coming for lawyers' jobs anytime soon

MIT Technology Review | Dec-15-2025, 10:00:00 GMT

Generative AI might have aced the bar exam, but an LLM still can't think like a lawyer. When the generative AI boom took off in 2022, Rudi Miller and her law school classmates were suddenly gripped with anxiety. "Before graduating, there was discussion about what the job market would look like for us if AI became adopted," she recalls. So when it came time to choose a speciality, Miller, now a junior associate at the law firm Orrick, decided to become a litigator, the kind of lawyer who represents clients in court. She hoped the courtroom would be the last human stage. "Judges haven't allowed ChatGPT-enabled robots to argue in court yet," she says.

junior associate, law firm, lawyer, (15 more...)

Country:

North America > United States > Pennsylvania (0.04)
North America > United States > Massachusetts (0.04)

Genre: Research Report (0.69)

Industry:

Law (1.00)
Education > Educational Setting > Higher Education (0.70)
Education > Curriculum > Subject-Specific Education (0.70)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.56)

Towards Personalized Deep Research: Benchmarks and Evaluations

Liang, Yuan, Li, Jiaxian, Wang, Yuqing, Wang, Piaohong, Tian, Motong, Liu, Pai, Qiao, Shuofei, Fang, Runnan, Zhu, He, Zhang, Ge, Liu, Minghao, Jiang, Yuchen Eleanor, Zhang, Ningyu, Zhou, Wangchunshu

arXiv.org Artificial Intelligence | Dec-12-2025

Deep Research Agents (DRAs) can autonomously conduct complex investigations and generate comprehensive reports, demonstrating strong real-world potential. However, existing benchmarks primarily evaluate DRAs on generic quality metrics and overlook personalization, a critical dimension for individual users. Moreover, existing evaluations mostly rely on close-ended benchmarks, while open-ended deep research benchmarks remain scarce and typically neglect personalized scenarios. To bridge this gap, we introduce Personalized Deep Research Bench (PDR-Bench), the first benchmark for evaluating personalization in DRAs. It pairs 50 diverse research tasks across 10 domains with 25 authentic user profiles that combine structured persona attributes with dynamic real-world contexts, yielding 250 realistic user-task queries. To assess system performance, we propose the PQR Evaluation Framework, which jointly measures Personalization Alignment, Content Quality, and Factual Reliability. Our experiments on a range of systems highlight current capabilities and limitations in handling personalized deep research. This work establishes a rigorous foundation for developing and evaluating the next generation of truly personalized AI research assistants.

criterion, large language model, machine learning, (20 more...)

arXiv:2509.25106

Country: Asia (0.28)

Genre: Research Report (1.00)

Industry:

Health & Medicine (0.67)
Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.96)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.67)

Automatic Essay Scoring and Feedback Generation in Basque Language Learning

Azurmendi, Ekhi, Arregi, Xabier, de Lacalle, Oier Lopez

arXiv.org Artificial Intelligence | Dec-10-2025

This paper introduces the first publicly available dataset for Automatic Essay Scoring (AES) and feedback generation in Basque, targeting the CEFR C1 proficiency level. The dataset comprises 3,200 essays from HABE, each annotated by expert evaluators with criterion-specific scores covering correctness, richness, coherence, cohesion, and task alignment, enriched with detailed feedback and error examples. We fine-tune open-source models, including RoBERTa-EusCrawl and Latxa 8B/70B, for both scoring and explanation generation. Our experiments show that encoder models remain highly reliable for AES, while supervised fine-tuning (SFT) of Latxa significantly enhances performance, surpassing state-of-the-art (SoTA) closed-source systems such as GPT-5 and Claude Sonnet 4.5 in scoring consistency and feedback quality. We also propose a novel evaluation methodology for assessing feedback generation, combining automatic consistency metrics with expert-based validation of extracted learner errors. Results demonstrate that the fine-tuned Latxa model produces criterion-aligned, pedagogically meaningful feedback and identifies a wider range of error types than proprietary models. This resource and benchmark establish a foundation for transparent, reproducible, and educationally grounded NLP research in low-resource languages such as Basque.

large language model, machine learning, natural language, (18 more...)

arXiv:2512.08713

Country:

Europe (0.46)
North America > United States (0.14)

Genre: Research Report > New Finding (1.00)

Industry:

Education > Assessment & Standards > Student Performance (0.73)
Education > Curriculum > Subject-Specific Education (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

AgentEval: Generative Agents as Reliable Proxies for Human Evaluation of AI-Generated Content

Vu, Thanh, Nayak, Richi, Balasubramaniam, Thiru

arXiv.org Artificial Intelligence | Dec-10-2025

Modern businesses are increasingly challenged by the time and expense required to generate and assess high-quality content. Human writers face time constraints, and extrinsic evaluations can be costly. While Large Language Models (LLMs) offer potential in content creation, concerns about the quality of AI-generated content persist. Traditional evaluation methods, like human surveys, further add operational costs, highlighting the need for efficient, automated solutions. This research introduces Generative Agents as a means to tackle these challenges. These agents can rapidly and cost-effectively evaluate AI-generated content, simulating human judgment by rating aspects such as coherence, interestingness, clarity, fairness, and relevance. By incorporating these agents, businesses can streamline content generation and ensure consistent, high-quality output while minimizing reliance on costly human evaluations. The study provides critical insights into enhancing LLMs for producing business-aligned, high-quality content, offering significant advancements in automated content generation and evaluation.

large language model, machine learning, natural language, (20 more...)

arXiv:2512.08273

Country: Oceania > Australia (0.28)

Genre:

Research Report > New Finding (0.68)
Research Report > Experimental Study (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Becoming Experienced Judges: Selective Test-Time Learning for Evaluators

Jwa, Seungyeon, Ahn, Daechul, Kim, Reokyoung, Kang, Dongyeop, Choi, Jonghyun

arXiv.org Artificial Intelligence | Dec-9-2025

Automatic evaluation with large language models, commonly known as LLM-as-a-judge, is now standard across reasoning and alignment tasks. Despite evaluating many samples in deployment, these evaluators typically (i) treat each case independently, missing the opportunity to accumulate experience, and (ii) rely on a single fixed prompt for all cases, neglecting the need for sample-specific evaluation criteria. We introduce Learning While Evaluating (LWE), a framework that allows evaluators to improve sequentially at inference time without requiring training or validation sets. LWE maintains an evolving meta-prompt that (i) produces sample-specific evaluation instructions and (ii) refines itself through self-generated feedback. Furthermore, we propose Selective LWE, which updates the meta-prompt only on self-inconsistent cases, focusing computation where it matters most. This selective approach retains the benefits of sequential learning while being far more cost-effective. Across two pairwise comparison benchmarks, Selective LWE outperforms strong baselines, empirically demonstrating that evaluators can improve during sequential testing with a simple selective update, learning most from the cases they struggle with.
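The selective update described in this abstract can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the `judge` and `refine` callables, the toy judge, and the order-swap test for "self-inconsistent" are all hypothetical stand-ins, since the abstract does not say how inconsistency is detected or how the meta-prompt is rewritten.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
# a judge maps (meta_prompt, answer_a, answer_b) to "A" or "B";
# a refiner maps (meta_prompt, answer_a, answer_b) to a new meta-prompt.
Judge = Callable[[str, str, str], str]
Refiner = Callable[[str, str, str], str]


def selective_lwe(
    pairs: List[Tuple[str, str]],
    judge: Judge,
    refine: Refiner,
    meta_prompt: str,
) -> Tuple[List[str], str, int]:
    """Judge pairs in sequence, refining the meta-prompt only on inconsistent cases."""
    verdicts: List[str] = []
    updates = 0
    for a, b in pairs:
        first = judge(meta_prompt, a, b)
        # Assumed consistency check: ask again with the candidates swapped,
        # then map the verdict back to the original labels.
        swapped = judge(meta_prompt, b, a)
        swapped_back = "A" if swapped == "B" else "B"
        if first != swapped_back:
            # Self-inconsistent case: spend the extra computation here only.
            meta_prompt = refine(meta_prompt, a, b)
            updates += 1
            first = judge(meta_prompt, a, b)
        verdicts.append(first)
    return verdicts, meta_prompt, updates


def toy_judge(prompt: str, a: str, b: str) -> str:
    """Toy judge: position-biased until the prompt tells it to compare length."""
    if "compare length" in prompt:
        return "A" if len(a) >= len(b) else "B"
    return "A"


def toy_refine(prompt: str, a: str, b: str) -> str:
    """Toy refiner: appends one fixed instruction to the meta-prompt."""
    return prompt + " compare length"


pairs = [("short", "longer answer"), ("much longer answer", "tiny")]
verdicts, final_prompt, updates = selective_lwe(
    pairs, toy_judge, toy_refine, "Judge the pair."
)
print(verdicts, updates)
```

In this toy run the first pair triggers a refinement because the position-biased judge contradicts itself when the order is swapped; the second pair is judged consistently and costs no update.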

large language model, machine learning, natural language, (16 more...)

arXiv:2512.06751

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Eye of Judgement: Dissecting the Evaluation of Russian-speaking LLMs with POLLUX

Martynov, Nikita, Mordasheva, Anastasia, Gorbetskiy, Dmitriy, Astafurov, Danil, Isaeva, Ulyana, Basyrova, Elina, Skachkov, Sergey, Berestova, Victoria, Ivanov, Nikolay, Zanina, Valeriia, Fenogenova, Alena

arXiv.org Artificial Intelligence | Dec-2-2025

The full statistics of all the criteria grouped by the panel assignments are presented in Table 7. Tables 8 and A.1 report the statistics of the generated scores and rationales for criteria annotation. As we can see, the distributions of criterion-based scores for most criteria are largely comparable between expert-written and synthetic datasets, despite the underlying evaluated instruction-answer pairs being entirely distinct and non-overlapping. This is particularly evident in the mean, standard deviation, and mode of scores, which, across a wide range of criteria types, demonstrate close alignment, suggesting that criterion-level assessment remains consistent across both data sources. Tables 8 and A.1 suggest that synthetically generated texts (both instructions and rationales) are lengthier while being less original than those written by the experts. The tables also show that DeepSeek-R1 tends to assign a mediocre score of 1 rather than choosing extreme values. Despite these statistical and stylistic differences in commentary, the synthetic dataset remains a viable resource for training the LLM-as-a-Judge Family, especially considering the overall similarity in criterion-based scores. Thus, while the expert-written feedback exhibits optimized brevity and contextual appropriateness, the synthetic commentary maintains an adequate level of informativeness and coherence.
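The comparison of mean, standard deviation, and mode mentioned above could be computed along these lines. This is only a sketch: the score lists are made-up toy values rather than data from the paper, the 0 to 2 scale is an assumption, and `summarize` is a hypothetical helper name.

```python
from statistics import mean, mode, pstdev


def summarize(scores):
    """Return the three summary statistics named in the text for one score list."""
    return {
        "mean": mean(scores),
        "std": pstdev(scores),  # population standard deviation
        "mode": mode(scores),
    }


# Toy scores for a single criterion (invented for illustration only).
expert_scores = [0, 1, 1, 2, 1, 2]
synthetic_scores = [1, 1, 1, 2, 0, 1]

expert_stats = summarize(expert_scores)
synthetic_stats = summarize(synthetic_scores)

# Per-statistic absolute gap between the two sources; small gaps would
# correspond to the "close alignment" the passage describes.
gaps = {k: abs(expert_stats[k] - synthetic_stats[k]) for k in expert_stats}
print(expert_stats, synthetic_stats, gaps)
```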

criteria, large language model, machine learning, (20 more...)

arXiv:2505.24616

Country:

Europe (1.00)
North America > Mexico (0.28)
Asia > Middle East (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

UniEdit: A Unified Knowledge Editing Benchmark for Large Language Models

Chen, Qizhou, Wang, Dakan, Zhang, Taolin, Yan, Zaoming, You, Chengsong, Wang, Chengyu, He, Xiaofeng

arXiv.org Artificial Intelligence | Nov-12-2025

Model editing aims to enhance the accuracy and reliability of large language models (LLMs) by efficiently adjusting their internal parameters. Currently, most LLM editing datasets are confined to narrow knowledge domains and cover a limited range of editing evaluation. They often overlook the broad scope of editing demands and the diversity of ripple effects resulting from edits. In this context, we introduce UniEdit, a unified benchmark for LLM editing grounded in open-domain knowledge. First, we construct editing samples by selecting entities from 25 common domains across five major categories, utilizing the extensive triple knowledge available in open-domain knowledge graphs to ensure comprehensive coverage of the knowledge domains. To address the issues of generality and locality in editing, we design a Neighborhood Multi-hop Chain Sampling (NMCS) algorithm to sample subgraphs based on a given knowledge piece to entail comprehensive ripple effects to evaluate. Finally, we employ proprietary LLMs to convert the sampled knowledge subgraphs into natural language text, guaranteeing grammatical accuracy and syntactical diversity. Extensive statistical analysis confirms the scale, comprehensiveness, and diversity of our UniEdit benchmark. We conduct comprehensive experiments across multiple LLMs and editors, analyzing their performance to highlight strengths and weaknesses in editing across open knowledge domains and various evaluation criteria, thereby offering valuable insights for future research endeavors.
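To make the idea of sampling a multi-hop chain around an edited triple more concrete, here is a generic random-walk sketch over a tiny triple store. It is not the NMCS algorithm, whose details the abstract does not give; the function names, the toy triples, and the rule of never revisiting an entity are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def build_index(triples):
    """Index (head, relation, tail) triples by head entity."""
    index = defaultdict(list)
    for head, rel, tail in triples:
        index[head].append((rel, tail))
    return index


def sample_chain(index, start, max_hops, rng):
    """Walk up to max_hops outgoing edges from start without revisiting entities."""
    chain = []
    visited = {start}
    current = start
    for _ in range(max_hops):
        candidates = [(r, t) for r, t in index.get(current, []) if t not in visited]
        if not candidates:
            break  # dead end: no unvisited neighbor to extend the chain
        rel, tail = rng.choice(candidates)
        chain.append((current, rel, tail))
        visited.add(tail)
        current = tail
    return chain


# Toy knowledge graph (invented for illustration).
triples = [
    ("Paris", "capital_of", "France"),
    ("France", "member_of", "EU"),
    ("France", "continent", "Europe"),
]
index = build_index(triples)
chain = sample_chain(index, "Paris", max_hops=2, rng=random.Random(0))
print(chain)
```

Each sampled chain is a connected path starting at the edited entity, which is the kind of neighborhood a ripple-effect test question could be built from.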

large language model, machine learning, natural language, (19 more...)

arXiv:2505.12345

Country:

North America > United States (1.00)
Asia > China (0.68)
Asia > Middle East > UAE (0.28)

Genre: Research Report > Experimental Study (1.00)

Industry:

Law (1.00)
Education (1.00)
Information Technology (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
