AITopics | evaluation process

evaluation process

Bias in Evaluation Processes: An Optimization-Based Model

Neural Information Processing Systems | Apr-30-2026, 02:40:51 GMT

Biases with respect to socially-salient attributes of individuals have been well documented in evaluation processes used in settings such as admissions and hiring. We view such an evaluation process as a transformation of a distribution of the true utility of an individual for a task to an observed distribution and model it as a solution to a loss minimization problem subject to an information constraint. Our model has two parameters that have been identified as factors leading to biases: the resource-information trade-off parameter in the information constraint and the risk-averseness parameter in the loss function. We characterize the distributions that arise from our model and study the effect of the parameters on the observed distribution. The outputs of our model enrich the class of distributions that can be used to capture variation across groups in the observed evaluations. We empirically validate our model by fitting real-world datasets and use it to study the effect of interventions in a downstream selection task. These results contribute to an understanding of the emergence of bias in evaluation processes and provide tools to guide the deployment of interventions to mitigate biases.

artificial intelligence, information management, machine learning, (19 more...)

Neural Information Processing Systems

Country:

North America > United States (1.00)
Asia (0.67)
Europe > United Kingdom > England (0.28)

Genre: Research Report > New Finding (0.45)

Industry:

Government > Regional Government > North America Government > United States Government (1.00)
Education > Educational Setting (1.00)
Law > Civil Rights & Constitutional Law (0.67)
Health & Medicine (0.67)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.70)
Information Technology > Information Management (0.67)

Bias in Evaluation Processes: An Optimization-Based Model

L. Elisa Celis (Yale University), Amit Kumar (IIT Delhi), Anay Mehrotra (Yale University), Nisheeth K. Vishnoi (Yale University)

Neural Information Processing Systems | Feb-17-2026, 16:07:32 GMT

In these processes, an evaluator estimates an individual's value to an institution.

artificial intelligence, intervention, machine learning, (18 more...)

Neural Information Processing Systems

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
North America > United States > California (0.04)
(8 more...)

Genre: Research Report > New Finding (0.67)

Industry:

Government > Regional Government > North America Government > United States Government (1.00)
Education > Educational Setting (1.00)
Law > Civil Rights & Constitutional Law (0.67)
Health & Medicine (0.67)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.70)

Improving Auto-Augment via Augmentation-Wise Weight Sharing

Neural Information Processing Systems | Dec-24-2025, 18:30:42 GMT

The recent progress on automatically searching augmentation policies has boosted the performance substantially for various tasks. A key component of automatic augmentation search is the evaluation process for a particular augmentation policy, which is utilized to return reward and usually runs thousands of times. A plain evaluation process, which includes full model training and validation, would be time-consuming. To achieve efficiency, many choose to sacrifice evaluation reliability for speed. In this paper, we dive into the dynamics of augmented training of the model. This inspires us to design a powerful and efficient proxy task based on the Augmentation-Wise Weight Sharing (AWS) to form a fast yet accurate evaluation process in an elegant way.

augmentation-wise weight, auto-augment, name change, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.43)

Stress Testing Deliberative Alignment for Anti-Scheming Training

Schoen, Bronson, Nitishinskaya, Evgenia, Balesni, Mikita, Højmark, Axel, Hofstätter, Felix, Scheurer, Jérémy, Meinke, Alexander, Wolfe, Jason, van der Weij, Teun, Lloyd, Alex, Goldowsky-Dill, Nicholas, Fan, Angela, Matveiakin, Andrei, Shah, Rusheb, Williams, Marcus, Glaese, Amelia, Barak, Boaz, Zaremba, Wojciech, Hobbhahn, Marius

arXiv.org Artificial Intelligence | Sep-22-2025

Highly capable AI systems could secretly pursue misaligned goals -- what we call "scheming". Because a scheming AI would deliberately try to hide its misaligned goals and actions, measuring and mitigating scheming requires different strategies than are typically used in ML. We propose that assessing anti-scheming interventions requires at least (1) testing propensity to scheme on far out-of-distribution (OOD) tasks, (2) evaluating whether lack of scheming is driven by situational awareness, and (3) checking for robustness to pre-existing misaligned goals. We use a broad category of "covert actions" -- such as secretly breaking rules or intentionally underperforming in tests -- as a proxy for scheming, and design evaluations for covert actions. We then stress-test deliberative alignment as a case study for anti-scheming. Across 26 OOD evaluations (180+ environments), deliberative alignment reduces covert action rates (OpenAI o3: 13%->0.4%) but does not fully eliminate them. Our mitigation is also able to largely stop agents from pursuing a hidden goal previously trained into the model, but we still find misbehavior after additional red-teaming. We find that models' chain-of-thought (CoT) often demonstrates awareness of being evaluated for alignment, and show causal evidence that this awareness decreases covert behavior, while unawareness increases it. Therefore, we cannot exclude that the observed reductions in covert action rates are at least partially driven by situational awareness. While we rely on human-legible CoT for training, studying situational awareness, and demonstrating clear evidence of misalignment, our ability to rely on this degrades as models continue to depart from reasoning in standard English. We encourage research into alignment mitigations for scheming and their assessment, especially for the adversarial case of deceptive alignment, which this paper does not address.
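The reported drop in covert action rate for OpenAI o3 (13% to 0.4%) can be restated as an absolute, relative, and fold reduction. The snippet below is only a worked-arithmetic sketch; the helper name `reduction_summary` is hypothetical and not taken from the paper.

```python
def reduction_summary(before: float, after: float) -> tuple[float, float, float]:
    """Return (absolute drop, relative drop, fold reduction) between two rates.

    Hypothetical helper for illustration; rates are fractions in (0, 1].
    """
    if before <= 0 or after <= 0:
        raise ValueError("rates must be positive")
    absolute = before - after      # percentage-point drop, as a fraction
    relative = absolute / before   # share of the original rate removed
    fold = before / after          # how many times smaller the new rate is
    return absolute, relative, fold


# Rates quoted in the abstract for OpenAI o3: 13% before, 0.4% after.
absolute, relative, fold = reduction_summary(0.13, 0.004)
print(f"absolute drop: {absolute:.3f}")
print(f"relative drop: {relative:.1%}")
print(f"fold reduction: {fold:.1f}x")
```

A large relative drop still leaves a non-zero residual rate, which matches the abstract's statement that covert actions are reduced but not fully eliminated.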

alignment training environment, large language model, machine learning, (21 more...)

arXiv.org Artificial Intelligence

2509.15541

Country: North America > United States (0.27)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Law (1.00)
Government > Military (0.74)
Government > Regional Government (0.67)
Health & Medicine > Pharmaceuticals & Biotechnology (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Online Submission and Evaluation System Design for Competition Operations

Chen, Zhe, Harabor, Daniel, Hechnenberger, Ryan, Sturtevant, Nathan R.

arXiv.org Artificial Intelligence | Jul-24-2025

Research communities have developed benchmark datasets across domains to compare the performance of algorithms and techniques. However, tracking the progress in these research areas is not easy, as publications appear in different venues at the same time, and many of them claim to represent the state-of-the-art. To address this, research communities often organise periodic competitions to evaluate the performance of various algorithms and techniques, thereby tracking advancements in the field. However, these competitions pose a significant operational burden. The organisers must manage and evaluate a large volume of submissions. Furthermore, participants typically develop their solutions in diverse environments, leading to compatibility issues during the evaluation of their submissions. This paper presents an online competition system that automates the submission and evaluation process for a competition. The competition system allows organisers to manage large numbers of submissions efficiently, utilising isolated environments to evaluate submissions. This system has already been used successfully for several competitions, including the Grid-Based Pathfinding Competition and the League of Robot Runners competition.

artificial intelligence, competition, programming language, (14 more...)

arXiv.org Artificial Intelligence

2507.1773

Country:

North America > United States (0.28)
North America > Canada (0.28)

Genre: Research Report (0.40)

Industry:

Information Technology (0.49)
Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.48)
Information Technology > Software > Programming Languages (0.47)

Enhancing Selection of Climate Tech Startups with AI -- A Case Study on Integrating Human and AI Evaluations in the ClimaTech Great Global Innovation Challenge

Turliuk, Jennifer, Sevilla, Alejandro, Gorza, Daniela, Hynes, Tod

arXiv.org Artificial Intelligence | May-29-2025

This case study examines the ClimaTech Great Global Innovation Challenge's approach to selecting climate tech startups by integrating human and AI evaluations. The competition aimed to identify top startups and enhance the accuracy and efficiency of the selection process through a hybrid model. Research shows data-driven approaches help VC firms reduce bias and improve decision-making. Machine learning models have outperformed human investors in deal screening, helping identify high-potential startups. Incorporating AI aimed to ensure more equitable and objective evaluations. The methodology included three phases: initial AI review, semi-finals judged by humans, and finals using a hybrid weighting. In phase one, 57 applications were scored by an AI tool built with StackAI and OpenAI's GPT-4o, and the top 36 advanced. In the semi-finals, human judges, unaware of AI scores, evaluated startups on team quality, market potential, and technological innovation. Each score - human or AI - was weighted equally, resulting in 75 percent human and 25 percent AI influence. In the finals, with five human judges, weighting shifted to 83.3 percent human and 16.7 percent AI. There was a moderate positive correlation between AI and human scores - Spearman's ρ = 0.47 - indicating general alignment with key differences. Notably, the final four startups, selected mainly by humans, were among those rated highest by the AI. This highlights the complementary nature of AI and human judgment. The study shows that hybrid models can streamline and improve startup assessments. The ClimaTech approach offers a strong framework for future competitions by combining human expertise with AI capabilities.
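The influence percentages quoted above follow directly from weighting every score equally: with n human scores and one AI score, the AI share is 1/(n+1). The sketch below illustrates that arithmetic. The function names are hypothetical, and the count of three human judges in the semi-finals is an inference from the stated 75/25 split, not a figure given in the abstract.

```python
def influence_shares(num_human_judges: int, num_ai_scores: int = 1) -> tuple[float, float]:
    """Return (human share, AI share) when every score carries equal weight."""
    total = num_human_judges + num_ai_scores
    if total <= 0:
        raise ValueError("need at least one score")
    return num_human_judges / total, num_ai_scores / total


def hybrid_score(human_scores: list[float], ai_score: float) -> float:
    """Equal-weight average of all human scores plus one AI score."""
    all_scores = list(human_scores) + [ai_score]
    return sum(all_scores) / len(all_scores)


# Finals: five human judges plus one AI score (stated in the abstract).
human, ai = influence_shares(5)
print(f"finals: {human:.1%} human, {ai:.1%} AI")

# Semi-finals: three human judges is an ASSUMPTION inferred from the 75/25 split.
human, ai = influence_shares(3)
print(f"semi-finals: {human:.1%} human, {ai:.1%} AI")

# Hypothetical scores for one startup, for illustration only.
print(hybrid_score([8.0, 6.0, 7.0], ai_score=9.0))
```

Under this scheme the AI's influence shrinks automatically as more human judges are added, which is why the finals weighting differs from the semi-finals without any explicit re-weighting step.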

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2505.21562

Genre: Research Report > Experimental Study (0.66)

Industry:

Banking & Finance > Trading (0.48)
Banking & Finance > Capital Markets (0.34)

Technology:

Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (0.62)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.55)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.55)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.34)

Evaluating the Performance of Nigerian Lecturers using Multilayer Perceptron

Ezeibe, I. E., Okide, S. O., Asogwa, D. C.

arXiv.org Artificial Intelligence | May-26-2025

Evaluating lecturer performance is essential for enhancing teaching quality, improving student learning outcomes, and strengthening an institution's reputation. In the absence of a dedicated system, lecturer performance evaluation has been neither comprehensive nor holistic. The system was built as a web-based platform with a secure database and uses a custom dataset that captures performance metrics including student evaluation scores, research publications, years of experience, and administrative duties. A Multilayer Perceptron (MLP) was used because of its ability to process complex data patterns and generate accurate predictions of a lecturer's performance from historical data. The research focused on designing multiple performance metrics beyond the standard ones, incorporating student participation, and integrating analytical tools to deliver a comprehensive and holistic evaluation of lecturers' performance; the system was developed using the Object-Oriented Analysis and Design (OOAD) methodology. The model evaluates lecturers' performance with an accuracy of about 91% compared with actual performance. It is concluded that the MLP enhanced lecturer performance evaluation by providing accurate predictions, reducing bias, and supporting data-driven decisions, ultimately improving the fairness and efficiency of the evaluation process. The MLP model was evaluated using Mean Squared Error (MSE) and Mean Absolute Error (MAE), achieving a test loss (MSE) of 256.99 and an MAE of 13.76, reflecting a high level of prediction accuracy. The model also demonstrated an estimated accuracy rate of approximately 96%, validating its effectiveness in predicting lecturer performance.
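For readers unfamiliar with the two error metrics named above, the following is a minimal sketch of how MSE and MAE are computed from paired actual and predicted scores. The sample values are hypothetical and are not drawn from the paper's dataset; only the reported MSE of 256.99 comes from the abstract.

```python
import math


def mean_squared_error(y_true: list[float], y_pred: list[float]) -> float:
    """Average of squared differences between actual and predicted values."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true: list[float], y_pred: list[float]) -> float:
    """Average of absolute differences between actual and predicted values."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical performance scores, for illustration only.
actual = [70.0, 85.0, 60.0, 90.0]
predicted = [65.0, 80.0, 75.0, 85.0]

print("MSE:", mean_squared_error(actual, predicted))
print("MAE:", mean_absolute_error(actual, predicted))

# MSE is in squared units; its square root (RMSE) is in the same units as MAE.
reported_mse = 256.99  # value quoted in the abstract
print("RMSE implied by the reported MSE:", round(math.sqrt(reported_mse), 2))
```

Taking the square root of the MSE puts it on the same scale as the MAE, which makes the two reported figures easier to compare.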

artificial intelligence, lecturer, machine learning, (14 more...)

arXiv.org Artificial Intelligence

2505.17143

Country:

Africa (0.31)
Asia > Indonesia > Java > West Java (0.15)

Genre:

Research Report (0.66)
Instructional Material > Course Syllabus & Notes (0.48)

Industry: Education (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Perceptrons (1.00)

Cer-Eval: Certifiable and Cost-Efficient Evaluation Framework for LLMs

Wang, Ganghua, Chen, Zhaorun, Li, Bo, Xu, Haifeng

arXiv.org Machine Learning | May-8-2025

As foundation models continue to scale, the size of trained models grows exponentially, presenting significant challenges for their evaluation. Current evaluation practices involve curating increasingly large datasets to assess the performance of large language models (LLMs). However, there is a lack of systematic analysis and guidance on determining the sufficiency of test data or selecting informative samples for evaluation. This paper introduces a certifiable and cost-efficient evaluation framework for LLMs. Our framework adapts to different evaluation objectives and outputs confidence intervals that contain true values with high probability. We use "test sample complexity" to quantify the number of test points needed for a certifiable evaluation and derive tight bounds on test sample complexity. Based on the developed theory, we develop a partition-based algorithm, named Cer-Eval, that adaptively selects test points to minimize the cost of LLM evaluation. Real-world experiments demonstrate that Cer-Eval can save 20% to 40% test points across various benchmarks, while maintaining an estimation error level comparable to the current evaluation process and providing a 95% confidence guarantee.
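To give a feel for what "number of test points needed for a certifiable evaluation" means, the sketch below uses the textbook Hoeffding inequality for scores bounded in [0, 1]. This is not the Cer-Eval algorithm or the paper's bound; it is a non-adaptive baseline under the assumption of independent test points, and the function names are hypothetical.

```python
import math


def hoeffding_sample_size(epsilon: float, delta: float) -> int:
    """Test points needed so the mean score is within epsilon of the true
    value with probability at least 1 - delta (scores assumed in [0, 1])."""
    if not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("epsilon and delta must lie in (0, 1)")
    return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))


def hoeffding_interval(scores: list[float], delta: float) -> tuple[float, float]:
    """Two-sided (1 - delta) confidence interval for the mean of [0, 1] scores."""
    if not scores:
        raise ValueError("need at least one score")
    n = len(scores)
    mean = sum(scores) / n
    half_width = math.sqrt(math.log(2 / delta) / (2 * n))
    # Clip to [0, 1] because the true mean cannot lie outside that range.
    return max(0.0, mean - half_width), min(1.0, mean + half_width)


# Points needed for +/- 0.05 error at 95% confidence under this simple bound.
print(hoeffding_sample_size(epsilon=0.05, delta=0.05))

# Hypothetical per-example correctness scores, for illustration only.
print(hoeffding_interval([1.0, 0.0, 1.0, 1.0], delta=0.05))
```

A distribution-free bound like this ignores any structure in the test set, which is the kind of slack an adaptive method that selects informative test points aims to reduce.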

artificial intelligence, large language model, natural language, (17 more...)

arXiv.org Machine Learning

2505.03814

Country: