AIhub coffee corner: AI, kids, and the future – "generation AI"

AIHub

This month we tackle the topic of young people and what AI tools mean for their future. Joining the conversation this time are: Sanmay Das (Virginia Tech), Tom Dietterich (Oregon State University), Sabine Hauert (University of Bristol), Michael Littman (Brown University), and Ella Scallan (AIhub).

As AI tools have become ubiquitous, we've seen growing concern and increasing coverage about how the use of such tools from a formative age might affect children. What do you think the impact will be, and what skills might young people need to navigate this AI world?

I met up with a bunch of high school friends when I was last in Switzerland and they were all wondering what their kids should study. They were wondering if they should do social science, seeing as AI tools have become adept at many tasks, such as coding, writing, and art. I think that we need social sciences, but that we also need people who know the technology and who can continue developing it. I say they should continue doing whatever they're interested in; those jobs will evolve and they'll look different, but there will still be a whole wealth of different types of jobs.



Trump pitches cognitive tests for leaders, questions if Harris, Walz, Newsom could pass

FOX News

President Donald Trump proposes mandatory cognitive tests for all presidents and vice presidents while criticizing California Gov. Gavin Newsom and other Democrats at GOP retreat.


Giving a 140 pound stingray a check up requires 8 people

Popular Science

The male leopard whiptail ray boasts a four-foot-three-inch wingspan. Leopard whiptail rays have spotted skin and a long, thin tail they use for balance, steering, and defense. Getting that annual check-up can feel daunting for anyone. At the weight of an adult human, just moving the giant fish from its habitat to an exam pool is an exercise in teamwork.


Reasoning Models Ace the CFA Exams

Patel, Jaisal, Chen, Yunzhe, He, Kaiwen, Wang, Keyi, Li, David, Xiao, Kairong, Liu, Xiao-Yang

arXiv.org Artificial Intelligence

Previous research has reported that large language models (LLMs) demonstrate poor performance on the Chartered Financial Analyst (CFA) exams. However, recent reasoning models have achieved strong results on graduate-level academic and professional examinations across various disciplines. In this paper, we evaluate state-of-the-art reasoning models on a set of mock CFA exams consisting of 980 questions across three Level I exams, two Level II exams, and three Level III exams. Using the same pass/fail criteria from prior studies, we find that most models clear all three levels. The models that pass, ordered by overall performance, are Gemini 3.0 Pro, Gemini 2.5 Pro, GPT-5, Grok 4, Claude Opus 4.1, and DeepSeek-V3.1. Specifically, Gemini 3.0 Pro achieves a record score of 97.6% on Level I. Performance is also strong on Level II, led by GPT-5 at 94.3%. On Level III, Gemini 2.5 Pro attains the highest score with 86.4% on multiple-choice questions while Gemini 3.0 Pro achieves 92.0% on constructed-response questions.


CODE-II: A large-scale dataset for artificial intelligence in ECG analysis

Abreu, Petrus E. O. G. B., Paixão, Gabriela M. M., Li, Jiawei, Gomes, Paulo R., Macfarlane, Peter W., Oliveira, Ana C. S., Carvalho, Vinicius T., Schön, Thomas B., Ribeiro, Antonio Luiz P., Ribeiro, Antônio H.

arXiv.org Artificial Intelligence

Data-driven methods for electrocardiogram (ECG) interpretation are rapidly progressing. Large datasets have enabled advances in artificial intelligence (AI) based ECG analysis, yet limitations in annotation quality, size, and scope remain major challenges. Here we present CODE-II, a large-scale real-world dataset of 2,735,269 12-lead ECGs from 2,093,807 adult patients collected by the Telehealth Network of Minas Gerais (TNMG), Brazil. Each exam was annotated using standardized diagnostic criteria and reviewed by cardiologists. A defining feature of CODE-II is a set of 66 clinically meaningful diagnostic classes, developed with cardiologist input and routinely used in telehealth practice. We additionally provide two openly available subsets: CODE-II-open, a public subset covering 15,000 patients, and CODE-II-test, a non-overlapping set of 8,475 exams reviewed by multiple cardiologists for blinded evaluation. A neural network pre-trained on CODE-II achieved superior transfer performance on external benchmarks (PTB-XL and CPSC 2018) and outperformed alternatives trained on larger datasets.
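The non-overlap between the public subset and the blinded test set matters: with multiple exams per patient, splits must be made at the patient level, not the exam level, or the same patient's ECGs can leak across sets. A minimal sketch of such a split, with toy data and function names that are ours rather than the paper's:

```python
import random

def patient_level_split(exams, n_open_patients, seed=0):
    """Split exams by patient so the open subset and the held-out test set
    share no patients (the kind of non-overlap CODE-II enforces).
    `exams` is a list of (patient_id, exam_id) pairs; all values are toy."""
    patients = sorted({pid for pid, _ in exams})
    rng = random.Random(seed)  # fixed seed for a reproducible split
    rng.shuffle(patients)
    open_ids = set(patients[:n_open_patients])
    open_set = [e for e in exams if e[0] in open_ids]
    test_set = [e for e in exams if e[0] not in open_ids]
    return open_set, test_set

# 100 toy patients with 2 exams each: 30 patients go to the open subset.
exams = [(pid, f"exam{pid}-{k}") for pid in range(100) for k in range(2)]
open_set, test_set = patient_level_split(exams, n_open_patients=30)
```

Splitting on patient IDs rather than exam IDs is what guarantees that evaluation on the test set measures generalization to unseen patients.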


HEAD-QA v2: Expanding a Healthcare Benchmark for Reasoning

Correa-Guillén, Alexis, Gómez-Rodríguez, Carlos, Vilares, David

arXiv.org Artificial Intelligence

We introduce HEAD-QA v2, an expanded and updated version of a Spanish/English healthcare multiple-choice reasoning dataset originally released by Vilares and Gómez-Rodríguez (2019). The update responds to the growing need for high-quality datasets that capture the linguistic and conceptual complexity of healthcare reasoning. We extend the dataset to over 12,000 questions from ten years of Spanish professional exams, benchmark several open-source LLMs using prompting, RAG, and probability-based answer selection, and provide additional multilingual versions to support future work. Results indicate that performance is mainly driven by model scale and intrinsic reasoning ability, with complex inference strategies obtaining limited gains. Together, these results establish HEAD-QA v2 as a reliable resource for advancing research on biomedical reasoning and model improvement.
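Probability-based answer selection, one of the strategies benchmarked above, typically scores each multiple-choice option by the log-likelihood an LLM assigns to it and picks the highest-scoring option. A minimal sketch with made-up log-probabilities; the function name, length normalization, and numbers are illustrative assumptions, not the paper's exact procedure:

```python
def select_answer(option_logprobs):
    """Given summed per-token log-probabilities for each candidate option
    (as scored by an LLM), length-normalize and return the best option."""
    scores = {
        label: sum(lps) / len(lps)  # average log-probability per token
        for label, lps in option_logprobs.items()
    }
    return max(scores, key=scores.get), scores

# Toy per-token log-probabilities for a 4-option question.
mock = {
    "A": [-2.1, -1.8, -0.9],
    "B": [-0.4, -0.7],
    "C": [-3.0, -2.5, -2.2, -1.9],
    "D": [-1.5, -1.6],
}
best, scores = select_answer(mock)  # best == "B" (highest average logprob)
```

Length normalization keeps longer options from being penalized simply for containing more tokens, a common design choice in this style of evaluation.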


Assessing the Capability of LLMs in Solving POSCOMP Questions

Viegas, Cayo, Gheyi, Rohit, Ribeiro, Márcio

arXiv.org Artificial Intelligence

Recent advancements in Large Language Models (LLMs) have significantly expanded the capabilities of artificial intelligence in natural language processing tasks. Despite this progress, their performance in specialized domains such as computer science remains relatively unexplored. Understanding the proficiency of LLMs in these domains is critical for evaluating their practical utility and guiding future developments. The POSCOMP, a prestigious Brazilian examination used for graduate admissions in computer science, promoted by the Brazilian Computer Society (SBC), provides a challenging benchmark. This study investigates whether LLMs can match or surpass human performance on the POSCOMP exam. Four LLMs - ChatGPT-4, Gemini 1.0 Advanced, Claude 3 Sonnet, and Le Chat Mistral Large - were initially evaluated on the 2022 and 2023 POSCOMP exams. The assessments measured the models' proficiency in handling complex questions typical of the exam. LLM performance was notably better on text-based questions than on image interpretation tasks. In the 2022 exam, ChatGPT-4 led with 57 correct answers out of 69 questions, followed by Gemini 1.0 Advanced (49), Le Chat Mistral (48), and Claude 3 Sonnet (44). Similar trends were observed in the 2023 exam, where ChatGPT-4 achieved the highest performance, surpassing all students who took the POSCOMP 2023 exam. LLMs, particularly ChatGPT-4, show promise in text-based tasks on the POSCOMP exam, although image interpretation remains a challenge. Given the rapid evolution of LLMs, we expanded our analysis to include more recent models - o1, Gemini 2.5 Pro, Claude 3.7 Sonnet, and o3-mini-high - evaluated on the 2022-2024 POSCOMP exams. These newer models demonstrate further improvements and consistently surpass both the average and top-performing human participants across all three years.
The POSCOMP [1] is a prestigious assessment designed to test the knowledge of prospective computer science graduate students, promoted by the Brazilian Computer Society (SBC). It serves as an entry criterion for many graduate programs across Brazil. Using this exam as a benchmark for evaluating Large Language Models (LLMs) allows for a direct comparison between AI capabilities and human standards, offering valuable insights into the strengths and limitations of current AI models. Recent advancements in LLMs [2], [3] have significantly expanded the capabilities of Artificial Intelligence (AI), particularly in natural language processing tasks.


NLP Methods May Actually Be Better Than Professors at Estimating Question Difficulty

Zotos, Leonidas, de Jong, Ivo Pascal, Valdenegro-Toro, Matias, Sburlea, Andreea Ioana, Nissim, Malvina, van Rijn, Hedderik

arXiv.org Artificial Intelligence

Estimating the difficulty of exam questions is essential for developing good exams, but professors are not always good at this task. We compare various Large Language Model-based methods with three professors on their ability to estimate what percentage of students will answer True/False exam questions correctly in the areas of Neural Networks and Machine Learning. Our results show that the professors have limited ability to distinguish between easy and difficult questions, and that they are outperformed by directly asking Gemini 2.5 to solve this task. Yet we obtained even better results by using the uncertainties of the LLMs solving the questions as features in a supervised learning setting, with only 42 training samples. We conclude that supervised learning using LLM uncertainty can help professors better estimate the difficulty of exam questions, improving the quality of assessment.
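The supervised setup can be pictured as regressing observed student accuracy on an LLM uncertainty feature, for instance the entropy of the model's True/False answer distribution. A toy sketch under that assumption; the feature choice, data, and closed-form regression are ours, not necessarily the paper's exact pipeline:

```python
import math

def entropy(p_true):
    """Shannon entropy of a model's True/False answer distribution,
    used as an uncertainty feature (high entropy = model is unsure)."""
    p = min(max(p_true, 1e-9), 1 - 1e-9)  # clamp to avoid log(0)
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def fit_simple_regression(xs, ys):
    """Ordinary least squares for a single feature: y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# Toy training pairs: (LLM P(answer = True), fraction of students correct).
train = [(0.99, 0.95), (0.95, 0.90), (0.70, 0.75), (0.55, 0.60), (0.52, 0.55)]
xs = [entropy(p) for p, _ in train]
ys = [acc for _, acc in train]
a, b = fit_simple_regression(xs, ys)

# Predict the difficulty of a new question from the LLM's uncertainty alone.
pred = a * entropy(0.6) + b
```

The fitted slope is negative on this toy data: the more uncertain the LLM, the fewer students the regression predicts will answer correctly, which is the intuition behind using model uncertainty as a difficulty proxy.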


Artificial-Intelligence Grading Assistance for Handwritten Components of a Calculus Exam

Kortemeyer, Gerd, Caspar, Alexander, Horica, Daria

arXiv.org Artificial Intelligence

We investigate whether contemporary multimodal LLMs can assist with grading open-ended calculus at scale without eroding validity. In a large first-year exam, students' handwritten work was graded by GPT-5 against the same rubric used by teaching assistants (TAs), with fractional credit permitted; TA rubric decisions served as ground truth. We calibrated a human-in-the-loop filter that combines a partial-credit threshold with an Item Response Theory (2PL) risk measure based on the deviation between the AI score and the model-expected score for each student-item. Unfiltered AI-TA agreement was moderate, adequate for low-stakes feedback but not for high-stakes use. Confidence filtering made the workload-quality trade-off explicit: under stricter settings, AI delivered human-level accuracy, but also left roughly 70% of the items to be graded by humans. Psychometric patterns were constrained by low stakes on the open-ended portion, a small set of rubric checkpoints, and occasional misalignment between designated answer regions and where work appeared. Practical adjustments, such as slightly higher weight and protected time, a few rubric-visible substeps, and stronger spatial anchoring, should raise ceiling performance. Overall, calibrated confidence and conservative routing enable AI to reliably handle a sizable subset of routine cases while reserving expert judgment for ambiguous or pedagogically rich responses.
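The routing idea can be sketched in a few lines: compute the 2PL model-expected score for a student-item pair, measure how far the AI's grade deviates from it, and send near-threshold or high-risk items to a human. The thresholds, parameter values, and function names below are illustrative assumptions, not the paper's calibration:

```python
import math

def p_correct_2pl(theta, a, b):
    """2PL item response function: P(correct) for a student of ability
    theta on an item with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def route_item(ai_score, theta, a, b,
               credit_threshold=0.5, risk_threshold=0.35):
    """Flag an AI-graded item for human review when the awarded fraction of
    credit sits near the partial-credit threshold, or when it deviates too
    far from the 2PL model-expected score for this student-item."""
    expected = p_correct_2pl(theta, a, b)
    risk = abs(ai_score - expected)
    needs_human = abs(ai_score - credit_threshold) < 0.1 or risk > risk_threshold
    return needs_human, risk

# A strong student (theta = 1.5) on an easy item (b = -0.5): the model expects
# a high score, so an AI grade of zero credit gets routed to a human.
flagged, risk = route_item(ai_score=0.0, theta=1.5, a=1.2, b=-0.5)
# The same item with a high AI grade matches expectation and passes through.
unflagged, _ = route_item(ai_score=0.9, theta=1.5, a=1.2, b=-0.5)
```

Tightening `risk_threshold` trades grading workload for accuracy, which is exactly the trade-off the abstract describes: stricter settings reach human-level accuracy but route most items back to humans.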