exam
AIhub coffee corner: AI, kids, and the future – "generation AI"
This month we tackle the topic of young people and what AI tools mean for their future. Joining the conversation this time are: Sanmay Das (Virginia Tech), Tom Dietterich (Oregon State University), Sabine Hauert (University of Bristol), Michael Littman (Brown University), and Ella Scallan (AIhub).

As AI tools have become ubiquitous, we've seen growing concern and increasing coverage about how the use of such tools from a formative age might affect children. What do you think the impact will be and what skills might young people need to navigate this AI world?

I met up with a bunch of high school friends when I was last in Switzerland and they were all wondering what their kids should study. They were wondering if they should do social science, seeing as AI tools have become adept at many tasks, such as coding, writing, art, etc. I think that we need social sciences, but that we also need people who know the technology and who can continue developing it. I say they should continue doing whatever they're interested in and those jobs will evolve and they'll look different, but there will still be a whole wealth of different types of jobs.
- North America > United States > Virginia (0.24)
- North America > United States > Oregon (0.24)
- Europe > Switzerland (0.24)
- (2 more...)
- Health & Medicine (0.67)
- Education > Assessment & Standards (0.67)
- Education > Educational Setting > K-12 Education > Secondary School (0.47)
- North America > United States > California (0.25)
- South America > Venezuela (0.06)
- North America > United States > North Carolina > Mecklenburg County > Charlotte (0.04)
- (4 more...)
- Media > News (1.00)
- Government > Regional Government > North America Government > United States Government (1.00)
Giving a 140-pound stingray a check-up requires 8 people
Getting that annual check-up can feel daunting for anyone. At the weight of an adult human, and with a four-foot-three-inch wingspan, the male leopard whiptail ray is no easy patient: just moving the giant fish from its habitat to an exam pool is an exercise in teamwork. Leopard whiptail rays have spotted skin and a long, thin tail they use for balance, steering, and defense.
- South America > Chile (0.05)
- South America > Brazil (0.05)
- North America > United States > New Jersey (0.05)
Reasoning Models Ace the CFA Exams
Patel, Jaisal, Chen, Yunzhe, He, Kaiwen, Wang, Keyi, Li, David, Xiao, Kairong, Liu, Xiao-Yang
Previous research has reported that large language models (LLMs) demonstrate poor performance on the Chartered Financial Analyst (CFA) exams. However, recent reasoning models have achieved strong results on graduate-level academic and professional examinations across various disciplines. In this paper, we evaluate state-of-the-art reasoning models on a set of mock CFA exams consisting of 980 questions across three Level I exams, two Level II exams, and three Level III exams. Using the same pass/fail criteria from prior studies, we find that most models clear all three levels. The models that pass, ordered by overall performance, are Gemini 3.0 Pro, Gemini 2.5 Pro, GPT-5, Grok 4, Claude Opus 4.1, and DeepSeek-V3.1. Specifically, Gemini 3.0 Pro achieves a record score of 97.6% on Level I. Performance is also strong on Level II, led by GPT-5 at 94.3%. On Level III, Gemini 2.5 Pro attains the highest score with 86.4% on multiple-choice questions while Gemini 3.0 Pro achieves 92.0% on constructed-response questions.
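As a rough illustration of how such mock-exam results can be aggregated, the sketch below tallies per-level accuracy over hypothetical question records and applies an assumed 70% pass threshold. This is a toy example for illustration only; the threshold and data layout are assumptions, not the pass/fail criteria the paper reuses from prior studies.

```python
from collections import defaultdict

# Assumed pass threshold for illustration, not the criterion used in the paper.
PASS_THRESHOLD = 0.70

def score_by_level(records):
    """Return per-level accuracy and a pass/fail flag for each level.

    records: iterable of (level, model_answer, correct_answer) tuples.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for level, model_answer, correct_answer in records:
        totals[level] += 1
        correct[level] += int(model_answer == correct_answer)
    results = {}
    for level in sorted(totals):
        accuracy = correct[level] / totals[level]
        results[level] = (accuracy, accuracy >= PASS_THRESHOLD)
    return results

if __name__ == "__main__":
    mock = [("I", "A", "A"), ("I", "B", "C"), ("II", "D", "D"), ("III", "B", "B")]
    for level, (acc, passed) in score_by_level(mock).items():
        print(f"Level {level}: {acc:.1%} {'pass' if passed else 'fail'}")
```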
- North America > United States > North Carolina > Orange County > Chapel Hill (0.04)
- North America > United States > New York > Rensselaer County > Troy (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- Asia > South Korea (0.04)
CODE-II: A large-scale dataset for artificial intelligence in ECG analysis
Abreu, Petrus E. O. G. B., Paixão, Gabriela M. M., Li, Jiawei, Gomes, Paulo R., Macfarlane, Peter W., Oliveira, Ana C. S., Carvalho, Vinicius T., Schön, Thomas B., Ribeiro, Antonio Luiz P., Ribeiro, Antônio H.
Data-driven methods for electrocardiogram (ECG) interpretation are rapidly progressing. Large datasets have enabled advances in artificial intelligence (AI) based ECG analysis, yet limitations in annotation quality, size, and scope remain major challenges. Here we present CODE-II, a large-scale real-world dataset of 2,735,269 12-lead ECGs from 2,093,807 adult patients collected by the Telehealth Network of Minas Gerais (TNMG), Brazil. Each exam was annotated using standardized diagnostic criteria and reviewed by cardiologists. A defining feature of CODE-II is a set of 66 clinically meaningful diagnostic classes, developed with cardiologist input and routinely used in telehealth practice. We additionally provide two openly available subsets: CODE-II-open, a public subset of 15,000 patients, and CODE-II-test, a non-overlapping set of 8,475 exams reviewed by multiple cardiologists for blinded evaluation. A neural network pre-trained on CODE-II achieved superior transfer performance on external benchmarks (PTB-XL and CPSC 2018) and outperformed alternatives trained on larger datasets.
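To illustrate the pre-train-then-transfer pattern the authors describe, here is a minimal PyTorch sketch: pre-train a classifier on the 66 CODE-II classes, then swap the head and fine-tune on a downstream label set (e.g. the five PTB-XL superclasses). The architecture, weight file name, and tensor shapes are placeholders, not the network released with the dataset.

```python
import torch
import torch.nn as nn

class ECGNet(nn.Module):
    """Toy 1D-CNN ECG classifier used only to show the transfer pattern."""
    def __init__(self, n_classes: int, n_leads: int = 12):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv1d(n_leads, 32, kernel_size=15, stride=2, padding=7),
            nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=15, stride=2, padding=7),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
        )
        self.head = nn.Linear(64, n_classes)

    def forward(self, x):  # x: (batch, leads, samples)
        return self.head(self.backbone(x))

# 1) Pre-train on the 66 CODE-II diagnostic classes (weight file is hypothetical).
model = ECGNet(n_classes=66)
# model.load_state_dict(torch.load("code2_pretrained.pt"))

# 2) Transfer: keep the backbone, replace the head, and fine-tune end to end.
model.head = nn.Linear(64, 5)  # e.g. PTB-XL diagnostic superclasses
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.BCEWithLogitsLoss()  # multi-label ECG diagnoses

# One illustrative training step on random tensors standing in for real ECGs.
x = torch.randn(8, 12, 4096)              # 8 ECGs, 12 leads
y = torch.randint(0, 2, (8, 5)).float()   # multi-label targets
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
print(f"fine-tuning loss: {loss.item():.3f}")
```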
- South America > Brazil > Minas Gerais (0.24)
- Europe > Germany (0.04)
- Asia > China > Zhejiang Province > Ningbo (0.04)
- (8 more...)
- Research Report > Experimental Study (1.00)
- Overview (1.00)
- Research Report > New Finding (0.93)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)
HEAD-QA v2: Expanding a Healthcare Benchmark for Reasoning
Correa-Guillén, Alexis, Gómez-Rodríguez, Carlos, Vilares, David
We introduce HEAD-QA v2, an expanded and updated version of a Spanish/English healthcare multiple-choice reasoning dataset originally released by Vilares and Gómez-Rodríguez (2019). The update responds to the growing need for high-quality datasets that capture the linguistic and conceptual complexity of healthcare reasoning. We extend the dataset to over 12,000 questions from ten years of Spanish professional exams, benchmark several open-source LLMs using prompting, RAG, and probability-based answer selection, and provide additional multilingual versions to support future work. Results indicate that performance is mainly driven by model scale and intrinsic reasoning ability, with complex inference strategies obtaining limited gains. Together, these results establish HEAD-QA v2 as a reliable resource for advancing research on biomedical reasoning and model improvement.
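One of the benchmarked strategies, probability-based answer selection, can be sketched as follows: score each option by the log-likelihood a causal LM assigns to it when appended to the question, then pick the highest-scoring option. The model name, the example question, and the length normalisation are illustrative choices, not necessarily those used in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; any causal LM follows the same pattern
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

@torch.no_grad()
def option_logprob(question: str, option: str) -> float:
    """Average log-probability of the option tokens given the question."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + " " + option, return_tensors="pt").input_ids
    logits = model(full_ids).logits[0]
    option_len = full_ids.shape[1] - prompt_ids.shape[1]
    log_probs = torch.log_softmax(logits[:-1], dim=-1)
    targets = full_ids[0, 1:]
    token_lp = log_probs[torch.arange(targets.shape[0]), targets]
    return token_lp[-option_len:].mean().item()  # length-normalised score

question = "Which vitamin deficiency causes scurvy?"
options = {"A": "Vitamin C", "B": "Vitamin D", "C": "Vitamin K", "D": "Vitamin B12"}
scores = {k: option_logprob(question, v) for k, v in options.items()}
print(max(scores, key=scores.get), scores)
```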
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Texas > Travis County > Austin (0.04)
- (10 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.97)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.97)
Assessing the Capability of LLMs in Solving POSCOMP Questions
Viegas, Cayo, Gheyi, Rohit, Ribeiro, Márcio
Recent advancements in Large Language Models (LLMs) have significantly expanded the capabilities of artificial intelligence in natural language processing tasks. Despite this progress, their performance in specialized domains such as computer science remains relatively unexplored. Understanding the proficiency of LLMs in these domains is critical for evaluating their practical utility and guiding future developments. The POSCOMP, a prestigious Brazilian examination promoted by the Brazilian Computer Society (SBC) and used as an entry criterion for many graduate programs in computer science across Brazil, provides a challenging benchmark. This study investigates whether LLMs can match or surpass human performance on the POSCOMP exam. Four LLMs (ChatGPT-4, Gemini 1.0 Advanced, Claude 3 Sonnet, and Le Chat Mistral Large) were initially evaluated on the 2022 and 2023 POSCOMP exams. The assessments measured the models' proficiency in handling complex questions typical of the exam. LLM performance was notably better on text-based questions than on image interpretation tasks. In the 2022 exam, ChatGPT-4 led with 57 correct answers out of 69 questions, followed by Gemini 1.0 Advanced (49), Le Chat Mistral (48), and Claude 3 Sonnet (44). Similar trends were observed in the 2023 exam, where ChatGPT-4 achieved the highest performance, surpassing all students who took the POSCOMP 2023 exam. LLMs, particularly ChatGPT-4, show promise in text-based tasks on the POSCOMP exam, although image interpretation remains a challenge. Given the rapid evolution of LLMs, we expanded our analysis to include more recent models (o1, Gemini 2.5 Pro, Claude 3.7 Sonnet, and o3-mini-high) evaluated on the 2022-2024 POSCOMP exams. These newer models demonstrate further improvements and consistently surpass both the average and top-performing human participants across all three years.
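As a toy illustration of the kind of comparison reported, the snippet below tallies a model's correct answers split by question type and checks the total against human statistics. All question data and human figures here are made up for illustration; they are not POSCOMP results.

```python
# Hypothetical human statistics (correct answers out of 69), for illustration only.
HUMAN_MEAN, HUMAN_TOP = 30, 55

answers = {  # question id -> (question type, correct option); toy data
    1: ("text", "A"), 2: ("text", "C"), 3: ("image", "B"),
}
model_responses = {"model-x": {1: "A", 2: "C", 3: "D"}}

for model, responses in model_responses.items():
    by_type = {"text": [0, 0], "image": [0, 0]}  # [correct, total] per type
    for qid, (qtype, gold) in answers.items():
        by_type[qtype][1] += 1
        by_type[qtype][0] += int(responses.get(qid) == gold)
    total_correct = sum(c for c, _ in by_type.values())
    verdict = "beats top human" if total_correct > HUMAN_TOP else (
        "beats human mean" if total_correct > HUMAN_MEAN else "below human mean")
    print(model, by_type, verdict)
```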
- South America > Brazil (0.25)
- Asia > Japan > Honshū > Chūbu > Toyama Prefecture > Toyama (0.04)
- South America > Peru (0.04)
- (2 more...)
- Education > Assessment & Standards (0.46)
- Education > Educational Setting > Higher Education (0.34)
NLP Methods May Actually Be Better Than Professors at Estimating Question Difficulty
Zotos, Leonidas, de Jong, Ivo Pascal, Valdenegro-Toro, Matias, Sburlea, Andreea Ioana, Nissim, Malvina, van Rijn, Hedderik
Estimating the difficulty of exam questions is essential for developing good exams, but professors are not always good at this task. We compare various Large Language Model-based methods with three professors in their ability to estimate what percentage of students will give correct answers on True/False exam questions in the areas of Neural Networks and Machine Learning. Our results show that the professors have limited ability to distinguish between easy and difficult questions and that they are outperformed by directly asking Gemini 2.5 to solve this task. Yet, we obtained even better results by using the uncertainties of the LLMs that solve the questions as features in a supervised learning setting, with only 42 training samples. We conclude that supervised learning using LLM uncertainty can help professors better estimate the difficulty of exam questions, improving the quality of assessment.
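A minimal sketch of that supervised setting might look like the following: a small regression model maps per-question LLM uncertainty features to the fraction of students answering correctly. The feature choices and the synthetic data below are assumptions standing in for the paper's actual features and student response rates.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
n_train, n_test = 42, 20  # 42 training items, as in the abstract

# Hypothetical features per question: [probability the LLM assigns to its own
# answer, answer-distribution entropy across repeated samples].
X_train = rng.uniform(size=(n_train, 2))
y_train = 0.4 + 0.5 * X_train[:, 0] - 0.2 * X_train[:, 1] + rng.normal(0, 0.05, n_train)
X_test = rng.uniform(size=(n_test, 2))
y_test = 0.4 + 0.5 * X_test[:, 0] - 0.2 * X_test[:, 1] + rng.normal(0, 0.05, n_test)

model = Ridge(alpha=1.0).fit(X_train, y_train)
pred = np.clip(model.predict(X_test), 0.0, 1.0)  # predicted fraction correct
print(f"MAE on held-out items: {mean_absolute_error(y_test, pred):.3f}")
```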
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > Mexico > Mexico City > Mexico City (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- (6 more...)
Artificial-Intelligence Grading Assistance for Handwritten Components of a Calculus Exam
Kortemeyer, Gerd, Caspar, Alexander, Horica, Daria
We investigate whether contemporary multimodal LLMs can assist with grading open-ended calculus at scale without eroding validity. In a large first-year exam, students' handwritten work was graded by GPT-5 against the same rubric used by teaching assistants (TAs), with fractional credit permitted; TA rubric decisions served as ground truth. We calibrated a human-in-the-loop filter that combines a partial-credit threshold with an Item Response Theory (2PL) risk measure based on the deviation between the AI score and the model-expected score for each student-item. Unfiltered AI-TA agreement was moderate, adequate for low-stakes feedback but not for high-stakes use. Confidence filtering made the workload-quality trade-off explicit: under stricter settings, AI delivered human-level accuracy, but also left roughly 70% of the items to be graded by humans. Psychometric patterns were constrained by low stakes on the open-ended portion, a small set of rubric checkpoints, and occasional misalignment between designated answer regions and where work appeared. Practical adjustments, such as slightly higher weight and protected time, a few rubric-visible substeps, and stronger spatial anchoring, should raise ceiling performance. Overall, calibrated confidence and conservative routing enable AI to reliably handle a sizable subset of routine cases while reserving expert judgment for ambiguous or pedagogically rich responses.
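The routing logic can be sketched as follows: send an AI grade to human review when it is fractional (partial credit) or when it deviates too far from the score a 2PL IRT model expects for that student-item pair. The partial-credit band, deviation threshold, and item parameters below are illustrative assumptions, not the calibrated values from the study.

```python
import math

def expected_score_2pl(theta: float, a: float, b: float) -> float:
    """Expected (0..1) score under a 2PL model for ability theta on item (a, b)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def needs_human_review(ai_score: float, theta: float, a: float, b: float,
                       partial_band=(0.15, 0.85), max_deviation=0.5) -> bool:
    if partial_band[0] < ai_score < partial_band[1]:
        return True  # fractional credit: ambiguous, route to a TA
    deviation = abs(ai_score - expected_score_2pl(theta, a, b))
    return deviation > max_deviation  # AI disagrees with the psychometric expectation

# A strong student (theta = 1.2) scored 0.0 by the AI on an easy item: flagged.
print(needs_human_review(0.0, theta=1.2, a=1.3, b=-0.5))   # True
# A weak student scored 1.0 on a hard item: also suspicious.
print(needs_human_review(1.0, theta=-1.0, a=1.1, b=1.5))   # True
# An unambiguous full-credit score matching expectation passes through.
print(needs_human_review(1.0, theta=1.2, a=1.3, b=-0.5))   # False
```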
- North America > United States > Michigan (0.04)
- North America > United States > New York (0.04)
- North America > United States > District of Columbia > Washington (0.04)
- (2 more...)
- Instructional Material (0.68)
- Research Report (0.50)
- Education > Assessment & Standards (0.93)
- Education > Curriculum > Subject-Specific Education (0.70)
- Education > Educational Setting (0.66)