AIhub coffee corner: AI, kids, and the future – "generation AI"

AIHub

This month we tackle the topic of young people and what AI tools mean for their future. Joining the conversation this time are: Sanmay Das (Virginia Tech), Tom Dietterich (Oregon State University), Sabine Hauert (University of Bristol), Michael Littman (Brown University), and Ella Scallan (AIhub).

As AI tools have become ubiquitous, we've seen growing concern and increasing coverage about how the use of such tools from a formative age might affect children. What do you think the impact will be, and what skills might young people need to navigate this AI world?

I met up with a bunch of high school friends when I was last in Switzerland and they were all wondering what their kids should study. They were wondering if they should do social science, seeing as AI tools have become adept at many tasks, such as coding, writing, and art. I think that we need social sciences, but that we also need people who know the technology and who can continue developing it. I say they should continue doing whatever they're interested in; those jobs will evolve and they'll look different, but there will still be a whole wealth of different types of jobs.



Trump pitches cognitive tests for leaders, questions if Harris, Walz, Newsom could pass

FOX News

President Donald Trump proposes mandatory cognitive tests for all presidents and vice presidents while criticizing California Gov. Gavin Newsom and other Democrats at GOP retreat.


Giving a 140 pound stingray a check up requires 8 people

Popular Science

The male leopard whiptail ray boasts a four-foot-three-inch wingspan. Leopard whiptail rays have spotted skin and a long, thin tail they use for balance, steering, and defense. Getting that annual check-up can feel daunting for anyone. At the weight of an adult human, just moving the giant fish from its habitat to an exam pool is an exercise in teamwork.


Reasoning Models Ace the CFA Exams

Patel, Jaisal, Chen, Yunzhe, He, Kaiwen, Wang, Keyi, Li, David, Xiao, Kairong, Liu, Xiao-Yang

arXiv.org Artificial Intelligence

Previous research has reported that large language models (LLMs) demonstrate poor performance on the Chartered Financial Analyst (CFA) exams. However, recent reasoning models have achieved strong results on graduate-level academic and professional examinations across various disciplines. In this paper, we evaluate state-of-the-art reasoning models on a set of mock CFA exams consisting of 980 questions across three Level I exams, two Level II exams, and three Level III exams. Using the same pass/fail criteria from prior studies, we find that most models clear all three levels. The models that pass, ordered by overall performance, are Gemini 3.0 Pro, Gemini 2.5 Pro, GPT-5, Grok 4, Claude Opus 4.1, and DeepSeek-V3.1. Specifically, Gemini 3.0 Pro achieves a record score of 97.6% on Level I. Performance is also strong on Level II, led by GPT-5 at 94.3%. On Level III, Gemini 2.5 Pro attains the highest score with 86.4% on multiple-choice questions while Gemini 3.0 Pro achieves 92.0% on constructed-response questions.


CODE-II: A large-scale dataset for artificial intelligence in ECG analysis

Abreu, Petrus E. O. G. B., Paixão, Gabriela M. M., Li, Jiawei, Gomes, Paulo R., Macfarlane, Peter W., Oliveira, Ana C. S., Carvalho, Vinicius T., Schön, Thomas B., Ribeiro, Antonio Luiz P., Ribeiro, Antônio H.

arXiv.org Artificial Intelligence

Data-driven methods for electrocardiogram (ECG) interpretation are rapidly progressing. Large datasets have enabled advances in artificial intelligence (AI) based ECG analysis, yet limitations in annotation quality, size, and scope remain major challenges. Here we present CODE-II, a large-scale real-world dataset of 2,735,269 12-lead ECGs from 2,093,807 adult patients collected by the Telehealth Network of Minas Gerais (TNMG), Brazil. Each exam was annotated using standardized diagnostic criteria and reviewed by cardiologists. A defining feature of CODE-II is a set of 66 clinically meaningful diagnostic classes, developed with cardiologist input and routinely used in telehealth practice. We additionally provide two openly available subsets: CODE-II-open, a public subset covering 15,000 patients, and CODE-II-test, a non-overlapping set of 8,475 exams reviewed by multiple cardiologists for blinded evaluation. A neural network pre-trained on CODE-II achieved superior transfer performance on external benchmarks (PTB-XL and CPSC 2018) and outperformed alternatives trained on larger datasets.
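The non-overlap between the public subset and the blinded test set matters: with multiple exams per patient, splits must be made at the patient level, not the exam level, or the same patient's ECGs can leak across sets. A minimal sketch of such a split, with toy data and function names that are ours rather than the paper's:

```python
import random

def patient_level_split(exams, n_open_patients, seed=0):
    """Split exams by patient so the open subset and the held-out test set
    share no patients (the kind of non-overlap CODE-II enforces).
    `exams` is a list of (patient_id, exam_id) pairs; all values are toy."""
    patients = sorted({pid for pid, _ in exams})
    rng = random.Random(seed)  # fixed seed for a reproducible split
    rng.shuffle(patients)
    open_ids = set(patients[:n_open_patients])
    open_set = [e for e in exams if e[0] in open_ids]
    test_set = [e for e in exams if e[0] not in open_ids]
    return open_set, test_set

# 100 toy patients with 2 exams each: 30 patients go to the open subset.
exams = [(pid, f"exam{pid}-{k}") for pid in range(100) for k in range(2)]
open_set, test_set = patient_level_split(exams, n_open_patients=30)
```

Splitting on patient IDs rather than exam IDs is what guarantees that evaluation on the test set measures generalization to unseen patients.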


HEAD-QA v2: Expanding a Healthcare Benchmark for Reasoning

Correa-Guillén, Alexis, Gómez-Rodríguez, Carlos, Vilares, David

arXiv.org Artificial Intelligence

We introduce HEAD-QA v2, an expanded and updated version of a Spanish/English healthcare multiple-choice reasoning dataset originally released by Vilares and Gómez-Rodríguez (2019). The update responds to the growing need for high-quality datasets that capture the linguistic and conceptual complexity of healthcare reasoning. We extend the dataset to over 12,000 questions from ten years of Spanish professional exams, benchmark several open-source LLMs using prompting, RAG, and probability-based answer selection, and provide additional multilingual versions to support future work. Results indicate that performance is mainly driven by model scale and intrinsic reasoning ability, with complex inference strategies obtaining limited gains. Together, these results establish HEAD-QA v2 as a reliable resource for advancing research on biomedical reasoning and model improvement.
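Probability-based answer selection, one of the strategies benchmarked above, typically scores each multiple-choice option by the log-likelihood an LLM assigns to it and picks the highest-scoring option. A minimal sketch with made-up log-probabilities; the function name, length normalization, and numbers are illustrative assumptions, not the paper's exact procedure:

```python
def select_answer(option_logprobs):
    """Given summed per-token log-probabilities for each candidate option
    (as scored by an LLM), length-normalize and return the best option."""
    scores = {
        label: sum(lps) / len(lps)  # average log-probability per token
        for label, lps in option_logprobs.items()
    }
    return max(scores, key=scores.get), scores

# Toy per-token log-probabilities for a 4-option question.
mock = {
    "A": [-2.1, -1.8, -0.9],
    "B": [-0.4, -0.7],
    "C": [-3.0, -2.5, -2.2, -1.9],
    "D": [-1.5, -1.6],
}
best, scores = select_answer(mock)  # best == "B" (highest average logprob)
```

Length normalization keeps longer options from being penalized simply for containing more tokens, a common design choice in this style of evaluation.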


Assessing the Capability of LLMs in Solving POSCOMP Questions

Viegas, Cayo, Gheyi, Rohit, Ribeiro, Márcio

arXiv.org Artificial Intelligence

Recent advancements in Large Language Models (LLMs) have significantly expanded the capabilities of artificial intelligence in natural language processing tasks. Despite this progress, their performance in specialized domains such as computer science remains relatively unexplored. Understanding the proficiency of LLMs in these domains is critical for evaluating their practical utility and guiding future developments. The POSCOMP, a prestigious Brazilian examination used for graduate admissions in computer science, promoted by the Brazilian Computer Society (SBC), provides a challenging benchmark. This study investigates whether LLMs can match or surpass human performance on the POSCOMP exam. Four LLMs - ChatGPT-4, Gemini 1.0 Advanced, Claude 3 Sonnet, and Le Chat Mistral Large - were initially evaluated on the 2022 and 2023 POSCOMP exams. The assessments measured the models' proficiency in handling complex questions typical of the exam. LLM performance was notably better on text-based questions than on image interpretation tasks. In the 2022 exam, ChatGPT-4 led with 57 correct answers out of 69 questions, followed by Gemini 1.0 Advanced (49), Le Chat Mistral (48), and Claude 3 Sonnet (44). Similar trends were observed in the 2023 exam, where ChatGPT-4 achieved the highest performance, surpassing all students who took the POSCOMP 2023 exam. LLMs, particularly ChatGPT-4, show promise in text-based tasks on the POSCOMP exam, although image interpretation remains a challenge. Given the rapid evolution of LLMs, we expanded our analysis to include more recent models - o1, Gemini 2.5 Pro, Claude 3.7 Sonnet, and o3-mini-high - evaluated on the 2022-2024 POSCOMP exams. These newer models demonstrate further improvements and consistently surpass both the average and top-performing human participants across all three years.
The POSCOMP [1] is a prestigious assessment designed to test the knowledge of prospective computer science graduate students, promoted by the Brazilian Computer Society (SBC). It serves as an entry criterion for many graduate programs across Brazil. Using this exam as a benchmark for evaluating Large Language Models (LLMs) allows for a direct comparison between AI capabilities and human standards, offering valuable insights into the strengths and limitations of current AI models. Recent advancements in LLMs [2], [3] have significantly expanded the capabilities of Artificial Intelligence (AI), particularly in natural language processing tasks.


NLP Methods May Actually Be Better Than Professors at Estimating Question Difficulty

Zotos, Leonidas, de Jong, Ivo Pascal, Valdenegro-Toro, Matias, Sburlea, Andreea Ioana, Nissim, Malvina, van Rijn, Hedderik

arXiv.org Artificial Intelligence

Estimating the difficulty of exam questions is essential for developing good exams, but professors are not always good at this task. We compare various Large Language Model-based methods with three professors on their ability to estimate what percentage of students will answer True/False exam questions correctly in the areas of Neural Networks and Machine Learning. Our results show that the professors have limited ability to distinguish between easy and difficult questions, and that they are outperformed by directly asking Gemini 2.5 to solve this task. Yet we obtained even better results by using the uncertainties of the LLMs solving the questions as features in a supervised learning setting, with only 42 training samples. We conclude that supervised learning using LLM uncertainty can help professors better estimate the difficulty of exam questions, improving the quality of assessment.
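The supervised setup can be pictured as regressing observed student accuracy on an LLM uncertainty feature, for instance the entropy of the model's True/False answer distribution. A toy sketch under that assumption; the feature choice, data, and closed-form regression are ours, not necessarily the paper's exact pipeline:

```python
import math

def entropy(p_true):
    """Shannon entropy of a model's True/False answer distribution,
    used as an uncertainty feature (high entropy = model is unsure)."""
    p = min(max(p_true, 1e-9), 1 - 1e-9)  # clamp to avoid log(0)
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def fit_simple_regression(xs, ys):
    """Ordinary least squares for a single feature: y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# Toy training pairs: (LLM P(answer = True), fraction of students correct).
train = [(0.99, 0.95), (0.95, 0.90), (0.70, 0.75), (0.55, 0.60), (0.52, 0.55)]
xs = [entropy(p) for p, _ in train]
ys = [acc for _, acc in train]
a, b = fit_simple_regression(xs, ys)

# Predict the difficulty of a new question from the LLM's uncertainty alone.
pred = a * entropy(0.6) + b
```

The fitted slope is negative on this toy data: the more uncertain the LLM, the fewer students the regression predicts will answer correctly, which is the intuition behind using model uncertainty as a difficulty proxy.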


Artificial-Intelligence Grading Assistance for Handwritten Components of a Calculus Exam

Kortemeyer, Gerd, Caspar, Alexander, Horica, Daria

arXiv.org Artificial Intelligence

We investigate whether contemporary multimodal LLMs can assist with grading open-ended calculus at scale without eroding validity. In a large first-year exam, students' handwritten work was graded by GPT-5 against the same rubric used by teaching assistants (TAs), with fractional credit permitted; TA rubric decisions served as ground truth. We calibrated a human-in-the-loop filter that combines a partial-credit threshold with an Item Response Theory (2PL) risk measure based on the deviation between the AI score and the model-expected score for each student-item. Unfiltered AI-TA agreement was moderate, adequate for low-stakes feedback but not for high-stakes use. Confidence filtering made the workload-quality trade-off explicit: under stricter settings, AI delivered human-level accuracy, but also left roughly 70% of the items to be graded by humans. Psychometric patterns were constrained by low stakes on the open-ended portion, a small set of rubric checkpoints, and occasional misalignment between designated answer regions and where work appeared. Practical adjustments, such as slightly higher weight and protected time, a few rubric-visible substeps, and stronger spatial anchoring, should raise ceiling performance. Overall, calibrated confidence and conservative routing enable AI to reliably handle a sizable subset of routine cases while reserving expert judgment for ambiguous or pedagogically rich responses.
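The routing idea can be sketched in a few lines: compute the 2PL model-expected score for a student-item pair, measure how far the AI's grade deviates from it, and send near-threshold or high-risk items to a human. The thresholds, parameter values, and function names below are illustrative assumptions, not the paper's calibration:

```python
import math

def p_correct_2pl(theta, a, b):
    """2PL item response function: P(correct) for a student of ability
    theta on an item with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def route_item(ai_score, theta, a, b,
               credit_threshold=0.5, risk_threshold=0.35):
    """Flag an AI-graded item for human review when the awarded fraction of
    credit sits near the partial-credit threshold, or when it deviates too
    far from the 2PL model-expected score for this student-item."""
    expected = p_correct_2pl(theta, a, b)
    risk = abs(ai_score - expected)
    needs_human = abs(ai_score - credit_threshold) < 0.1 or risk > risk_threshold
    return needs_human, risk

# A strong student (theta = 1.5) on an easy item (b = -0.5): the model expects
# a high score, so an AI grade of zero credit gets routed to a human.
flagged, risk = route_item(ai_score=0.0, theta=1.5, a=1.2, b=-0.5)
# The same item with a high AI grade matches expectation and passes through.
unflagged, _ = route_item(ai_score=0.9, theta=1.5, a=1.2, b=-0.5)
```

Tightening `risk_threshold` trades grading workload for accuracy, which is exactly the trade-off the abstract describes: stricter settings reach human-level accuracy but route most items back to humans.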