exam
Scientists' First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning
Scientific discoveries increasingly rely on complex multimodal reasoning based on information-intensive scientific data and domain-specific expertise. Empowered by expert-level scientific benchmarks, scientific Multimodal Large Language Models (MLLMs) hold the potential to significantly enhance this discovery process in realistic workflows. However, current scientific benchmarks mostly focus on evaluating the knowledge understanding capabilities of MLLMs, leading to an inadequate assessment of their perception and reasoning abilities. To address this gap, we present the Scientists' First Exam (SFE) benchmark, designed to evaluate the scientific cognitive capacities of MLLMs through three interconnected levels: perception, understanding, and reasoning. Specifically, SFE comprises 830 expert-verified VQA pairs across three question types, spanning 66 multimodal tasks across five high-value disciplines. Extensive experiments reveal that current GPT-o3 and InternVL-3 achieve only 34.08% and 26.52% on SFE, highlighting significant room for MLLMs to improve in scientific realms. We hope the insights obtained in SFE will facilitate further developments in AI-enhanced scientific discoveries.
Basketball-playing robot built by sixth-formers wins tech competition
Meet the UK's very own LeBron James... but not as you know it. Look out LeBron James and Michael Jordan, there's a new basketball champ around. But it was made in Lisburn rather than Los Angeles or Chicago. The name 25416 may not appear on many replica vests, but it can shoot hoops like no-one else. And the basketball-playing robot won a school in Lisburn first prize at the UK-wide First Tech Challenge robotics competition. The team of sixth-formers from Friends' School came top of 48 schools from across the UK at the competition held in London's Copper Box Arena. "Going down and working on it with my friends is honestly one of the highlights of my last year in school," he said.
AIhub coffee corner: AI, kids, and the future – "generation AI"
This month we tackle the topic of young people and what AI tools mean for their future. Joining the conversation this time are: Sanmay Das (Virginia Tech), Tom Dietterich (Oregon State University), Sabine Hauert (University of Bristol), Michael Littman (Brown University), and Ella Scallan (AIhub). As AI tools have become ubiquitous, we've seen growing concern and increasing coverage about how the use of such tools from a formative age might affect children. What do you think the impact will be and what skills might young people need to navigate this AI world? I met up with a bunch of high school friends when I was last in Switzerland and they were all wondering what their kids should study. They were wondering if they should do social science, seeing as AI tools have become adept at many tasks, such as coding, writing, art, etc. I think that we need social sciences, but that we also need people who know the technology and who can continue developing it. I say they should continue doing whatever they're interested in and those jobs will evolve and they'll look different, but there will still be a whole wealth of different types of jobs.
Giving a 140 pound stingray a check up requires 8 people
The male leopard whiptail ray also boasts a four-foot-three-inch wingspan. Leopard whiptail rays have spotted skin and a long, thin tail they use for balance, steering, and defense. Getting that annual check-up can feel daunting for anyone. At the weight of an adult human with a four-foot-three-inch wingspan, just moving the giant fish from its habitat to an exam pool is an exercise in teamwork.
Reasoning Models Ace the CFA Exams
Patel, Jaisal, Chen, Yunzhe, He, Kaiwen, Wang, Keyi, Li, David, Xiao, Kairong, Liu, Xiao-Yang
Previous research has reported that large language models (LLMs) demonstrate poor performance on the Chartered Financial Analyst (CFA) exams. However, recent reasoning models have achieved strong results on graduate-level academic and professional examinations across various disciplines. In this paper, we evaluate state-of-the-art reasoning models on a set of mock CFA exams consisting of 980 questions across three Level I exams, two Level II exams, and three Level III exams. Using the same pass/fail criteria from prior studies, we find that most models clear all three levels. The models that pass, ordered by overall performance, are Gemini 3.0 Pro, Gemini 2.5 Pro, GPT-5, Grok 4, Claude Opus 4.1, and DeepSeek-V3.1. Specifically, Gemini 3.0 Pro achieves a record score of 97.6% on Level I. Performance is also strong on Level II, led by GPT-5 at 94.3%. On Level III, Gemini 2.5 Pro attains the highest score with 86.4% on multiple-choice questions while Gemini 3.0 Pro achieves 92.0% on constructed-response questions.
CODE-II: A large-scale dataset for artificial intelligence in ECG analysis
Abreu, Petrus E. O. G. B., Paixão, Gabriela M. M., Li, Jiawei, Gomes, Paulo R., Macfarlane, Peter W., Oliveira, Ana C. S., Carvalho, Vinicius T., Schön, Thomas B., Ribeiro, Antonio Luiz P., Ribeiro, Antônio H.
Data-driven methods for electrocardiogram (ECG) interpretation are rapidly progressing. Large datasets have enabled advances in artificial intelligence (AI) based ECG analysis, yet limitations in annotation quality, size, and scope remain major challenges. Here we present CODE-II, a large-scale real-world dataset of 2,735,269 12-lead ECGs from 2,093,807 adult patients collected by the Telehealth Network of Minas Gerais (TNMG), Brazil. Each exam was annotated using standardized diagnostic criteria and reviewed by cardiologists. A defining feature of CODE-II is a set of 66 clinically meaningful diagnostic classes, developed with cardiologist input and routinely used in telehealth practice. We additionally provide an openly available subset, CODE-II-open, comprising 15,000 patients, and CODE-II-test, a non-overlapping set of 8,475 exams reviewed by multiple cardiologists for blinded evaluation. A neural network pre-trained on CODE-II achieved superior transfer performance on external benchmarks (PTB-XL and CPSC 2018) and outperformed alternatives trained on larger datasets.
HEAD-QA v2: Expanding a Healthcare Benchmark for Reasoning
Correa-Guillén, Alexis, Gómez-Rodríguez, Carlos, Vilares, David
We introduce HEAD-QA v2, an expanded and updated version of a Spanish/English healthcare multiple-choice reasoning dataset originally released by Vilares and Gómez-Rodríguez (2019). The update responds to the growing need for high-quality datasets that capture the linguistic and conceptual complexity of healthcare reasoning. We extend the dataset to over 12,000 questions from ten years of Spanish professional exams, benchmark several open-source LLMs using prompting, RAG, and probability-based answer selection, and provide additional multilingual versions to support future work. Results indicate that performance is mainly driven by model scale and intrinsic reasoning ability, with complex inference strategies obtaining limited gains. Together, these results establish HEAD-QA v2 as a reliable resource for advancing research on biomedical reasoning and model improvement.