Goto

Collaborating Authors

 exam


Basketball-playing robot built by sixth-formers wins tech competition

BBC News

Meet the UK's very own LeBron James... but not as you know it Look out LeBron James and Michael Jordan, there's a new basketball champ around. But it was made in Lisburn rather than Los Angeles or Chicago. The name 25416 may not appear on many replica vests, but it can shoot hoops like no-one else. And the basketball-playing robot won a school in Lisburn first prize at the UK-wide First Tech Challenge robotics competition. The team of sixth-formers from Friends' School came top of 48 schools from across the UK at the competition held in London's Copper Box Arena. Going down and working on it with my friends is honestly one of the highlights of my last year in school, he said.



AIhub coffee corner: AI, kids, and the future – "generation AI"

AIHub

This month we tackle the topic of young people and what AI tools mean for their future. Joining the conversation this time are: Sanmay Das (Virginia Tech), Tom Dietterich (Oregon State University), Sabine Hauert (University of Bristol), Michael Littman (Brown University), and Ella Scallan (AIhub). As AI tools have become ubiquitous, we've seen growing concern and increasing coverage about how the use of such tools from a formative age might affect children. What do you think the impact will be and what skills might young people need to navigate this AI world? I met up with a bunch of high school friends when I was last in Switzerland and they were all wondering what their kids should study. They were wondering if they should do social science, seeing as AI tools have become adept at many tasks, such as coding, writing, art, etc. I think that we need social sciences, but that we also need people who know the technology and who can continue developing it. I say they should continue doing whatever they're interested in and those jobs will evolve and they'll look different, but there will still be a whole wealth of different types of jobs.




Giving a 140 pound stingray a check up requires 8 people

Popular Science

The male leopard whiptail ray also boasts a four-foot-three-inch wingspan. Leopard whiptail rays have spotted skin and a long, thin tail they use for balance, steering, and defense. Breakthroughs, discoveries, and DIY tips sent every weekday. Getting that annual check-up can feel daunting for anyone. At the weight of an adult human with a four-foot-three-inch wingspan, just moving the giant fish from its habitat to an exam pool is an exercise in teamwork.


Reasoning Models Ace the CFA Exams

arXiv.org Artificial Intelligence

Previous research has reported that large language models (LLMs) demonstrate poor performance on the Chartered Financial Analyst (CFA) exams. However, recent reasoning models have achieved strong results on graduate-level academic and professional examinations across various disciplines. In this paper, we evaluate state-of-the-art reasoning models on a set of mock CFA exams consisting of 980 questions across three Level I exams, two Level II exams, and three Level III exams. Using the same pass/fail criteria from prior studies, we find that most models clear all three levels. The models that pass, ordered by overall performance, are Gemini 3.0 Pro, Gemini 2.5 Pro, GPT-5, Grok 4, Claude Opus 4.1, and DeepSeek-V3.1. Specifically, Gemini 3.0 Pro achieves a record score of 97.6% on Level I. Performance is also strong on Level II, led by GPT-5 at 94.3%. On Level III, Gemini 2.5 Pro attains the highest score with 86.4% on multiple-choice questions while Gemini 3.0 Pro achieves 92.0% on constructed-response questions.


CODE-II: A large-scale dataset for artificial intelligence in ECG analysis

arXiv.org Artificial Intelligence

Data-driven methods for electrocardiogram (ECG) interpretation are rapidly progressing. Large datasets have enabled advances in artificial intelligence (AI) based ECG analysis, yet limitations in annotation quality, size, and scope remain major challenges. Here we present CODE-II, a large-scale real-world dataset of 2,735,269 12-lead ECGs from 2,093,807 adult patients collected by the Telehealth Network of Minas Gerais (TNMG), Brazil. Each exam was annotated using standardized diagnostic criteria and reviewed by cardiologists. A defining feature of CODE-II is a set of 66 clinically meaningful diagnostic classes, developed with cardiologist input and routinely used in telehealth practice. We additionally provide an open available subset: CODE-II-open, a public subset of 15,000 patients, and the CODE-II-test, a non-overlapping set of 8,475 exams reviewed by multiple cardiologists for blinded evaluation. A neural network pre-trained on CODE-II achieved superior transfer performance on external benchmarks (PTB-XL and CPSC 2018) and outperformed alternatives trained on larger datasets.


HEAD-QA v2: Expanding a Healthcare Benchmark for Reasoning

arXiv.org Artificial Intelligence

We introduce HEAD-QA v2, an expanded and updated version of a Spanish/English healthcare multiple-choice reasoning dataset originally released by Vilares and Gómez-Rodríguez (2019). The update responds to the growing need for high-quality datasets that capture the linguistic and conceptual complexity of healthcare reasoning. We extend the dataset to over 12,000 questions from ten years of Spanish professional exams, benchmark several open-source LLMs using prompting, RAG, and probability-based answer selection, and provide additional multilingual versions to support future work. Results indicate that performance is mainly driven by model scale and intrinsic reasoning ability, with complex inference strategies obtaining limited gains. Together, these results establish HEAD-QA v2 as a reliable resource for advancing research on biomedical reasoning and model improvement.


Assessing the Capability of LLMs in Solving POSCOMP Questions

arXiv.org Artificial Intelligence

--Recent advancements in Large Language Models (LLMs) have significantly expanded the capabilities of artificial intelligence in natural language processing tasks. Despite this progress, their performance in specialized domains such as computer science remains relatively unexplored. Understanding the proficiency of LLMs in these domains is critical for evaluating their practical utility and guiding future developments. The POSCOMP, a prestigious Brazilian examination used for graduate admissions in computer science promoted by the Brazlian Computer Society (SBC), provides a challenging benchmark. This study investigates whether LLMs can match or surpass human performance on the POSCOMP exam. Four LLMs - ChatGPT -4, Gemini 1.0 Advanced, Claude 3 Sonnet, and Le Chat Mistral Large - were initially evaluated on the 2022 and 2023 POSCOMP exams. The assessments measured the models' proficiency in handling complex questions typical of the exam. LLM performance was notably better on text-based questions than on image interpretation tasks. In the 2022 exam, ChatGPT - 4 led with 57 correct answers out of 69 questions, followed by Gemini 1.0 Advanced (49), Le Chat Mistral (48), and Claude 3 Sonnet (44). Similar trends were observed in the 2023 exam. ChatGPT -4 achieved the highest performance, surpassing all students who took the POSCOMP 2023 exam. LLMs, particularly ChatGPT -4, show promise in text-based tasks on the POSCOMP exam, although image interpretation remains a challenge. Given the rapid evolution of LLMs, we expanded our analysis to include more recent models - o1, Gemini 2.5 Pro, Claude 3.7 Sonnet, and o3-mini-high - evaluated on the 2022-2024 POSCOMP exams. These newer models demonstrate further improvements and consistently surpass both the average and top-performing human participants across all three years. The POSCOMP [1] is a prestigious assessment designed to test the knowledge of prospective computer science graduate students, promoted by the Brazilian Computer Society (SBC). It serves as an entry criterion for many graduate programs across Brazil. Using this exam as a benchmark for evaluating Large Language Models (LLMs) allows for a direct comparison between AI capabilities and human standards, offering valuable insights into the strengths and limitations of current AI models. Recent advancements in LLMs [2], [3] have significantly expanded the capabilities of Artificial Intelligence (AI), particularly in natural language processing tasks.