Collaborating Authors

 cabot


Cross-Platform Evaluation of Reasoning Capabilities in Foundation Models

de Curtò, J., de Zarzà, I., García, Pablo, Cabot, Jordi

arXiv.org Artificial Intelligence

This paper presents a comprehensive cross-platform evaluation of reasoning capabilities in contemporary foundation models, establishing an infrastructure-agnostic benchmark across three computational paradigms: HPC supercomputing (MareNostrum 5), cloud platforms (Nebius AI Studio), and university clusters (a node with eight H200 GPUs). We evaluate 15 foundation models across 79 problems spanning eight academic domains (Physics, Mathematics, Chemistry, Economics, Biology, Statistics, Calculus, and Optimization) through three experimental phases: (1) Baseline establishment: Six models (Mixtral-8x7B, Phi-3, LLaMA 3.1-8B, Gemma-2-9b, Mistral-7B, OLMo-7B) evaluated on 19 problems using MareNostrum 5, establishing methodology and reference performance; (2) Infrastructure validation: The 19-problem benchmark repeated on university cluster (seven models including Falcon-Mamba state-space architecture) and Nebius AI Studio (nine state-of-the-art models: Hermes-4 70B/405B, LLaMA 3.1-405B/3.3-70B, Qwen3 30B/235B, DeepSeek-R1, GPT-OSS 20B/120B) to confirm infrastructure-agnostic reproducibility; (3) Extended evaluation: Full 79-problem assessment on both university cluster and Nebius platforms, probing generalization at scale across architectural diversity. The findings challenge conventional scaling assumptions, establish training data quality as more critical than model size, and provide actionable guidelines for model selection across educational, production, and research contexts. The tri-infrastructure methodology and 79-problem benchmark enable longitudinal tracking of reasoning capabilities as foundation models evolve.


If A.I. Can Diagnose Patients, What Are Doctors For?

The New Yorker

Large language models are transforming medicine--but the technology comes with side effects. "I'm worried these tools will erode my ability to make an independent diagnosis," a medical student said. In 2017, Matthew Williams, a thirtysomething software engineer with an athletic build and a bald head, went for a long bike ride in the hills of San Francisco. Afterward, at dinner with some friends, he ordered a hamburger, fries, and a milkshake. Midway through the meal, he felt so full that he had to ask someone to drive him home. That night, Williams awoke with a sharp pain in his abdomen that he worried was appendicitis. He went to a nearby emergency clinic, where doctors told him that he was probably constipated. They gave him some laxatives and sent him on his way. A few hours later, Williams's pain intensified. He vomited and felt as though his stomach might burst. A friend took him to a hospital, where a CT scan revealed cecal volvulus--a medical emergency in which part of the intestine twists in on itself, cutting off the digestive tract. The previous medical team had missed the condition, and may even have exacerbated it by giving him laxatives. Williams was rushed to the operating room, where surgeons removed about six feet of his intestines. After recovering from surgery, Williams began to experience severe diarrhea almost every time he ate. Doctors told him that his bowel just needed time to heal. "It got to the point where I couldn't go out, because I would constantly eat something that would make me sick," he said.


Advancing Medical Artificial Intelligence Using a Century of Cases

Buckley, Thomas A., Conci, Riccardo, Brodeur, Peter G., Gusdorf, Jason, Beltrán, Sourik, Behrouzi, Bita, Crowe, Byron, Dockterman, Jacob, Muhammad, Muzzammil, Ohnigian, Sarah, Sanchez, Andrew, Diao, James A., Shah, Aashna P., Restrepo, Daniel, Rosenberg, Eric S., Lea, Andrew S., Zitnik, Marinka, Podolsky, Scott H., Kanjee, Zahir, Abdulnour, Raja-Elie E., Koshy, Jacob M., Rodman, Adam, Manrai, Arjun K.

arXiv.org Artificial Intelligence

BACKGROUND: For over a century, the New England Journal of Medicine Clinicopathological Conferences (CPCs) have tested the reasoning of expert physicians and, recently, artificial intelligence (AI). However, prior AI evaluations have focused on final diagnoses without addressing the multifaceted reasoning and presentation skills required of expert discussants. METHODS: Using 7102 CPCs (1923-2025) and 1021 Image Challenges (2006-2025), we conducted extensive physician annotation and automated processing to create CPC-Bench, a physician-validated benchmark spanning 10 text-based and multimodal tasks, against which we evaluated leading large language models (LLMs). Then, we developed "Dr. CaBot," an AI discussant designed to produce written and slide-based video presentations using only the case presentation, modeling the role of the human expert in these cases. RESULTS: When challenged with 377 contemporary CPCs, o3 (OpenAI) ranked the final diagnosis first in 60% of cases and within the top ten in 84% of cases, outperforming a 20-physician baseline; next-test selection accuracy reached 98%. Event-level physician annotations quantified AI diagnostic accuracy per unit of information. Performance was lower on literature search and image tasks; o3 and Gemini 2.5 Pro (Google) achieved 67% accuracy on image challenges. In blinded comparisons of CaBot vs. human expert-generated text, physicians misclassified the source of the differential in 46 of 62 (74%) trials, and scored CaBot more favorably across quality dimensions. To promote research, we are releasing CaBot and CPC-Bench. CONCLUSIONS: LLMs exceed physician performance on complex text-based differential diagnosis and convincingly emulate expert medical presentations, but image interpretation and literature retrieval remain weaker. CPC-Bench and CaBot may enable transparent and continued tracking of progress in medical AI.


Low-code to fight climate change: the Climaborough project

Conrardy, Aaron, Sulejmani, Armen, Guerlain, Cindy, Pagani, Daniele, Hick, David, Satta, Matteo, Cabot, Jordi

arXiv.org Artificial Intelligence

The EU-funded Climaborough project supports European cities to achieve carbon neutrality by 2030. Eleven cities in nine countries will deploy in real conditions products and services fostering climate transition in their local environment. The Climaborough City Platform is being developed to monitor the cities' overall progress towards their climate goals by aggregating historic and real-time data and displaying the results in user-friendly dashboards that will be used by non-technical experts to evaluate the effectiveness of local experimental initiatives, identify those that yield significant impact, and assess the potential consequences of scaling them up to a broader level. In this paper, we explain how we have put in place a low-code/no-code strategy in Climaborough in response to the project's aim to quickly deploy climate dashboards. A low-code strategy is used to accelerate the development of the dashboards. The dashboards embed a no-code philosophy that enables all types of citizen profiles to configure and adapt the dashboard to their specific needs.


The Series' Second Movie Beat Citizen Kane on Rotten Tomatoes. The New One Is a Whole Different Animal.

Slate

The past decade has brought the world a lot of political and economic chaos, but in its defense, that same span of time has also given us the Paddington Bear movies. With those two London-set adventures, a mix of animation (Paddington) and live action (everyone else), director Paul King created a loopy world all his own, as cozy and visually pleasing as a dollhouse. The Paddington films were also refreshingly gentle, with moral messages that emerged not from preachy dialogue but from their ursine protagonist's unassuming goodness. And Ben Whishaw's voice performance as the unfailingly polite, naively bumbling bear is one of the all-time great matches between actor and animated character, up there with Tom Hanks' Woody in the Toy Story films: Whishaw quite simply is Paddington, and the completeness and believability of his characterization would have set the films apart even without their droll scripts and all-in supporting casts. The third film in the series, Paddington in Peru, ran a high risk of becoming a shark-jumping sequel, with King and his co-writers now replaced by first-time feature director Dougal Wilson and a new writing team consisting of Mark Burton, Jon Foster, and James Lamont.


Biometrics, AI, machine learning innovations to boost gaming industry growth

#artificialintelligence

Casino executives, industry analysts and lawyers attended a conference at the UNLV Boyd School of Law to consult on how biometrics, AI and machine learning could shape the future of Las Vegas casinos, writes the Nevada Independent. While there are many opportunities for the gaming industry, most machine learning and facial recognition-enabled product ideas addressed customer service and customer recognition. These include slot machines that leverage facial biometrics to recognize important or banned players, and reduce fraud attempts, or facial recognition-equipped tables to help pit managers identify and track known players. "What we're seeing is this introduction of technology into the gaming industry in ways we've never seen before, and because of it, it started to raise issues -- or questions -- as to how this works and what the ramifications could be for things like patron privacy, anonymity and data protection," said Anthony Cabot, Distinguished Fellow in Gaming Law at the UNLV Boyd School of Law and event organizer. While speakers focused on presentations about competing laws and technology problems, there was not enough discussion on how to solve these problems, according to the report, yet Cabot hopes the gaming industry and regulators will join forces to deliver solutions.


Questions Arising from a Proto-Neural Cognitive Architecture

Huyck, Christian Robert (Middlesex University) | Byrne, Emma Louise (Middlesex University)

AAAI Conferences

A neural cognitive architecture would be an architecture based on simulated neurons, that provided a set of mechanisms for all cognitive behaviour. Moreover, this would be compatible with biological neural behaviour. As a result, such architectures can both form the basis of a fully-fledged AI and help to explain how cognition emerges from a collection of neurons in the human brain. The development of such a neural cognitive architecture is in its infancy, but a proto-architecture in the form of behaving agents entirely based on simulated neurons is described. These agents take natural language commands, view the environment, plan and act. The development of these agents has led to a series of questions that need to be addressed to advance the development of neural cognitive architectures. These questions include long posed ones where progress has been made, such as the binding and symbol grounding problems; issues about biological architectures including neural models and brain topology; issues of emergent behaviour such as short and long-term Cell Assembly dynamics; and issues of learning such as the stability-plasticity dilemma. These questions can act as a road map for the development of neural cognitive architectures and AIs based on them.