
Collaborating Authors

Oxford


MA-RAG: Multi-Agent Retrieval-Augmented Generation via Collaborative Chain-of-Thought Reasoning

Nguyen, Thang, Chin, Peter, Tai, Yu-Wing

arXiv.org Artificial Intelligence

We present MA-RAG, a Multi-Agent framework for Retrieval-Augmented Generation (RAG) that addresses the inherent ambiguities and reasoning challenges in complex information-seeking tasks. Unlike conventional RAG methods that rely on end-to-end fine-tuning or isolated component enhancements, MA-RAG orchestrates a collaborative set of specialized AI agents: Planner, Step Definer, Extractor, and QA Agents, each responsible for a distinct stage of the RAG pipeline. By decomposing tasks into subtasks such as query disambiguation, evidence extraction, and answer synthesis, and enabling agents to communicate intermediate reasoning via chain-of-thought prompting, MA-RAG progressively refines retrieval and synthesis while maintaining modular interpretability. Extensive experiments on multi-hop and ambiguous QA benchmarks, including NQ, HotpotQA, 2WikimQA, and TriviaQA, demonstrate that MA-RAG significantly outperforms standalone LLMs and existing RAG methods across all model scales. Notably, even a small LLaMA3-8B model equipped with MA-RAG surpasses larger standalone LLMs, while larger variants (LLaMA3-70B and GPT-4o-mini) set new state-of-the-art results on challenging multi-hop datasets. Ablation studies reveal that both the planner and extractor agents are critical for multi-hop reasoning, and that high-capacity models are especially important for the QA agent to synthesize answers effectively. Beyond general-domain QA, MA-RAG generalizes to specialized domains such as medical QA, achieving competitive performance against domain-specific models without any domain-specific fine-tuning. Our results highlight the effectiveness of collaborative, modular reasoning in retrieval-augmented systems: MA-RAG not only improves answer accuracy and robustness but also provides interpretable intermediate reasoning steps, establishing a new paradigm for efficient and reliable multi-agent RAG.
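Below is a minimal, hypothetical sketch of the agent orchestration described in the abstract (Planner, Step Definer, Extractor, QA agent passing intermediate notes). The llm() and retrieve() stubs stand in for a real chat model and dense retriever; the prompts and data flow are illustrative assumptions, not the paper's released code.

```python
# Hypothetical sketch of a MA-RAG-style agent loop, assuming stubbed LLM and retriever.

def llm(prompt: str) -> str:
    """Placeholder for a chat-completion call (e.g. LLaMA3 or GPT-4o-mini)."""
    return f"[model output for: {prompt[:40]}...]"

def retrieve(query: str, k: int = 5) -> list[str]:
    """Placeholder for a dense retriever over the document corpus."""
    return [f"passage {i} for '{query}'" for i in range(k)]

def planner(question: str) -> list[str]:
    # Decompose the (possibly ambiguous) question into ordered sub-questions.
    plan = llm(f"Break this question into sub-questions, one per line:\n{question}")
    return [s for s in plan.splitlines() if s.strip()]

def step_definer(sub_q: str, notes: list[str]) -> str:
    # Rewrite the sub-question into a self-contained retrieval query,
    # conditioning on the chain-of-thought notes gathered so far.
    return llm(f"Given prior findings {notes}, write a search query for: {sub_q}")

def extractor(query: str, passages: list[str]) -> str:
    # Pull only the evidence relevant to the query from the retrieved passages.
    return llm(f"Extract evidence for '{query}' from:\n" + "\n".join(passages))

def qa_agent(question: str, notes: list[str]) -> str:
    # Synthesize the final answer from the accumulated evidence.
    return llm(f"Answer '{question}' using these notes:\n" + "\n".join(notes))

def ma_rag(question: str) -> str:
    notes: list[str] = []
    for sub_q in planner(question):
        query = step_definer(sub_q, notes)
        evidence = extractor(query, retrieve(query))
        notes.append(evidence)  # intermediate reasoning stays inspectable
    return qa_agent(question, notes)

print(ma_rag("Who directed the film that won Best Picture the year Titanic was released?"))
```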


Validation of a CT-brain analysis tool for measuring global cortical atrophy in older patient cohorts

Bal, Sukhdeep, Colbourne, Emma, Gan, Jasmine, Griffanti, Ludovica, Hanayik, Taylor, Demeyere, Nele, Davies, Jim, Pendlebury, Sarah T, Jenkinson, Mark

arXiv.org Artificial Intelligence

Quantification of brain atrophy currently requires visual rating scales, which are time-consuming, so automated brain image analysis is warranted. We validated our automated deep learning (DL) tool measuring the Global Cerebral Atrophy (GCA) score against trained human raters, and its associations with age and cognitive impairment, in representative older (>65 years) patients. CT-brain scans were obtained from patients in acute medicine (ORCHARD-EPR), acute stroke (OCS studies) and a legacy sample. Scans were divided into a 60/20/20 ratio for training, optimisation and testing. CT images were assessed by two trained raters (rater-1=864 scans, rater-2=20 scans). Agreement between DL tool-predicted GCA scores (range 0-39) and the visual ratings was evaluated using mean absolute error (MAE) and Cohen's weighted kappa. Among 864 scans (ORCHARD-EPR=578, OCS=200, legacy scans=86), MAE between the DL tool and rater-1 GCA scores was 3.2 overall, 3.1 for ORCHARD-EPR, 3.3 for OCS and 2.6 for the legacy scans, and half had DL-predicted GCA error between -2 and 2. Inter-rater agreement was kappa=0.45 between the DL tool and rater-1, and 0.41 between the tool and rater-2, whereas it was lower at 0.28 between rater-1 and rater-2. There was no difference in GCA scores between the DL tool and the two raters (one-way ANOVA, p=0.35), or in mean GCA scores between the DL tool and rater-1 (paired t-test, t=-0.43, p=0.66), the tool and rater-2 (t=1.35, p=0.18), or between rater-1 and rater-2 (t=0.99, p=0.32). DL-tool GCA scores correlated with age and cognitive scores (both p<0.001). Our DL CT-brain analysis tool measured the GCA score accurately and without user input in real-world scans acquired from older patients. Our tool will enable extraction of standardised quantitative measures of atrophy at scale for use in health data research and will act as proof-of-concept towards a point-of-care clinically approved tool.
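As a minimal sketch of the agreement metrics reported above (MAE and Cohen's weighted kappa between DL-predicted and visually rated GCA scores on the 0-39 scale): the arrays below are made-up toy scores, and linear weighting is an assumption, since the abstract does not state the weighting scheme.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

dl_scores    = np.array([12, 20, 5, 33, 18, 7])   # hypothetical DL-tool GCA scores
rater_scores = np.array([14, 19, 4, 30, 21, 7])   # hypothetical rater-1 GCA scores

mae = np.mean(np.abs(dl_scores - rater_scores))                     # mean absolute error
kappa = cohen_kappa_score(dl_scores, rater_scores, weights="linear")  # weighted agreement

print(f"MAE = {mae:.1f}, weighted kappa = {kappa:.2f}")
```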


Emerging Semantic Segmentation from Positive and Negative Coarse Label Learning

Zhang, Le, Wu, Fuping, Thirunavukarasu, Arun, Bronik, Kevin, Nichols, Thomas, Papiez, Bartlomiej W.

arXiv.org Artificial Intelligence

Large annotated datasets are vital for training segmentation models, but pixel-level labeling is time-consuming, error-prone, and often requires scarce expert annotators, especially in medical imaging. In contrast, coarse annotations are quicker, cheaper, and easier to produce, even by non-experts. In this paper, we propose to use coarse drawings of both positive (target) and negative (background) classes in the image, even with noisy pixels, to train a convolutional neural network (CNN) for semantic segmentation. We present a method for learning the true segmentation label distributions from purely noisy coarse annotations using two coupled CNNs. The separation of the two CNNs is achieved by enforcing high fidelity to the characteristics of the noisy training annotations. We further add a complementary label learning term that encourages estimation of the negative label distribution. To illustrate the properties of our method, we first use a toy segmentation dataset based on MNIST. We then present quantitative results of experiments on publicly available datasets: the Cityscapes dataset for multi-class segmentation, and retinal images for medical applications. In all experiments, our method outperforms state-of-the-art methods, particularly in cases where the ratio of coarse annotations to dense annotations is small.
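The following is a hedged PyTorch sketch of the complementary-label idea described above: positive coarse strokes get an ordinary cross-entropy-style term, while negative (background) strokes are pushed away from the target class by maximising log(1 - p_target). Tensor shapes, the single-target-class setup, and the equal weighting of the two terms are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def coarse_label_loss(logits, pos_mask, neg_mask, target_class=1, eps=1e-6):
    """logits: (B, C, H, W); pos_mask / neg_mask: (B, H, W) binary coarse strokes."""
    log_probs = F.log_softmax(logits, dim=1)
    probs = log_probs.exp()

    # Positive strokes: maximise log-probability of the target class.
    pos_term = -(log_probs[:, target_class] * pos_mask).sum() / (pos_mask.sum() + eps)

    # Negative strokes (complementary labels): the pixel is *not* the target class,
    # so maximise log(1 - p_target) there.
    neg_term = -((1 - probs[:, target_class] + eps).log() * neg_mask).sum() / (neg_mask.sum() + eps)

    return pos_term + neg_term

# Toy usage with random data.
logits = torch.randn(2, 2, 32, 32, requires_grad=True)
pos = (torch.rand(2, 32, 32) > 0.9).float()
neg = (torch.rand(2, 32, 32) > 0.9).float()
print(coarse_label_loss(logits, pos, neg).item())
```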


Can AI-predicted complexes teach machine learning to compute drug binding affinity?

Hsu, Wei-Tse, Grevtsev, Savva, Douglas, Thomas, Magarkar, Aniket, Biggin, Philip C.

arXiv.org Artificial Intelligence

We evaluate the feasibility of using co-folding models for synthetic data augmentation in training machine learning-based scoring functions (MLSFs) for binding affinity prediction. Our results show that performance gains depend critically on the structural quality of augmented data. In light of this, we established simple heuristics for identifying high-quality co-folding predictions without reference structures, enabling them to substitute for experimental structures in MLSF training. Our study informs future data augmentation strategies based on co-folding models.
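An illustrative sketch only: the abstract says simple heuristics can flag high-quality co-folding predictions without reference structures, but it does not state the criteria. The confidence fields and thresholds below (an ipTM-style interface score, ligand pLDDT) are hypothetical stand-ins for whatever the paper actually uses.

```python
def keep_for_training(prediction: dict,
                      min_interface_conf: float = 0.8,
                      min_ligand_plddt: float = 70.0) -> bool:
    """Decide whether a predicted protein-ligand complex joins the MLSF training set."""
    return (prediction["interface_confidence"] >= min_interface_conf
            and prediction["ligand_plddt"] >= min_ligand_plddt)

predictions = [
    {"id": "cmpd_001", "interface_confidence": 0.91, "ligand_plddt": 82.4},
    {"id": "cmpd_002", "interface_confidence": 0.55, "ligand_plddt": 63.0},
]
augmented = [p["id"] for p in predictions if keep_for_training(p)]
print(augmented)  # -> ['cmpd_001']
```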


The 2025 PNPL Competition: Speech Detection and Phoneme Classification in the LibriBrain Dataset

Landau, Gilad, Özdogan, Miran, Elvers, Gereon, Mantegna, Francesco, Somaiya, Pratik, Jayalath, Dulhan, Kurth, Luisa, Kwon, Teyun, Shillingford, Brendan, Farquhar, Greg, Jiang, Minqi, Jerbi, Karim, Abdelhedi, Hamza, Ramos, Yorguin Mantilla, Gulcehre, Caglar, Woolrich, Mark, Voets, Natalie, Jones, Oiwi Parker

arXiv.org Artificial Intelligence

The advance of speech decoding from non-invasive brain data holds the potential for profound societal impact. Among its most promising applications is the restoration of communication to paralysed individuals affected by speech deficits such as dysarthria, without the need for high-risk surgical interventions. The ultimate aim of the 2025 PNPL competition is to produce the conditions for an "ImageNet moment", or breakthrough, in non-invasive neural decoding by harnessing the collective power of the machine learning community. To facilitate this vision, we present the largest within-subject MEG dataset recorded to date (LibriBrain), together with a user-friendly Python library (pnpl) for easy data access and integration with deep learning frameworks. For the competition we define two foundational tasks (i.e. Speech Detection and Phoneme Classification from brain data), complete with standardised data splits and evaluation metrics, illustrative benchmark models, online tutorial code, a community discussion board, and a public leaderboard for submissions. To promote accessibility and participation, the competition features a Standard track that emphasises algorithmic innovation, as well as an Extended track that is expected to reward larger-scale computing, accelerating progress toward a non-invasive brain-computer interface for speech.
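A generic PyTorch sketch of how the Speech Detection task could be framed (a binary label per MEG window). It deliberately does not use the real pnpl API; the channel count, window length, and model architecture are placeholder assumptions, not competition specifications.

```python
import torch
from torch import nn

N_CHANNELS, WINDOW = 306, 250          # assumed MEG channels x samples per window

class SpeechDetector(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(N_CHANNELS, 64, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
            nn.Linear(64, 1),           # logit: speech vs. non-speech
        )

    def forward(self, x):               # x: (batch, channels, time)
        return self.net(x).squeeze(-1)

model = SpeechDetector()
meg = torch.randn(8, N_CHANNELS, WINDOW)          # fake batch of MEG windows
labels = torch.randint(0, 2, (8,)).float()
loss = nn.functional.binary_cross_entropy_with_logits(model(meg), labels)
print(loss.item())
```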


Can YOU decipher these scrolls? Scientists are offering a £400,000 prize if you can read a manuscript that was charred during the eruption of Mount Vesuvius

Daily Mail - Science & tech

They were turned to carbonized lumps by the catastrophic eruption of Mount Vesuvius in AD 79. Now, scientists are offering £400,000 to the person who can decipher the charred Herculaneum Scrolls. These ancient rolls of papyrus – a material similar to paper – are thought to contain profound philosophical and literary texts from ancient Greek and Roman scholars. The problem is that any attempts to unroll the burnt cylinders will turn them to dust, because they are so fragile. So, scientists have been turning to ingenious methods such as X-ray scanning, ink-detection software and AI to virtually 'unroll' them.


People who had severe covid-19 show cognitive decline years later

New Scientist

The cognitive abilities of people who were hospitalised with covid-19 during the first wave of the pandemic remain lower than expected, even years later, and there is some evidence that this is forcing them to change jobs. "What we found is that the average cognitive deficit was equivalent to 10 IQ points, based on what would be expected for their age, et cetera," says Maxime Taquet at the University of Oxford. Does getting even mild covid-19 affect our cognitive skills? His team looked at 475 people in the UK who had been hospitalised with covid-19 and discharged before 31 March 2021. All had completed psychiatric and cognitive assessments six months after their discharge from hospital as part of another study.


Rapid Biomedical Research Classification: The Pandemic PACT Advanced Categorisation Engine

Rohanian, Omid, Nouriborji, Mohammadmahdi, Seminog, Olena, Furst, Rodrigo, Mendy, Thomas, Levanita, Shanthi, Kadri-Alab, Zaharat, Jabin, Nusrat, Toale, Daniela, Humphreys, Georgina, Antonio, Emilia, Bucher, Adrian, Norton, Alice, Clifton, David A.

arXiv.org Artificial Intelligence

This paper introduces the Pandemic PACT Advanced Categorisation Engine (PPACE) along with its associated dataset. PPACE is a fine-tuned model developed to automatically classify research abstracts from funded biomedical projects according to WHO-aligned research priorities. This task is crucial for monitoring research trends and identifying gaps in global health preparedness and response. Our approach builds on human-annotated projects, which are allocated one or more categories from a predefined list. A large language model is then used to generate 'rationales' explaining the reasoning behind these annotations. This augmented data, comprising expert annotations and rationales, is subsequently used to fine-tune a smaller, more efficient model. Developed as part of the Pandemic PACT project, which aims to track and analyse research funding and clinical evidence for a wide range of diseases with outbreak potential, PPACE supports informed decision-making by research funders, policymakers, and independent researchers. We introduce and release both the trained model and the instruction-based dataset used for its training. Our evaluation shows that PPACE significantly outperforms its baselines. The release of PPACE and its associated dataset offers valuable resources for researchers in multilabel biomedical document classification and supports advancements in aligning biomedical research with key global health priorities.
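A hedged sketch of the data-augmentation step described above: pairing each expert-annotated abstract with an LLM-generated rationale to build an instruction-tuning example for the smaller model. The field names and prompt template are assumptions, not the released dataset's schema.

```python
def build_training_example(abstract: str, categories: list[str], rationale: str) -> dict:
    instruction = (
        "Assign one or more WHO-aligned research priority categories to the "
        "abstract and explain your reasoning."
    )
    return {
        "instruction": instruction,
        "input": abstract,
        "output": f"Categories: {', '.join(categories)}\nRationale: {rationale}",
    }

example = build_training_example(
    abstract="We model transmission dynamics of a novel coronavirus...",
    categories=["Epidemiological studies", "Pathogen natural history"],
    rationale="The project estimates transmission parameters and characterises the pathogen.",
)
print(example["output"])
```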


'Eugenics on steroids': the toxic and contested legacy of Oxford's Future of Humanity Institute

The Guardian

Two weeks ago it was quietly announced that the Future of Humanity Institute, the renowned multidisciplinary research centre in Oxford, no longer had a future. It shut down without warning on 16 April. Initially there was just a brief statement on its website stating it had closed and that its research may continue elsewhere within and outside the university. The institute, which was dedicated to studying existential risks to humanity, was founded in 2005 by the Swedish-born philosopher Nick Bostrom and quickly made a name for itself beyond academic circles – particularly in Silicon Valley, where a number of tech billionaires sang its praises and provided financial support. Bostrom is perhaps best known for his bestselling 2014 book Superintelligence, which warned of the existential dangers of artificial intelligence, but he also gained widespread recognition for his 2003 academic paper "Are You Living in a Computer Simulation?".


Oxford shuts down institute run by Elon Musk-backed philosopher

The Guardian

Oxford University this week shut down an academic institute run by one of Elon Musk's favorite philosophers. The Future of Humanity Institute, dedicated to the long-termism movement and other Silicon Valley-endorsed ideas such as effective altruism, closed this week after 19 years of operation. Musk had donated £1m to the FHI in 2015 through a sister organization to research the threat of artificial intelligence. He had also boosted the ideas of its leader for nearly a decade on X, formerly Twitter. The center was run by Nick Bostrom, a Swedish-born philosopher whose writings about the long-term threat of AI replacing humanity turned him into a celebrity figure among the tech elite and routinely landed him on lists of top global thinkers. OpenAI chief executive Sam Altman, Microsoft founder Bill Gates and Tesla chief Musk all wrote blurbs for his 2014 bestselling book Superintelligence.