Goto

Collaborating Authors

 Niue


Impact of clinical decision support systems (cdss) on clinical outcomes and healthcare delivery in low- and middle-income countries: protocol for a systematic review and meta-analysis

Jain, Garima, Bodade, Anand, Pati, Sanghamitra

arXiv.org Artificial Intelligence

Clinical decision support systems (CDSS) are used to improve clinical and service outcomes, yet evidence from low- and middle-income countries (LMICs) is dispersed. This protocol outlines methods to quantify the impact of CDSS on patient and healthcare delivery outcomes in LMICs. We will include comparative quantitative designs (randomized trials, controlled before-after, interrupted time series, comparative cohorts) evaluating CDSS in World Bank-defined LMICs. Standalone qualitative studies are excluded; mixed-methods studies are eligible only if they report comparative quantitative outcomes, for which we will extract the quantitative component. Searches (from inception to 30 September 2024) will cover MEDLINE, Embase, CINAHL, CENTRAL, Web of Science, Global Health, Scopus, IEEE Xplore, LILACS, African Index Medicus, and IndMED, plus grey sources. Screening and extraction will be performed in duplicate. Risk of bias will be assessed with RoB 2 (randomized trials) and ROBINS-I (non-randomized). Random-effects meta-analysis will be performed where outcomes are conceptually or statistically comparable; otherwise, a structured narrative synthesis will be presented. Heterogeneity will be explored using relative and absolute metrics and a priori subgroups or meta-regression (condition area, care level, CDSS type, readiness proxies, study design).


Compress to Impress: Efficient LLM Adaptation Using a Single Gradient Step on 100 Samples

Sreeram, Shiva, Maalouf, Alaa, Sharma, Pratyusha, Rus, Daniela

arXiv.org Artificial Intelligence

Recently, Sharma et al. suggested a method called Layer-SElective-Rank reduction (LASER) which demonstrated that pruning high-order components of carefully chosen LLM's weight matrices can boost downstream accuracy -- without any gradient-based fine-tuning. Yet LASER's exhaustive, per-matrix search (each requiring full-dataset forward passes) makes it impractical for rapid deployment. We demonstrate that this overhead can be removed and find that: (i) Only a small, carefully chosen subset of matrices needs to be inspected -- eliminating the layer-by-layer sweep, (ii) The gradient of each matrix's singular values pinpoints which matrices merit reduction, (iii) Increasing the factorization search space by allowing matrices rows to cluster around multiple subspaces and then decomposing each cluster separately further reduces overfitting on the original training data and further lifts accuracy by up to 24.6 percentage points, and finally, (iv) we discover that evaluating on just 100 samples rather than the full training data -- both for computing the indicative gradients and for measuring the final accuracy -- suffices to further reduce the search time; we explain that as adaptation to downstream tasks is dominated by prompting style, not dataset size. As a result, we show that combining these findings yields a fast and robust adaptation algorithm for downstream tasks. Overall, with a single gradient step on 100 examples and a quick scan of the top candidate layers and factorization techniques, we can adapt LLMs to new datasets -- entirely without fine-tuning.


Evaluating Large Language Models for IUCN Red List Species Information

Uryu, Shinya

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are rapidly being adopted in conservation to address the biodiversity crisis, yet their reliability for species evaluation is uncertain. This study systematically validates five leading models on 21,955 species across four core IUCN Red List assessment components: taxonomy, conservation status, distribution, and threats. A critical paradox was revealed: models excelled at taxonomic classification (94.9%) but consistently failed at conservation reasoning (27.2% for status assessment). This knowledge-reasoning gap, evident across all models, suggests inherent architectural constraints, not just data limitations. Furthermore, models exhibited systematic biases favoring charismatic vertebrates, potentially amplifying existing conservation inequities. These findings delineate clear boundaries for responsible LLM deployment: they are powerful tools for information retrieval but require human oversight for judgment-based decisions. A hybrid approach is recommended, where LLMs augment expert capacity while human experts retain sole authority over risk assessment and policy.


Deep learning four decades of human migration

Gaskin, Thomas, Abel, Guy J.

arXiv.org Artificial Intelligence

W e present a novel and detailed dataset on origin-destination annual migration flows and stocks between 230 countries and regions, spanning the period from 1990 to the present. Our flow estimates are further disaggregated by country of birth, providing a comprehensive picture of migration over the last 35 years. The estimates are obtained by training a deep recurrent neural network to learn flow patterns from 18 covariates for all countries, including geographic, economic, cultural, societal, and political information. The recurrent architecture of the neural network means that the entire past can influence current migration patterns, allowing us to learn long-range temporal correlations. By training an ensemble of neural networks and additionally pushing uncertainty on the covariates through the trained network, we obtain confidence bounds for all our estimates, allowing researchers to pinpoint the geographic regions most in need of additional data collection. W e validate our approach on various test sets of unseen data, demonstrating that it significantly outperforms traditional methods estimating five-year flows while delivering a significant increase in temporal resolution. The model is fully open source: all training data, neural network weights, and training code are made public alongside the migration estimates, providing a valuable resource for future studies of human migration.


An Expanded Massive Multilingual Dataset for High-Performance Language Technologies

Burchell, Laurie, de Gibert, Ona, Arefyev, Nikolay, Aulamo, Mikko, Bañón, Marta, Chen, Pinzhen, Fedorova, Mariia, Guillou, Liane, Haddow, Barry, Hajič, Jan, Helcl, Jindřich, Henriksson, Erik, Klimaszewski, Mateusz, Komulainen, Ville, Kutuzov, Andrey, Kytöniemi, Joona, Laippala, Veronika, Mæhlum, Petter, Malik, Bhavitvya, Mehryary, Farrokh, Mikhailov, Vladislav, Moghe, Nikita, Myntti, Amanda, O'Brien, Dayyán, Oepen, Stephan, Pal, Proyag, Piha, Jousia, Pyysalo, Sampo, Ramírez-Sánchez, Gema, Samuel, David, Stepachev, Pavel, Tiedemann, Jörg, Variš, Dušan, Vojtěchová, Tereza, Zaragoza-Bernabeu, Jaume

arXiv.org Artificial Intelligence

Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual monolingual and parallel corpora. The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. We document the entire data pipeline and release the code to reproduce it. We provide extensive analysis of the quality and characteristics of our data. Finally, we evaluate the performance of language models and machine translation systems trained on HPLT v2, demonstrating its value.


Programming by Examples Meets Historical Linguistics: A Large Language Model Based Approach to Sound Law Induction

Naik, Atharva, Agrawal, Darsh, Sng, Hong, Marr, Clayton, Zhang, Kexun, Robinson, Nathaniel R, Chang, Kalvin, Byrnes, Rebecca, Mysore, Aravind, Rose, Carolyn, Mortensen, David R

arXiv.org Artificial Intelligence

Historical linguists have long written "programs" that convert reconstructed words in an ancestor language into their attested descendants via ordered string rewrite functions (called sound laws) However, writing these programs is time-consuming, motivating the development of automated Sound Law Induction (SLI) which we formulate as Programming by Examples (PBE) with Large Language Models (LLMs) in this paper. While LLMs have been effective for code generation, recent work has shown that PBE is challenging but improvable by fine-tuning, especially with training data drawn from the same distribution as evaluation data. In this paper, we create a conceptual framework of what constitutes a "similar distribution" for SLI and propose four kinds of synthetic data generation methods with varying amounts of inductive bias to investigate what leads to the best performance. Based on the results we create a SOTA open-source model for SLI as PBE (+6% pass rate with a third of the parameters of the second-best LLM) and also highlight exciting future directions for PBE research.


MIRAI: Evaluating LLM Agents for Event Forecasting

Ye, Chenchen, Hu, Ziniu, Deng, Yihe, Huang, Zijie, Ma, Mingyu Derek, Zhu, Yanqiao, Wang, Wei

arXiv.org Artificial Intelligence

Recent advancements in Large Language Models (LLMs) have empowered LLM agents to autonomously collect world information, over which to conduct reasoning to solve complex problems. Given this capability, increasing interests have been put into employing LLM agents for predicting international events, which can influence decision-making and shape policy development on an international scale. Despite such a growing interest, there is a lack of a rigorous benchmark of LLM agents' forecasting capability and reliability. To address this gap, we introduce MIRAI, a novel benchmark designed to systematically evaluate LLM agents as temporal forecasters in the context of international events. Our benchmark features an agentic environment with tools for accessing an extensive database of historical, structured events and textual news articles. We refine the GDELT event database with careful cleaning and parsing to curate a series of relational prediction tasks with varying forecasting horizons, assessing LLM agents' abilities from short-term to long-term forecasting. We further implement APIs to enable LLM agents to utilize different tools via a code-based interface. In summary, MIRAI comprehensively evaluates the agents' capabilities in three dimensions: 1) autonomously source and integrate critical information from large global databases; 2) write codes using domain-specific APIs and libraries for tool-use; and 3) jointly reason over historical knowledge from diverse formats and time to accurately predict future events. Through comprehensive benchmarking, we aim to establish a reliable framework for assessing the capabilities of LLM agents in forecasting international events, thereby contributing to the development of more accurate and trustworthy models for international relation analysis.


modeLing: A Novel Dataset for Testing Linguistic Reasoning in Language Models

Chi, Nathan A., Malchev, Teodor, Kong, Riley, Chi, Ryan A., Huang, Lucas, Chi, Ethan A., McCoy, R. Thomas, Radev, Dragomir

arXiv.org Artificial Intelligence

We introduce modeLing, a novel benchmark of Linguistics Olympiad-style puzzles which tests few-shot reasoning in AI systems. Solving these puzzles necessitates inferring aspects of a language's grammatical structure from a small number of examples. Such puzzles provide a natural testbed for language models, as they require compositional generalization and few-shot inductive reasoning. Consisting solely of new puzzles written specifically for this work, modeLing has no risk of appearing in the training data of existing AI systems: this ameliorates the risk of data leakage, a potential confounder for many prior evaluations of reasoning. Evaluating several large open source language models and GPT on our benchmark, we observe non-negligible accuracy, demonstrating few-shot emergent reasoning ability which cannot merely be attributed to shallow memorization. However, imperfect model performance suggests that modeLing can be used to measure further progress in linguistic reasoning.


Can Large Language Models Code Like a Linguist?: A Case Study in Low Resource Sound Law Induction

Naik, Atharva, Zhang, Kexun, Robinson, Nathaniel, Mysore, Aravind, Marr, Clayton, Byrnes, Hong Sng Rebecca, Cai, Anna, Chang, Kalvin, Mortensen, David

arXiv.org Artificial Intelligence

Historical linguists have long written a kind of incompletely formalized ''program'' that converts reconstructed words in an ancestor language into words in one of its attested descendants that consist of a series of ordered string rewrite functions (called sound laws). They do this by observing pairs of words in the reconstructed language (protoforms) and the descendent language (reflexes) and constructing a program that transforms protoforms into reflexes. However, writing these programs is error-prone and time-consuming. Prior work has successfully scaffolded this process computationally, but fewer researchers have tackled Sound Law Induction (SLI), which we approach in this paper by casting it as Programming by Examples. We propose a language-agnostic solution that utilizes the programming ability of Large Language Models (LLMs) by generating Python sound law programs from sound change examples. We evaluate the effectiveness of our approach for various LLMs, propose effective methods to generate additional language-agnostic synthetic data to fine-tune LLMs for SLI, and compare our method with existing automated SLI methods showing that while LLMs lag behind them they can complement some of their weaknesses.


Ranking Entities along Conceptual Space Dimensions with LLMs: An Analysis of Fine-Tuning Strategies

Kumar, Nitesh, Chatterjee, Usashi, Schockaert, Steven

arXiv.org Artificial Intelligence

Conceptual spaces represent entities in terms of their primitive semantic features. Such representations are highly valuable but they are notoriously difficult to learn, especially when it comes to modelling perceptual and subjective features. Distilling conceptual spaces from Large Language Models (LLMs) has recently emerged as a promising strategy, but existing work has been limited to probing pre-trained LLMs using relatively simple zero-shot strategies. We focus in particular on the task of ranking entities according to a given conceptual space dimension. Unfortunately, we cannot directly fine-tune LLMs on this task, because ground truth rankings for conceptual space dimensions are rare. We therefore use more readily available features as training data and analyse whether the ranking capabilities of the resulting models transfer to perceptual and subjective features. We find that this is indeed the case, to some extent, but having at least some perceptual and subjective features in the training data seems essential for achieving the best results.