Collaborating Authors


Alchemab Therapeutics Ltd hiring Machine Learning Scientist in Babraham, England, United Kingdom


The candidate will work within the technology team to develop, apply, and design novel machine learning (ML) algorithms with the ultimate aim of discovering therapeutic antibodies from next-generation sequencing (NGS) datasets. The candidate will be involved in multiple projects spanning our oncology, neuroscience, and infectious disease programmes. You will be responsible for the growth and development of our ML product roadmap. This will initially focus on exploiting methods in natural language processing for antibody discovery and patient stratification, and exploring the latest advances in ML (in areas such as self-supervised learning) to extend our capabilities. You will contribute new algorithms and strategies to increase accuracy, explainability, and/or automation of our technology platform.

Forecasting: theory and practice Machine Learning

Forecasting has always been at the forefront of decision making and planning. The uncertainty that surrounds the future is both exciting and challenging, with individuals and organisations seeking to minimise risks and maximise utilities. The large number of forecasting applications calls for a diverse set of forecasting methods to tackle real-life challenges. This article provides a non-systematic review of the theory and the practice of forecasting. We provide an overview of a wide range of theoretical, state-of-the-art models, methods, principles, and approaches to prepare, produce, organise, and evaluate forecasts. We then demonstrate how such theoretical concepts are applied in a variety of real-life contexts. We do not claim that this review is an exhaustive list of methods and applications. However, we wish that our encyclopedic presentation will offer a point of reference for the rich work that has been undertaken over the last decades, with some key insights for the future of forecasting theory and practice. Given its encyclopedic nature, the intended mode of reading is non-linear. We offer cross-references to allow the readers to navigate through the various topics. We complement the theoretical concepts and applications covered by large lists of free or open-source software implementations and publicly-available databases.

New model improves accuracy of machine learning in COVID-19 diagnosis while preserving privacy


Researchers in the UK and China have developed an artificial intelligence (AI) model that can diagnose COVID-19 as well as a panel of professional radiologists, while preserving the privacy of patient data. The international team, led by the University of Cambridge and the Huazhong University of Science and Technology, used a technique called federated learning to build their model. Using federated learning, an AI model in one hospital or country can be independently trained and verified using a dataset from another hospital or country, without data sharing. The researchers based their model on more than 9,000 CT scans from approximately 3,300 patients in 23 hospitals in the UK and China. Their results, reported in the journal Nature Machine Intelligence, provide a framework where AI techniques can be made more trustworthy and accurate, especially in areas such as medical diagnosis where privacy is vital.

NLP Methods for Extraction of Symptoms from Unstructured Data for Use in Prognostic COVID-19 Analytic Models

Journal of Artificial Intelligence Research

Statistical modeling of outcomes based on a patient's presenting symptoms (symptomatology) can help deliver high quality care and allocate essential resources, which is especially important during the COVID-19 pandemic. Patient symptoms are typically found in unstructured notes, and thus not readily available for clinical decision making. In an attempt to fill this gap, this study compared two methods for symptom extraction from Emergency Department (ED) admission notes. Both methods utilized a lexicon derived by expanding The Center for Disease Control and Prevention's (CDC) Symptoms of Coronavirus list. The first method utilized a word2vec model to expand the lexicon using a dictionary mapping to the Uni ed Medical Language System (UMLS). The second method utilized the expanded lexicon as a rule-based gazetteer and the UMLS. These methods were evaluated against a manually annotated reference (f1-score of 0.87 for UMLS-based ensemble; and 0.85 for rule-based gazetteer with UMLS). Through analyses of associations of extracted symptoms used as features against various outcomes, salient risks among the population of COVID-19 patients, including increased risk of in-hospital mortality (OR 1.85, p-value < 0.001), were identified for patients presenting with dyspnea. Disparities between English and non-English speaking patients were also identified, the most salient being a concerning finding of opposing risk signals between fatigue and in-hospital mortality (non-English: OR 1.95, p-value = 0.02; English: OR 0.63, p-value = 0.01). While use of symptomatology for modeling of outcomes is not unique, unlike previous studies this study showed that models built using symptoms with the outcome of in-hospital mortality were not significantly different from models using data collected during an in-patient encounter (AUC of 0.9 with 95% CI of [0.88, 0.91] using only vital signs; AUC of 0.87 with 95% CI of [0.85, 0.88] using only symptoms). These findings indicate that prognostic models based on symptomatology could aid in extending COVID-19 patient care through telemedicine, replacing the need for in-person options. The methods presented in this study have potential for use in development of symptomatology-based models for other diseases, including for the study of Post-Acute Sequelae of COVID-19 (PASC).

Anticipating Safety Issues in E2E Conversational AI: Framework and Tooling Artificial Intelligence

Over the last several years, end-to-end neural conversational agents have vastly improved in their ability to carry a chit-chat conversation with humans. However, these models are often trained on large datasets from the internet, and as a result, may learn undesirable behaviors from this data, such as toxic or otherwise harmful language. Researchers must thus wrestle with the issue of how and when to release these models. In this paper, we survey the problem landscape for safety for end-to-end conversational AI and discuss recent and related work. We highlight tensions between values, potential positive impact and potential harms, and provide a framework for making decisions about whether and how to release these models, following the tenets of value-sensitive design. We additionally provide a suite of tools to enable researchers to make better-informed decisions about training and releasing end-to-end conversational AI models.

"Thought I'd Share First": An Analysis of COVID-19 Conspiracy Theories and Misinformation Spread on Twitter Machine Learning

Background: Misinformation spread through social media is a growing problem, and the emergence of COVID-19 has caused an explosion in new activity and renewed focus on the resulting threat to public health. Given this increased visibility, in-depth analysis of COVID-19 misinformation spread is critical to understanding the evolution of ideas with potential negative public health impact. Methods: Using a curated data set of COVID-19 tweets (N ~120 million tweets) spanning late January to early May 2020, we applied methods including regular expression filtering, supervised machine learning, sentiment analysis, geospatial analysis, and dynamic topic modeling to trace the spread of misinformation and to characterize novel features of COVID-19 conspiracy theories. Results: Random forest models for four major misinformation topics provided mixed results, with narrowly-defined conspiracy theories achieving F1 scores of 0.804 and 0.857, while more broad theories performed measurably worse, with scores of 0.654 and 0.347. Despite this, analysis using model-labeled data was beneficial for increasing the proportion of data matching misinformation indicators. We were able to identify distinct increases in negative sentiment, theory-specific trends in geospatial spread, and the evolution of conspiracy theory topics and subtopics over time. Conclusions: COVID-19 related conspiracy theories show that history frequently repeats itself, with the same conspiracy theories being recycled for new situations. We use a combination of supervised learning, unsupervised learning, and natural language processing techniques to look at the evolution of theories over the first four months of the COVID-19 outbreak, how these theories intertwine, and to hypothesize on more effective public health messaging to combat misinformation in online spaces.

On the Nature and Types of Anomalies: A Review Artificial Intelligence

Anomalies are occurrences in a dataset that are in some way unusual and do not fit the general patterns. The concept of the anomaly is generally ill-defined and perceived as vague and domain-dependent. Moreover, no comprehensive and concrete overviews of the different types of anomalies have hitherto been published. By means of an extensive literature review this study therefore offers the first theoretically principled and domain-independent typology of data anomalies, and presents a full overview of anomaly types and subtypes. To concretely define the concept of the anomaly and its different manifestations the typology employs four dimensions: data type, cardinality of relationship, data structure and data distribution. These fundamental and data-centric dimensions naturally yield 3 broad groups, 9 basic types and 61 subtypes of anomalies. The typology facilitates the evaluation of the functional capabilities of anomaly detection algorithms, contributes to explainable data science, and provides insights into relevant topics such as local versus global anomalies.

GPT-3 Creative Fiction


What if I told a story here, how would that story start?" Thus, the summarization prompt: "My second grader asked me what this passage means: …" When a given prompt isn't working and GPT-3 keeps pivoting into other modes of completion, that may mean that one hasn't constrained it enough by imitating a correct output, and one needs to go further; writing the first few words or sentence of the target output may be necessary.

A data science approach to drug safety: Semantic and visual mining of adverse drug events from clinical trials of pain treatments Artificial Intelligence

Clinical trials are the basis of Evidence-Based Medicine. Trial results are reviewed by experts and consensus panels for producing meta-analyses and clinical practice guidelines. However, reviewing these results is a long and tedious task, hence the meta-analyses and guidelines are not updated each time a new trial is published. Moreover, the independence of experts may be difficult to appraise. On the contrary, in many other domains, including medical risk analysis, the advent of data science, big data and visual analytics allowed moving from expert-based to fact-based knowledge. Since 12 years, many trial results are publicly available online in trial registries. Nevertheless, data science methods have not yet been applied widely to trial data. In this paper, we present a platform for analyzing the safety events reported during clinical trials and published in trial registries. This platform is based on an ontological model including 582 trials on pain treatments, and uses semantic web technologies for querying this dataset at various levels of granularity. It also relies on a 26-dimensional flower glyph for the visualization of the Adverse Drug Events (ADE) rates in 13 categories and 2 levels of seriousness. We illustrate the interest of this platform through several use cases and we were able to find back conclusions that are known in the literature. The platform was presented to four experts in drug safety, and is publicly available online, with the ontology of pain treatment ADE.

An Investigation of COVID-19 Spreading Factors with Explainable AI Techniques Artificial Intelligence

Since COVID-19 was first identified in December 2019, various public health interventions have been implemented across the world. As different measures are implemented at different countries at different times, we conduct an assessment of the relative effectiveness of the measures implemented in 18 countries and regions using data from 22/01/2020 to 02/04/2020. We compute the top one and two measures that are most effective for the countries and regions studied during the period. Two Explainable AI techniques, SHAP and ECPI, are used in our study; such that we construct (machine learning) models for predicting the instantaneous reproduction number ($R_t$) and use the models as surrogates to the real world and inputs that the greatest influence to our models are seen as measures that are most effective. Across-the-board, city lockdown and contact tracing are the two most effective measures. For ensuring $R_t<1$, public wearing face masks is also important. Mass testing alone is not the most effective measure although when paired with other measures, it can be effective. Warm temperature helps for reducing the transmission.