PadChest-GR: A Bilingual Chest X-ray Dataset for Grounded Radiology Report Generation
Castro, Daniel C., Bustos, Aurelia, Bannur, Shruthi, Hyland, Stephanie L., Bouzid, Kenza, Wetscherek, Maria Teodora, Sánchez-Valverde, Maria Dolores, Jaques-Pérez, Lara, Pérez-Rodríguez, Lourdes, Takeda, Kenji, Salinas, José María, Alvarez-Valle, Javier, Herrero, Joaquín Galant, Pertusa, Antonio
Radiology report generation (RRG) aims to create free-text radiology reports from clinical imaging. Grounded radiology report generation (GRRG) extends RRG by including the localisation of individual findings on the image. Currently, there are no manually annotated chest X-ray (CXR) datasets to train GRRG models. In this work, we present a dataset called PadChest-GR (Grounded-Reporting), derived from PadChest and aimed at training GRRG models for CXR images. We curate a public bilingual dataset of 4,555 CXR studies with grounded reports (3,099 abnormal and 1,456 normal), each containing complete lists of sentences describing individual present (positive) and absent (negative) findings in English and Spanish. In total, PadChest-GR contains 7,037 positive and 3,422 negative finding sentences. Every positive finding sentence is associated with up to two independent sets of bounding boxes labelled by different readers and has categorical labels for finding type, locations, and progression. To the best of our knowledge, PadChest-GR is the first manually curated dataset designed to train GRRG models for understanding and interpreting radiological images and generated text. By including detailed localisation and comprehensive annotations of all clinically relevant findings, it provides a valuable resource for developing and evaluating GRRG models from CXR images. PadChest-GR can be downloaded upon request from https://bimcv.cipf.es/bimcv-projects/padchest-gr/
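The abstract does not specify the released file format or schema. As a rough illustration only, the sketch below shows one way a single grounded-report entry, with bilingual finding sentences, categorical labels, and per-reader bounding boxes, could be represented in code. All field names (study_id, boxes_reader_1, etc.) are assumptions for this sketch, not the actual PadChest-GR schema.

```python
# Hypothetical sketch of one PadChest-GR study entry as a Python data structure.
# Field names and types are illustrative assumptions; the real dataset schema may differ.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in image coordinates


@dataclass
class Finding:
    sentence_en: str                      # finding sentence in English
    sentence_es: str                      # same finding sentence in Spanish
    is_positive: bool                     # present (positive) vs. absent (negative) finding
    finding_type: Optional[str] = None    # categorical finding label (positive findings only)
    locations: List[str] = field(default_factory=list)
    progression: Optional[str] = None
    # Up to two independent sets of boxes, one per reader; empty for negative findings.
    boxes_reader_1: List[Box] = field(default_factory=list)
    boxes_reader_2: List[Box] = field(default_factory=list)


@dataclass
class GroundedStudy:
    study_id: str
    image_path: str
    is_normal: bool
    findings: List[Finding]


# Example instantiation with made-up values:
study = GroundedStudy(
    study_id="example-0001",
    image_path="images/example-0001.png",
    is_normal=False,
    findings=[
        Finding(
            sentence_en="Cardiomegaly.",
            sentence_es="Cardiomegalia.",
            is_positive=True,
            finding_type="cardiomegaly",
            locations=["cardiac silhouette"],
            boxes_reader_1=[(120, 300, 620, 710)],
        )
    ],
)
```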
Learning Eligibility in Clinical Cancer Trials using Deep Neural Networks
Bustos, Aurelia, Pertusa, Antonio
Interventional clinical cancer trials are generally too restrictive, and cancer patients are often excluded from them on the basis of comorbidities, past or concomitant treatments, or being over a certain age. The efficacy and safety of new treatments for patients with these characteristics are therefore not defined. In this work, we build a model to automatically predict whether short clinical statements are considered inclusion or exclusion criteria. We used cancer clinical trial protocols that have been available in public registries for the last 18 years to train word embeddings, and constructed a dataset of 6M short free-text statements labeled as eligible or not eligible. We then trained and validated a text classifier, using deep neural networks with pre-trained word embeddings as inputs, to predict whether or not short free-text statements describing clinical information are considered eligible. The best model achieved an F-measure of 0.92 and almost perfect agreement on a validation set of 800K labeled statements. The trained model was also tested on an independent set of clinical statements mimicking those used in routine clinical practice, yielding consistent performance. We additionally analyzed the semantic relationships captured by the word-embedding representations and were able to identify equivalent treatments for a type of tumor by analogy with the drugs used to treat other tumors. The present work shows that representation learning using neural networks can be successfully leveraged to extract the medical knowledge available in clinical trial protocols and potentially assist practitioners when prescribing treatments.
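As a rough illustration of the kind of classifier the abstract describes, a deep neural network that takes pre-trained word embeddings as input and labels short statements as eligible or not, the following PyTorch sketch builds a small convolutional classifier over frozen embeddings. The vocabulary size, embedding dimension, and network layout are illustrative assumptions, not the authors' actual model.

```python
# Minimal sketch, assuming a CNN over frozen pre-trained word embeddings;
# this is not the authors' exact architecture.
import torch
import torch.nn as nn

VOCAB_SIZE = 50_000   # assumed vocabulary built from the trial protocols
EMBED_DIM = 300       # assumed dimension of the pre-trained embeddings

# Placeholder for pre-trained vectors (per the abstract, trained on 18 years of
# cancer-trial protocols); random values stand in here.
pretrained = torch.randn(VOCAB_SIZE, EMBED_DIM)


class EligibilityClassifier(nn.Module):
    """Binary text classifier over frozen pre-trained word embeddings."""

    def __init__(self, pretrained_vectors: torch.Tensor):
        super().__init__()
        self.embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)
        self.conv = nn.Conv1d(EMBED_DIM, 128, kernel_size=5)
        self.fc = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) integer-encoded clinical statement
        x = self.embedding(token_ids)      # (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)              # (batch, embed_dim, seq_len)
        x = torch.relu(self.conv(x))       # (batch, 128, seq_len - 4)
        x = x.max(dim=2).values            # global max pooling over time
        return self.fc(x).squeeze(-1)      # logit: eligible vs. not eligible


model = EligibilityClassifier(pretrained)
loss_fn = nn.BCEWithLogitsLoss()

# One illustrative training step with dummy data:
tokens = torch.randint(0, VOCAB_SIZE, (32, 40))   # batch of 32 statements, 40 tokens each
labels = torch.randint(0, 2, (32,)).float()       # 1 = eligible, 0 = not eligible
loss = loss_fn(model(tokens), labels)
loss.backward()
```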