In this paper, we focus on how language technologies connect to and influence research in neurolinguistics. We present a review of brain imaging-based neurolinguistic studies with a focus on natural language representations, such as word embeddings and pre-trained language models. The mutual enrichment of neurolinguistics and language technologies leads to the development of brain-aware natural language representations. The importance of this research area is emphasized by medical applications.
Reaching a global view of brain organization requires assembling evidence on widely different mental processes and mechanisms. The variety of human neuroscience concepts and terminology poses a fundamental challenge to relating brain imaging results across the scientific literature. Existing meta-analysis methods perform statistical tests on sets of publications associated with a particular concept. Thus, large-scale meta-analyses only tackle single terms that occur frequently. We propose a new paradigm, focusing on prediction rather than inference. Our multivariate model predicts the spatial distribution of neurological observations, given text describing an experiment, cognitive process, or disease. This approach handles text of arbitrary length and terms that are too rare for standard meta-analysis. We capture the relationships and neural correlates of 7 547 neuroscience terms across 13 459 neuroimaging publications. The resulting meta-analytic tool, neuroquery.org, can ground hypothesis generation and data-analysis priors on a comprehensive view of published findings on the brain.
This paper introduces a novel classification method for functional magnetic resonance imaging datasets with tens of classes. The method is designed to make predictions using information from as many brain locations as possible, instead of resorting to feature selection, and does this by decomposing the pattern of brain activation into differently informative sub-regions. We provide results over a complex semantic processing dataset that show that the method is competitive with state-of-the-art feature selection and also suggest how the method may be used to perform group or exploratory analyses of complex class structure. Papers published at the Neural Information Processing Systems Conference.
Machine learning methods have recently achieved high performance in biomedical text analysis. However, a major bottleneck in the widespread application of these methods is obtaining the required large amounts of annotated training data, which is resource-intensive and time-consuming. Recent progress in self-supervised learning has shown promise in leveraging large text corpora without explicit annotations. In this work, we built a self-supervised contextual language representation model using BERT, a deep bidirectional transformer architecture, to identify radiology reports requiring prompt communication to the referring physicians. We pre-trained the BERT model on a large unlabeled corpus of radiology reports and used the resulting contextual representations in a final text classifier for communication urgency. Our model achieved a precision of 97.0%, recall of 93.3%, and F-measure of 95.1% on an independent test set in identifying radiology reports for prompt communication, and significantly outperformed the previous state-of-the-art model based on word2vec representations.
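The three figures reported above are mutually consistent if the F-measure is taken to be the balanced F1 score, i.e. the harmonic mean of precision and recall. That reading is an assumption, since the abstract does not name the variant. A minimal check, with the helper name `f_measure` chosen here for illustration:

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall.

    Assumes the abstract's "F-measure" is F1; the source does not say so.
    """
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both inputs are zero
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
score = f_measure(0.970, 0.933)
print(f"{score:.3f}")  # prints 0.951
```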
Decision support tools that rely on supervised learning require large amounts of expert annotations. Using past radiological reports obtained from hospital archiving systems as training data has many advantages over manual single-class labels: they are expert annotations available in large quantities, covering a population-representative variety of pathologies, and they provide additional context to pathology diagnoses, such as anatomical location and severity. Learning to auto-generate such reports from images presents many challenges, such as the difficulty of representing and generating long, unstructured textual information, accounting for spelling errors and repetition/redundancy, and the inconsistency across different annotators. We therefore propose to first learn visually-informative medical concepts from raw reports, and, using the concept predictions as image annotations, learn to auto-generate structured reports directly from images. We validate our approach on the OpenI chest x-ray dataset, which consists of frontal and lateral views of chest x-ray images, their corresponding raw textual reports and manual medical subject heading (MeSH) annotations made by radiologists.
This work investigates multiple approaches to Named Entity Recognition (NER) for text in Electronic Health Record (EHR) data. In particular, we look into the application of (i) rule-based, (ii) deep learning and (iii) transfer learning systems for the task of NER on brain imaging reports with a focus on records from patients with stroke. We explore the strengths and weaknesses of each approach, develop rules and train on a common dataset, and evaluate each system's performance on common test sets of Scottish radiology reports from two sources (brain imaging reports in ESS -- Edinburgh Stroke Study data collected by NHS Lothian as well as radiology reports created in NHS Tayside). Our comparison shows that a hand-crafted system is the most accurate way to automatically label EHR, but machine learning approaches can provide a feasible alternative where resources for a manual system are not readily available.
Outlier detection has been studied extensively and employed in diverse applications in the past decades. In this paper we formulate a related yet understudied problem which we call outlier description. This problem often arises in practice when we have a small number of data instances that have been identified as outliers and we wish to explain why they are outliers. We propose a framework based on constraint programming to find an optimal subset of features that most differentiates the outliers from the normal instances. We further demonstrate that the framework offers great flexibility in incorporating diverse scenarios arising in practice, such as multiple explanations and human-in-the-loop extensions. We empirically evaluate our proposed framework on real datasets, including medical imaging and text corpora, and demonstrate how the results are useful and interpretable in these domains.
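The outlier description problem above can be illustrated with a toy sketch. This is not the paper's constraint programming formulation: it swaps in an exhaustive search over feature subsets of a fixed size `k`, and the separation objective (the smallest squared distance between any outlier and any normal instance, measured only on the chosen features) is a hypothetical choice made for illustration. The function names and data are likewise invented.

```python
from itertools import combinations


def separation(subset, outliers, normals):
    """Smallest squared distance between any outlier and any normal
    instance, using only the features in `subset` (hypothetical objective)."""
    return min(
        sum((o[j] - n[j]) ** 2 for j in subset)
        for o in outliers
        for n in normals
    )


def describe_outliers(outliers, normals, k):
    """Exhaustively pick the size-k feature subset with the largest
    separation. A brute-force stand-in for a constraint solver."""
    n_features = len(normals[0])
    return max(
        combinations(range(n_features), k),
        key=lambda s: separation(s, outliers, normals),
    )


# Invented toy data: three features, outliers differ only on feature 1.
normals = [(1.0, 5.0, 0.2), (1.1, 5.2, 0.1), (0.9, 4.9, 0.3)]
outliers = [(1.0, 9.0, 0.2), (1.05, 8.5, 0.25)]

best = describe_outliers(outliers, normals, k=1)
print(best)  # prints (1,)
```

Exhaustive search grows combinatorially with the number of features, which is one reason a dedicated solver is attractive for realistic problem sizes.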