Chauhan, Geeticka
Training Large ASR Encoders with Differential Privacy
Chauhan, Geeticka, Chien, Steve, Thakkar, Om, Thakurta, Abhradeep, Narayanan, Arun
Self-supervised learning (SSL) methods for large speech models have proven to be highly effective at ASR. With the interest in public deployment of large pre-trained models, there is a rising concern for unintended memorization and leakage of sensitive data points from the training data. In this paper, we apply differentially private (DP) pre-training to a SOTA Conformer-based encoder, and study its performance on a downstream ASR task assuming the fine-tuning data is public. This paper is the first to apply DP to SSL for ASR, investigating the DP noise tolerance of the BEST-RQ pre-training method. Notably, we introduce a novel variant of model pruning called gradient-based layer freezing that provides strong improvements in privacy-utility-compute trade-offs. Our approach yields a LibriSpeech test-clean/other WER (%) of 3.78/8.41 with (10, 1e-9)-DP for extrapolation towards low dataset scales, and 2.81/5.89 with (10, 7.9e-11)-DP for extrapolation towards high scales.
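As an illustration of the privacy mechanism involved, below is a minimal Python sketch of a DP-SGD update (per-example gradient clipping plus Gaussian noise), which is the core of DP training. It is not the paper's BEST-RQ pre-training or its gradient-based layer freezing; the toy linear model and all hyperparameters are illustrative assumptions.

import numpy as np

def dp_sgd_step(w, X, y, clip_norm=1.0, noise_multiplier=1.0, lr=0.1, rng=None):
    # One DP-SGD step for a toy linear least-squares model (illustrative only).
    rng = rng if rng is not None else np.random.default_rng(0)
    clipped_grads = []
    for xi, yi in zip(X, y):
        g = 2.0 * (xi @ w - yi) * xi                        # per-example gradient of squared error
        scale = min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12))
        clipped_grads.append(g * scale)                     # clip to bound per-example sensitivity
    summed = np.sum(clipped_grads, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=w.shape)
    return w - lr * (summed + noise) / len(X)               # noisy average-gradient update

The privacy guarantee comes from the clipping bound and the noise scale; the layer-freezing variant described in the paper additionally reduces the number of trainable parameters, which is not reproduced here.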
How Good Is NLP? A Sober Look at NLP Tasks through the Lens of Social Impact
Jin, Zhijing, Chauhan, Geeticka, Tse, Brian, Sachan, Mrinmaya, Mihalcea, Rada
Recent years have seen many breakthroughs in natural language processing (NLP), transitioning it from a mostly theoretical field to one with many real-world applications. Noting the growing number of applications of other machine learning and AI techniques with pervasive societal impact, we anticipate the rising importance of developing NLP technologies for social good. Inspired by theories in moral philosophy and global priorities research, we aim to promote a guideline for social good in the context of NLP. We lay the foundations via moral philosophy's definition of social good, propose a framework to evaluate NLP tasks' direct and indirect real-world impact, and adopt the methodology of global priorities research to identify priority causes for NLP research. Finally, we use our theoretical framework to provide some practical guidelines for future NLP research for social good. Our data and code are available at http://github.com/zhijing-jin/nlp4sg_acl2021
MIMIC-Extract: A Data Extraction, Preprocessing, and Representation Pipeline for MIMIC-III
Wang, Shirly, McDermott, Matthew B. A., Chauhan, Geeticka, Hughes, Michael C., Naumann, Tristan, Ghassemi, Marzyeh
Robust machine learning relies on access to data that can be used with standardized frameworks in important tasks and the ability to develop models whose performance can be reasonably reproduced. In machine learning for healthcare, the community faces reproducibility challenges due to a lack of publicly accessible data and a lack of standardized data processing frameworks. We present MIMIC-Extract, an open-source pipeline for transforming raw electronic health record (EHR) data for critical care patients contained in the publicly available MIMIC-III database into dataframes that are directly usable in common machine learning pipelines. MIMIC-Extract addresses three primary challenges in making complex health records data accessible to the broader machine learning community. First, it provides standardized data processing functions, including unit conversion, outlier detection, and aggregating semantically equivalent features, thus accounting for duplication and reducing missingness. Second, it preserves the time-series nature of clinical data and can be easily integrated into clinically actionable prediction tasks in machine learning for health. Finally, it is highly extensible so that other researchers with related questions can easily use the same pipeline. We demonstrate the utility of this pipeline by showcasing several benchmark tasks and baseline results. These authors contributed equally and should be considered co-first authors.
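To give a concrete flavour of the transformations described above, the following is an illustrative pandas sketch of unit conversion, outlier filtering, feature aggregation, and hourly resampling on a hypothetical vitals table. The column names and thresholds are assumptions; this is not the actual MIMIC-Extract API.

import pandas as pd

def clean_vitals(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["charttime"] = pd.to_datetime(df["charttime"])
    # Unit conversion: harmonize temperature readings recorded in Fahrenheit.
    fahrenheit = df["temperature"] > 50
    df.loc[fahrenheit, "temperature"] = (df.loc[fahrenheit, "temperature"] - 32) * 5 / 9
    # Outlier detection: drop physiologically implausible heart rates.
    df = df[df["heart_rate"].between(20, 300)]
    # Aggregate semantically equivalent features into a single column.
    df["systolic_bp"] = df[["nibp_systolic", "abp_systolic"]].mean(axis=1)
    # Preserve the time-series structure: hourly aggregation per ICU stay.
    return (df.set_index("charttime")
              .groupby("icustay_id")
              .resample("1H")
              .mean(numeric_only=True))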
Rethinking clinical prediction: Why machine learning must consider year of care and feature aggregation
Nestor, Bret, McDermott, Matthew B. A., Chauhan, Geeticka, Naumann, Tristan, Hughes, Michael C., Goldenberg, Anna, Ghassemi, Marzyeh
Machine learning for healthcare often trains models on de-identified datasets with randomly-shifted calendar dates, ignoring the fact that data were generated under hospital operation practices that change over time. These changing practices induce definitive changes in the observed data, which confound evaluations that do not account for dates and limit the generalisability of date-agnostic models. In this work, we establish the magnitude of this problem on MIMIC, a public hospital dataset, and showcase a simple solution. We augment MIMIC with the year in which care was provided and show that a model trained using standard feature representations will significantly degrade in quality over time. We find a deterioration of 0.3 AUC when evaluating mortality prediction on data from 10 years later, and a similar deterioration of 0.15 AUC for length-of-stay prediction. In contrast, we demonstrate that clinically-oriented aggregates of raw features significantly mitigate future deterioration. Our suggested aggregated representations, when retrained yearly, have prediction quality comparable to year-agnostic models.
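The evaluation setup can be pictured with a short sketch: train on early years of care, then measure how AUROC degrades on later years. The column names ("year", "label"), the model choice, and the features below are hypothetical stand-ins, not the paper's actual pipeline or tasks.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def auc_by_year(df: pd.DataFrame, feature_cols, train_years, test_years):
    # Fit on records from the training years only.
    train = df[df["year"].isin(train_years)]
    model = LogisticRegression(max_iter=1000).fit(train[feature_cols], train["label"])
    results = {}
    for year in test_years:
        test = df[df["year"] == year]
        scores = model.predict_proba(test[feature_cols])[:, 1]
        results[year] = roc_auc_score(test["label"], scores)
    return results  # a downward trend over later years signals temporal drift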
Building on Word Animacy to Determine Coreference Chain Animacy in Cultural Narratives
Jahan, Labiba, Chauhan, Geeticka, Finlayson, Mark A. (all Florida International University)
Animacy is the characteristic of being able to independently carry out actions in a story world (e.g., movement, communication). It is a necessary property of characters in stories, and so detecting animacy is an important step in automatic story understanding. Prior approaches to animacy detection have conceived of animacy as a word- or phrase-level property, without explicitly connecting it to characters. In this work we compute the animacy of referring expressions using a statistical approach that incorporates features such as word embeddings of the referring expression, its head noun, the grammatical subject, and semantic roles. We then compute the animacy of coreference chains via a majority vote over the animacy of the chain's constituent referring expressions. We also reimplement prior approaches to word-level animacy to compare performance. We demonstrate these results on a small set of folktales with gold-standard annotations for coreference structure and animacy (15 Russian folktales translated into English). Folktales present an interesting challenge because they often involve characters who are members of traditionally inanimate classes (e.g., stoves that walk, trees that talk). We achieve an F1 measure of 0.90 for the referring expression animacy model, and 0.86 for the coreference chain model. We discuss several ways in which we anticipate these results may be improved in future work.
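A minimal sketch of the two-stage approach (classify each referring expression, then take a majority vote over a coreference chain) might look as follows. The feature vectors are assumed to be pre-computed, and the classifier choice is illustrative rather than the paper's exact model.

from collections import Counter
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_expression_classifier(feature_vectors, animacy_labels):
    # Stage 1: a statistical classifier over referring-expression features
    # (e.g., embeddings of the expression, head noun, subject, semantic roles).
    return LogisticRegression(max_iter=1000).fit(feature_vectors, animacy_labels)

def chain_animacy(model, chain_feature_vectors):
    # Stage 2: majority vote over the predicted animacy of the chain's
    # constituent referring expressions.
    votes = model.predict(np.asarray(chain_feature_vectors))
    return Counter(votes).most_common(1)[0][0]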