Out-of-distribution evaluations of channel agnostic masked autoencoders in fluorescence microscopy

Hurry, Christian John; Zhang, Jinjie; Ishola, Olubukola; Slade, Emma; Nguyen, Cuong Q.

arXiv.org Artificial Intelligence 

Developing computer vision for high-content screening is challenging due to various sources of distribution-shift caused by changes in experimental conditions, perturbagens, and fluorescent markers. The impacts of different sources of distribution-shift are confounded in typical evaluations of models based on transfer learning, which limits interpretations of how changes to model design and training affect generalisation. We propose an evaluation scheme that isolates sources of distribution-shift using the JUMP-CP dataset, allowing researchers to evaluate generalisation with respect to specific sources of distribution-shift. We then present Campfire, a channel-agnostic masked autoencoder which, via a shared decoder for all channels, scales effectively to datasets containing many different fluorescent markers. We show that it generalises to out-of-distribution experimental batches, perturbagens, and fluorescent markers, and that it transfers successfully from one cell type to another.

Phenotypic drug discovery, in which cells or animal models are subjected to a perturbation and monitored for a desired change in phenotype, has seen a resurgence due to its success in finding compounds that meet regulatory approval (Zheng et al., 2013; Boutros et al., 2015; Zanella et al., 2010). To quantify the effect of perturbations, it is common to use high-content screening (HCS), a method in which batches of cells are stimulated with thousands of compounds in parallel, and multiple markers of changes in phenotype are measured simultaneously. In comparison with modalities based on sequencing technologies, imaging is more time- and cost-effective at scale and has been the main modality of HCS data. This has necessitated the development of automated pipelines that extract biologically relevant features from cellular imaging data.
Typically, this has involved traditional methods based on cell segmentation and feature extraction, and has been applied to tasks including protein sub-cellular localisation (Pärnamaa & Parts, 2017), quantitative structure-activity relationship modelling (Nguyen et al., 2023), identifying mechanism of action (Dürr & Sick, 2016; Wong et al., 2023), and identifying markers of drug resistance (Kelley et al., 2023).
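To make concrete the idea of isolating sources of distribution-shift, the following is a minimal sketch and not the evaluation scheme proposed in this paper. It assumes each well carries only a batch label and a perturbagen label; the `Well` record, the `isolate_shifts` function, and the split names are all hypothetical. Each out-of-distribution split differs from the in-distribution pool in exactly one factor, and wells where both factors are unseen are set aside because the two effects could not be told apart.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Well:
    """Hypothetical metadata record for one imaged well."""
    batch: str
    perturbagen: str


def isolate_shifts(wells, held_out_batches, held_out_perturbagens):
    """Partition wells so each OOD split differs from the
    in-distribution pool in exactly one factor (illustrative only)."""
    splits = {
        "in_distribution": [],
        "ood_batch": [],
        "ood_perturbagen": [],
        "confounded": [],
    }
    for well in wells:
        new_batch = well.batch in held_out_batches
        new_pert = well.perturbagen in held_out_perturbagens
        if new_batch and new_pert:
            # Both factors shifted at once: the sources are confounded.
            splits["confounded"].append(well)
        elif new_batch:
            splits["ood_batch"].append(well)
        elif new_pert:
            splits["ood_perturbagen"].append(well)
        else:
            splits["in_distribution"].append(well)
    return splits


# Usage with made-up labels: 3 batches x 3 perturbagens.
wells = [Well(b, p) for b in ("B1", "B2", "B3") for p in ("cpdA", "cpdB", "cpdC")]
splits = isolate_shifts(wells, {"B3"}, {"cpdC"})
print({name: len(members) for name, members in splits.items()})
```

In this toy setup, a model would be fitted on (a subset of) the in-distribution pool and scored separately on each out-of-distribution split, so that a drop in performance can be attributed to a single factor. Fluorescent markers or cell types could be handled as further factors in the same way.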