SCANPY: large-scale single-cell gene expression data analysis


With SCANPY, we introduce the class ANNDATA, provided by a corresponding package of the same name, which stores a data matrix together with the most general annotations possible: annotations of observations (samples, cells), annotations of variables (features, genes), and unstructured annotations. Because SCANPY is built around this class, it is easy to add new functionality to the toolkit. All statistics and machine-learning tools extract information from a data matrix, and this information can be added to an ANNDATA object while leaving the structure of ANNDATA unaffected. ANNDATA is similar to R's EXPRESSIONSET [26], but it supports sparse data and allows HDF5-based backing of ANNDATA objects on disk; HDF5 is a format independent of platform, framework, and language. This backing allows operating on an ANNDATA object without fully loading it into memory; the functionality is offered via ANNDATA's backed mode, as opposed to its memory mode. To simplify memory-efficient pipelines, SCANPY's functions operate in-place by default but optionally allow non-destructive transformation of objects. Pipelines written this way can then also be run in backed mode to exploit online-learning formulations of algorithms.