Scalable semi-supervised dimensionality reduction with GPU-accelerated EmbedSOM

Adam Šmelko, Soňa Molnárová, Miroslav Kratochvíl, Abhishek Koladiya, Jan Musil, Martin Kruliš, Jiří Vondrášek

arXiv.org Machine Learning 

Abstract: Dimensionality reduction methods have found vast application as visualization tools in diverse areas of science. Although many different methods exist, their performance is often insufficient for providing quick insight into many contemporary datasets, and the unsupervised mode of use prevents users from applying the methods to dataset exploration and from fine-tuning details to improve visualization quality. BlosSOM builds on a GPU-accelerated implementation of the EmbedSOM algorithm, complemented by several landmark-based algorithms that interface the unsupervised model learning algorithms with user supervision. We show the application of BlosSOM on realistic datasets, where it helps to produce high-quality visualizations that incorporate a user-specified layout and focus on certain features. We believe that semi-supervised dimensionality reduction will improve the data visualization possibilities in science areas such as single-cell cytometry, and provide a fast and efficient base methodology for new directions in dataset exploration and annotation.
Dimensionality reduction algorithms have emerged as indispensable utilities that enable various forms of intuitive data visualization, providing insight that in turn simplifies rigorous data analysis. Various algorithms have been proposed for graphs and high-dimensional point-cloud data, and for many other types of datasets that can be represented with a graph structure or embedded into vector spaces. The performance of non-linear dimensionality reduction algorithms becomes a concern if the analysis pipeline is required to scale, or when the results are required within a limited amount of time, such as in clinical settings. The most popular methods, typically based on neighborhood embedding computed by stochastic descent, force-based layout, or neural autoencoders, reach their applicability limits when the dataset is too large.
To tackle these limitations, we have previously developed EmbedSOM [15], a dimensionality reduction and visualization algorithm based on self-organizing maps (SOMs) [13]. EmbedSOM provided an order-of-magnitude speedup on datasets typical of single-cell cytometry data visualization while retaining competitive quality of the results. The concept has proven useful for interactive and high-performance workflows in cytometry [16, 14], and applies easily to many other types of datasets.
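The general idea of a SOM-based, landmark-driven embedding can be illustrated with a minimal CPU sketch in NumPy. This is not the EmbedSOM projection and not BlosSOM code: the function names `train_som` and `embed`, all parameters, and the learning-rate and radius schedules are hypothetical choices made for illustration, and the projection step is a simplified stand-in (a distance-weighted average of the grid positions of the k nearest landmarks). The sketch also assumes the dataset has at least as many points as the map has nodes.

```python
import numpy as np


def train_som(data, grid_w=4, grid_h=4, epochs=10, seed=0):
    """Train a small self-organizing map with an online update rule.

    Returns (landmarks, grid): landmark positions in the input space
    and their fixed 2D grid coordinates. Assumes len(data) >= grid_w * grid_h.
    """
    rng = np.random.default_rng(seed)
    # 2D grid coordinates of the SOM nodes, shape (grid_w * grid_h, 2).
    grid = np.array([(x, y) for y in range(grid_h) for x in range(grid_w)],
                    dtype=float)
    # Initialize the landmarks from randomly chosen data points.
    landmarks = data[rng.choice(len(data), size=len(grid), replace=False)].copy()
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for i in rng.permutation(len(data)):
            frac = step / total_steps
            lr = 0.5 * (1.0 - frac)  # linearly decaying learning rate
            sigma = max(grid_w, grid_h) / 2 * (1.0 - frac) + 0.5  # shrinking radius
            # Best-matching unit: the landmark closest to the sample.
            bmu = np.argmin(((landmarks - data[i]) ** 2).sum(axis=1))
            # Gaussian neighborhood measured on the 2D grid, not in input space.
            grid_d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
            h = np.exp(-grid_d2 / (2.0 * sigma ** 2))
            # Pull every landmark toward the sample, scaled by its neighborhood weight.
            landmarks += lr * h[:, None] * (data[i] - landmarks)
            step += 1
    return landmarks, grid


def embed(data, landmarks, grid, k=4):
    """Simplified stand-in projection (NOT the EmbedSOM projection):
    place each point at the inverse-squared-distance weighted mean of
    the grid coordinates of its k nearest landmarks."""
    # Squared distances between all points and all landmarks, shape (n, m).
    d2 = ((data[:, None, :] - landmarks[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    w = 1.0 / (np.take_along_axis(d2, nearest, axis=1) + 1e-9)
    w /= w.sum(axis=1, keepdims=True)
    return (w[:, :, None] * grid[nearest]).sum(axis=1)
```

Calling `train_som(data)` on an `(n, d)` array and passing the result to `embed` yields an `(n, 2)` array of coordinates inside the grid rectangle. The projection of each point depends only on the landmarks, so the embedding step is independent across points; that property is what makes approaches of this kind natural candidates for parallel hardware.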