Lyu, Qi
Deep Learning for Genomics: A Concise Overview
Yue, Tianwei, Wang, Yuanxin, Zhang, Longxiang, Gu, Chunming, Xue, Haoru, Wang, Wenping, Lyu, Qi, Dun, Yujie
Advancements in genomic research, such as high-throughput sequencing techniques, have turned modern genomic studies into "big data" disciplines. This data explosion constantly challenges the conventional methods used in genomics. In parallel with the urgent demand for robust algorithms, deep learning has succeeded in a variety of fields such as vision, speech, and text processing. Yet genomics poses unique challenges to deep learning, since we expect from it a superhuman ability to interpret the genome beyond our existing knowledge. A powerful deep learning model should rest on an insightful use of task-specific knowledge. In this paper, we briefly discuss the strengths of different deep learning models from a genomic perspective, so that each task can be matched with a proper deep architecture, and we remark on practical considerations for developing modern deep learning architectures for genomics. We also provide a concise review of deep learning applications across genomic research, and point out potential opportunities and obstacles for future applications.
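To make the idea of matching a genomic task with a suitable deep architecture concrete, here is a minimal sketch, not taken from the paper, of a one-dimensional convolutional network operating on one-hot encoded DNA windows for a hypothetical binary prediction task (e.g., motif presence). All layer sizes and the data are illustrative assumptions.

```python
# A minimal sketch (not the paper's method): a 1-D CNN over one-hot encoded
# DNA sequences for a hypothetical binary genomic classification task.
import torch
import torch.nn as nn

class SeqCNN(nn.Module):
    def __init__(self, n_filters=32, kernel_size=12):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(4, n_filters, kernel_size),  # 4 input channels: A, C, G, T
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),               # max over sequence positions
            nn.Flatten(),
            nn.Linear(n_filters, 1),               # logit for the binary label
        )

    def forward(self, x):                          # x: (batch, 4, seq_len)
        return self.net(x).squeeze(-1)

# Usage with random stand-in data (replace with real one-hot encoded sequences).
x = torch.randint(0, 2, (8, 4, 200)).float()
logits = SeqCNN()(x)                               # shape: (8,)
```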
On Finite-Sample Identifiability of Contrastive Learning-Based Nonlinear Independent Component Analysis
Lyu, Qi, Fu, Xiao
Nonlinear independent component analysis (nICA) aims at recovering statistically independent latent components that are mixed by unknown nonlinear functions. Central to nICA is the identifiability of the latent components, which had been elusive until very recently. Specifically, Hyvärinen et al. have shown that the nonlinearly mixed latent components are identifiable (up to often inconsequential ambiguities) under a generalized contrastive learning (GCL) formulation, given that the latent components are independent conditioned on a certain auxiliary variable. The GCL-based identifiability of nICA is elegant, and establishes interesting connections between nICA and popular unsupervised/self-supervised learning paradigms in representation learning, causal learning, and factor disentanglement. However, existing identifiability analyses of nICA all build upon an unlimited-sample assumption and the use of ideal universal function learners, which creates a non-negligible gap between theory and practice. Closing the gap is a nontrivial challenge, as there is no established "textbook" routine for finite-sample analysis of such unsupervised problems. This work puts forth a finite-sample identifiability analysis of GCL-based nICA. Our analytical framework judiciously combines the properties of the GCL loss function, statistical generalization analysis, and numerical differentiation. The framework also takes the function learner's approximation error into consideration, and reveals an intuitive trade-off between the complexity and expressiveness of the employed function learner. Numerical experiments validate the theorems.
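The GCL formulation can be illustrated with a small sketch. A feature extractor and a coupling network are trained, via a logistic loss, to tell matched observation/auxiliary-variable pairs (x, u) from pairs in which u has been shuffled. The theory restricts the regression function to a specific additive form; the generic coupling network, network sizes, and data shapes below are simplifying assumptions for illustration only.

```python
# A simplified sketch of the generalized contrastive learning (GCL) idea:
# matched pairs (x, u) are labeled 1, pairs with shuffled u are labeled 0.
import torch
import torch.nn as nn

d_x, d_u, d_z = 10, 3, 5                     # observed, auxiliary, latent dims
h = nn.Sequential(nn.Linear(d_x, 64), nn.ReLU(), nn.Linear(64, d_z))
coupler = nn.Sequential(nn.Linear(d_z + d_u, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(list(h.parameters()) + list(coupler.parameters()), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def gcl_step(x, u):
    """One GCL update on a batch of matched (x, u) pairs."""
    u_perm = u[torch.randperm(u.shape[0])]          # break the x-u dependence
    z = h(x)                                        # candidate latent components
    pos = coupler(torch.cat([z, u], dim=1))         # matched pairs
    neg = coupler(torch.cat([z, u_perm], dim=1))    # mismatched pairs
    logits = torch.cat([pos, neg], dim=0).squeeze(-1)
    labels = torch.cat([torch.ones(len(x)), torch.zeros(len(x))])
    loss = bce(logits, labels)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```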
Latent Correlation-Based Multiview Learning and Self-Supervision: A Unifying Perspective
Lyu, Qi, Fu, Xiao, Wang, Weiran, Lu, Songtao
Multiple views of data, both naturally acquired (e.g., image and audio) and artificially produced (e.g., by adding different noise to data samples), have proven useful in enhancing representation learning. Natural views are often handled by multiview analysis tools, e.g., (deep) canonical correlation analysis [(D)CCA], while the artificial ones are frequently used in self-supervised learning (SSL) paradigms, e.g., SimCLR and Barlow Twins. Both types of approaches often involve learning neural feature extractors such that the embeddings of data exhibit high cross-view correlations. Although intuitive, the effectiveness of correlation-based neural embedding had only been validated empirically. This work puts forth a theory-backed framework for unsupervised multiview learning. Our development starts by proposing a multiview model in which each view is a nonlinear mixture of shared and private components. Consequently, the learning problem boils down to shared/private component identification and disentanglement. Under this model, latent correlation maximization is shown to guarantee the extraction of the shared components across views (up to certain ambiguities). In addition, the private information in each view can be provably disentangled from the shared components using a proper regularization design. The method is tested on a series of tasks, e.g., downstream clustering, all of which show promising performance. Our development also provides a unifying perspective for understanding various DCCA and SSL schemes.
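A minimal sketch of the common core of these methods is given below: two encoders embed two views, and training maximizes the per-dimension cross-view correlation of the batch-standardized embeddings. This deliberately omits the whitening and private-component regularizers that the paper's framework requires; encoder sizes and the stand-in data are illustrative assumptions.

```python
# A minimal sketch of the latent-correlation objective shared by DCCA-style
# and SSL-style methods (regularizers omitted for brevity).
import torch
import torch.nn as nn

def cross_view_correlation(z1, z2, eps=1e-6):
    """Mean correlation between matching embedding dimensions of two views."""
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + eps)
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + eps)
    return (z1 * z2).mean()

enc1 = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 8))
enc2 = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 8))
opt = torch.optim.Adam(list(enc1.parameters()) + list(enc2.parameters()), lr=1e-3)

x1, x2 = torch.randn(256, 20), torch.randn(256, 20)   # stand-in paired views
loss = -cross_view_correlation(enc1(x1), enc2(x2))     # maximize correlation
opt.zero_grad(); loss.backward(); opt.step()
```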
Identifiability-Guaranteed Simplex-Structured Post-Nonlinear Mixture Learning via Autoencoder
Lyu, Qi, Fu, Xiao
Unsupervised mixture learning (UML) aims at unraveling the aggregated and entangled latent components underlying ambient data, without using any training samples. This task is also known as blind source separation (BSS) and factor analysis in the literature [1]. UML has a long history in the signal processing and machine learning communities; see, e.g., the early seminal work on independent component analysis (ICA) [1]. Many important applications can be cast as UML problems, e.g., audio/speech separation [2], EEG signal denoising [3], image representation learning [4], hyperspectral unmixing [5], and topic mining [6], just to name a few. Arguably one of the most important aspects of UML/BSS is the so-called identifiability problem: is it possible to identify the mixed latent components from the mixtures in an unsupervised manner? The UML problem is often ill-posed, since an arbitrary number of solutions exist in general; see, e.g., the discussions in [1, 7]. To establish identifiability, one may exploit prior knowledge of the mixing process and/or the latent components. Various frameworks have been proposed for unraveling linearly mixed latent components by exploiting their properties, e.g., statistical independence, nonnegativity, boundedness, sparsity, and simplex structure, which leads to many well-known unsupervised learning models, namely, ICA [1], nonnegative matrix factorization (NMF) [7], bounded component analysis (BCA) [8], sparse component analysis (SCA) [9], and simplex-structured matrix factorization (SSMF) [2, 6, 10]. These structures often stem from the physical meaning of the respective engineering problems.
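As a concrete illustration of the linear UML/BSS setting described above, the sketch below mixes independent synthetic sources with an unknown matrix and recovers them with FastICA from scikit-learn, up to the classic permutation and scaling ambiguities. The signals are synthetic stand-ins, not data from the paper.

```python
# Linear mixture model X = S A^T with independent sources S, unmixed by FastICA.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t),                 # smooth source
          np.sign(np.cos(3 * t)),        # square-wave source
          rng.laplace(size=t.size)]      # sparse/heavy-tailed source
A = rng.standard_normal((3, 3))          # unknown mixing matrix
X = S @ A.T                              # observed mixtures

S_hat = FastICA(n_components=3, random_state=0).fit_transform(X)
# S_hat matches S up to column permutation and scaling -- the inconsequential
# ambiguities inherent to linear ICA.
```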
Neural Network-Assisted Nonlinear Multiview Component Analysis: Identifiability and Algorithm
Lyu, Qi, Fu, Xiao
Multiview analysis aims at extracting shared latent components from data samples that are acquired in different domains, e.g., image, text, and audio. Classic multiview analysis, e.g., canonical correlation analysis (CCA), tackles this problem by matching the linearly transformed views in a certain latent domain. More recently, powerful nonlinear learning tools such as kernel methods and neural networks have been utilized to enhance classic CCA. However, unlike linear CCA, whose theoretical aspects are clearly understood, nonlinear CCA approaches are largely intuition-driven. In particular, it is unclear under what conditions the shared latent components across the views can be identified, while identifiability plays an essential role in many applications. In this work, we revisit nonlinear multiview analysis and address both the theoretical and computational aspects. We take a nonlinear multiview mixture learning viewpoint, which is a natural extension of the classic generative models for linear CCA. From there, we derive a nonlinear multiview analysis criterion. We show that minimizing this criterion leads to identification of the latent shared components up to certain ambiguities, under reasonable conditions. On the computational side, we propose an effective algorithm with simple and scalable update rules. A series of simulations and real-data experiments corroborate our theoretical analysis. Multiview analysis has been an indispensable tool in statistical signal processing, machine learning, and data analytics. In the context of multiview learning, a view can be understood as measurements of data entities (e.g., a cat) in a certain domain (e.g., text, image, and audio). Most data entities naturally appear in different domains. Multiview analysis aims at extracting essential and common information from different views. Compared with single-view analysis tools such as principal component analysis (PCA), independent component analysis (ICA) [1], and nonnegative matrix factorization (NMF) [2], multiview analysis tools such as canonical correlation analysis (CCA) [3] have an array of unique features. For example, CCA has been shown to be more robust to noise and to view-specific strong interference [4], [5]. Classic CCA has been extensively studied ever since its proposal in statistics in the 1930s [3], [6]. It seeks linear transformations that 'project' the views to a domain where they share similar representations. Interestingly, the formulated optimization problem, although nonconvex, can be recast into a generalized eigendecomposition problem and solved efficiently [3], [7].
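The classic reduction mentioned at the end, linear CCA as a generalized eigendecomposition, can be sketched as follows. The ridge terms and the specific implementation details are my own choices for numerical stability, not the algorithm proposed in the paper (which targets the nonlinear setting).

```python
# A sketch of linear CCA via a generalized eigendecomposition:
#   Cxy Cyy^{-1} Cyx w = rho^2 Cxx w
# where Cxx, Cyy, Cxy are (regularized) covariance matrices of the two views.
import numpy as np
from scipy.linalg import eigh

def linear_cca(X, Y, k=2, reg=1e-6):
    """Top-k canonical directions for view X and the canonical correlations."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    M = Cxy @ np.linalg.solve(Cyy, Cxy.T)          # Cxy Cyy^{-1} Cyx
    evals, evecs = eigh(M, Cxx)                    # generalized eigenproblem
    order = np.argsort(evals)[::-1]                # largest rho^2 first
    W = evecs[:, order[:k]]                        # directions for view X
    corrs = np.sqrt(np.clip(evals[order[:k]], 0.0, 1.0))
    return W, corrs
```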
WristAuthen: A Dynamic Time Warping Approach for User Authentication by Hand-Interaction through Wrist-Worn Devices
Lyu, Qi, Kong, Zhifeng, Shen, Chao, Yue, Tianwei
The growing trend of using wearable devices for context-aware computing and pervasive sensing has raised their potential for quick and reliable authentication techniques. Since personal writing habits differ from person to person, it is possible to authenticate users through their writing. This is of great significance, as sensitive information is easily collected by these devices. This paper presents a novel user authentication system for wrist-worn devices that analyzes users' interaction behavior and is both accurate and efficient for future use. The key feature of our approach lies in using the Savitzky-Golay filter and the Dynamic Time Warping method to obtain fine-grained writing metrics for user authentication. These metrics are relatively unique from person to person and independent of the computing platform. Analyses are conducted on wristband-interaction data collected from 50 users with diversity in gender, age, and height. Extensive experimental results show that the proposed approach can identify users in a timely and accurate manner, with a false-negative rate of 1.78%, a false-positive rate of 6.7%, and an Area Under the ROC Curve of 0.983. Additional examinations of robustness to various mimic attacks, tolerance to the amount of training data, and comparisons with other methods further demonstrate the approach's applicability.
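The two signal-processing ingredients named above can be sketched in a few lines: Savitzky-Golay smoothing of a wrist-motion trace followed by a basic dynamic time warping (DTW) distance between two traces. This is an illustration only, not the authors' full feature-extraction or authentication pipeline; the traces, window length, and polynomial order are assumptions.

```python
# Illustrative sketch: Savitzky-Golay smoothing + a plain DTW distance.
import numpy as np
from scipy.signal import savgol_filter

def dtw_distance(a, b):
    """O(len(a)*len(b)) dynamic time warping distance between 1-D sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Compare a hypothetical enrollment trace with a new login trace.
t = np.linspace(0, 1, 300)
trace_enroll = np.sin(6 * t) + 0.1 * np.random.randn(t.size)
trace_login = np.sin(6 * t + 0.2) + 0.1 * np.random.randn(t.size)
smooth_enroll = savgol_filter(trace_enroll, window_length=11, polyorder=3)
smooth_login = savgol_filter(trace_login, window_length=11, polyorder=3)
score = dtw_distance(smooth_enroll, smooth_login)   # lower score -> same user
```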