Wang, Yadong
MSNGO: multi-species protein function annotation based on 3D protein structure and network propagation
Wang, Beibei, Cui, Boyue, Chen, Shiqu, Wang, Xuan, Wang, Yadong, Li, Junyi
Motivation: In recent years, protein function prediction has broken through the bottleneck of sequence features, significantly improving prediction accuracy using high-precision protein structures predicted by AlphaFold2. While single-species protein function prediction methods have achieved remarkable success, multi-species protein function prediction methods are still in the stage of using PPI networks and sequence features. Providing effective cross-species label propagation for species with sparse protein annotations remains a challenging issue. To address this problem, we propose the MSNGO model, which integrates structural features and network propagation methods. Our validation shows that using structural features can significantly improve the accuracy of multi-species protein function prediction. Results: We employ graph representation learning techniques to extract amino acid representations from protein structure contact maps and train a structural model using a graph convolution pooling module to derive protein-level structural features. After incorporating the sequence features from ESM-2, we apply a network propagation algorithm to aggregate information and update node representations within a heterogeneous network. The results demonstrate that MSNGO outperforms previous multi-species protein function prediction methods that rely on sequence features and PPI networks. Availability: https://github.com/blingbell/MSNGO.
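The propagation step mentioned in the abstract can be illustrated with a generic sketch. This is not the MSNGO implementation: it assumes a single homogeneous graph given as a dense adjacency matrix, and the function name `propagate` and the parameters `alpha` and `steps` are hypothetical. Each node repeatedly mixes the average of its neighbors' representations with its own initial features.

```python
def propagate(adjacency, features, alpha=0.5, steps=10):
    """Generic feature propagation over a graph (illustrative only).

    adjacency: n x n list of lists with non-negative edge weights.
    features:  n x d list of lists holding the initial node features.
    alpha:     weight given to the neighbor average at each step.
    """
    n = len(adjacency)
    dim = len(features[0])

    # Row-normalize so each node averages over its neighbors.
    # Isolated nodes (zero row sum) receive no neighbor signal.
    norm = []
    for row in adjacency:
        total = sum(row)
        norm.append([w / total if total else 0.0 for w in row])

    current = [list(f) for f in features]
    for _ in range(steps):
        updated = []
        for i in range(n):
            agg = [sum(norm[i][j] * current[j][d] for j in range(n))
                   for d in range(dim)]
            # Blend the neighbor average with the node's initial features.
            updated.append([alpha * agg[d] + (1 - alpha) * features[i][d]
                            for d in range(dim)])
        current = updated
    return current
```

Keeping a share of the initial features at every step (the `1 - alpha` term) is one common way to stop representations from collapsing to the same value across the graph; whether MSNGO does this is not stated in the abstract.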
Combinatorial Testing for Deep Learning Systems
Ma, Lei, Zhang, Fuyuan, Xue, Minhui, Li, Bo, Liu, Yang, Zhao, Jianjun, Wang, Yadong
Deep learning (DL) has achieved remarkable progress over the past decade and has been widely applied to many safety-critical applications. However, the robustness of DL systems has recently raised great concern, for example because of adversarial examples against computer vision systems, which could potentially result in severe consequences. Adopting testing techniques could help to evaluate the robustness of a DL system and therefore detect vulnerabilities at an early stage. The main challenge of testing such systems is that their runtime state space is too large: if we view each neuron as a runtime state for DL, then a DL system often contains a massive number of states, rendering testing each state almost impossible. For traditional software, combinatorial testing (CT) is an effective technique that reduces the testing space while retaining relatively high defect detection ability. In this paper, we perform an exploratory study of CT on DL systems. We adapt the concepts of CT and propose a set of coverage criteria for DL systems, as well as a CT coverage-guided test generation technique. Our evaluation demonstrates that CT provides a promising avenue for testing DL systems. We further pose several open questions and interesting directions for combinatorial testing of DL systems.
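To make the idea of combinatorial coverage over neurons concrete, here is a minimal sketch of a 2-way coverage measure. It is an illustration under stated assumptions, not the criteria defined in the paper: neuron outputs are assumed to be already binarized to 0 (inactive) or 1 (active), and the function name `two_way_coverage` is hypothetical. Instead of requiring every joint state of all neurons, it only asks how many on/off configurations of each neuron pair the test set has exercised.

```python
from itertools import combinations


def two_way_coverage(activations):
    """Fraction of (neuron pair, on/off configuration) combinations covered.

    activations: list of equal-length tuples of 0/1 values, one tuple per
                 test input, giving the binarized activations of one layer.
    """
    if not activations:
        return 0.0
    num_neurons = len(activations[0])
    pairs = list(combinations(range(num_neurons), 2))
    if not pairs:
        return 0.0

    covered = set()
    for vec in activations:
        for i, j in pairs:
            # Record which of the four configurations this input exercises.
            covered.add((i, j, vec[i], vec[j]))

    # Each pair has 4 possible configurations: 00, 01, 10, 11.
    return len(covered) / (4 * len(pairs))
```

For a layer of three neurons, the two inputs `(0, 0, 0)` and `(1, 1, 1)` cover two of the four configurations for each pair, so `two_way_coverage` returns 0.5. The number of targets grows with the number of pairs rather than exponentially with the number of neurons, which is the space reduction CT aims for.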
DeepGauge: Comprehensive and Multi-Granularity Testing Criteria for Gauging the Robustness of Deep Learning Systems
Ma, Lei, Juefei-Xu, Felix, Sun, Jiyuan, Chen, Chunyang, Su, Ting, Zhang, Fuyuan, Xue, Minhui, Li, Bo, Li, Li, Liu, Yang, Zhao, Jianjun, Wang, Yadong
Deep learning defines a new data-driven programming paradigm that constructs the internal system logic of a crafted neural network through a set of training data. Deep learning (DL) has been widely adopted in many safety-critical scenarios. However, a plethora of studies have shown that state-of-the-art DL systems suffer from various vulnerabilities which can lead to severe consequences when applied to real-world applications. Currently, the robustness of a DL system against adversarial attacks is usually measured by its accuracy on test data. Given the limited availability of test data, good performance on test data can hardly guarantee the robustness and generality of DL systems. Unlike traditional software systems, which have clear and controllable logic and functionality, a DL system is trained with data and its internal behavior is not thoroughly understood. This makes system analysis and defect detection difficult, which could potentially hinder real-world deployment without safety guarantees. In this paper, we propose DeepGauge, a set of comprehensive, multi-granularity testing criteria for DL systems, which render a complete and multi-faceted portrayal of the testbed. The in-depth evaluation of our proposed testing criteria is demonstrated on two well-known datasets and five DL systems, with four state-of-the-art adversarial data generation techniques. The effectiveness of DeepGauge sheds light on the construction of robust DL systems.
Comparison among dimensionality reduction techniques based on Random Projection for cancer classification
Xie, Haozhe, Li, Jie, Zhang, Qiaosheng, Wang, Yadong
The Random Projection (RP) technique has been widely applied in many scenarios because it can reduce high-dimensional features to a low-dimensional space within a short time and meet the need for real-time analysis of massive data. With the fast growth of big genomics data, there is an urgent need for dimensionality reduction. However, RP on its own usually yields lower classification accuracy. We attempt to improve the classification accuracy of RP by combining it with other dimensionality reduction methods such as Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and Feature Selection (FS). We compared the classification accuracy and running time of different combination methods on three microarray datasets and a simulation dataset. Experimental results show a remarkable improvement of 14.77% in classification accuracy for FS followed by RP compared to RP alone on the BC-TCGA dataset. LDA followed by RP also helps RP to yield a more discriminative subspace, with an increase of 13.65% in classification accuracy on the same dataset. FS followed by RP outperforms the other combination methods in classification accuracy on most of the datasets.
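The "FS followed by RP" combination can be sketched as a two-stage pipeline. This is a minimal illustration, not the paper's procedure: the abstract does not say which feature selection method was used, so variance ranking is assumed here, and the function names `select_top_variance` and `random_projection` as well as the Gaussian projection matrix are illustrative choices.

```python
import math
import random


def select_top_variance(X, k):
    """Keep the k features with the largest variance (assumed FS criterion).

    X: list of samples, each a list of d feature values.
    Returns the reduced data and the indices of the kept features.
    """
    n = len(X)
    d = len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    variances = [sum((row[j] - means[j]) ** 2 for row in X) / n
                 for j in range(d)]
    ranked = sorted(range(d), key=lambda j: variances[j], reverse=True)
    keep = sorted(ranked[:k])
    return [[row[j] for j in keep] for row in X], keep


def random_projection(X, out_dim, seed=0):
    """Project samples onto out_dim random Gaussian directions."""
    rng = random.Random(seed)
    d = len(X[0])
    scale = 1.0 / math.sqrt(out_dim)
    # d x out_dim matrix with i.i.d. N(0, 1/out_dim) entries.
    R = [[rng.gauss(0.0, 1.0) * scale for _ in range(out_dim)]
         for _ in range(d)]
    return [[sum(row[i] * R[i][c] for i in range(d)) for c in range(out_dim)]
            for row in X]
```

A call such as `random_projection(select_top_variance(X, k)[0], out_dim)` applies the two stages in order: selection first discards low-variance features, so the random directions are drawn only over the retained ones.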