Goto

Collaborating Authors

 Zou, Quan


SBSM-Pro: Support Bio-sequence Machine for Proteins

arXiv.org Artificial Intelligence

Bio-sequences, which include DNA, RNA, and proteins, are the molecular foundation of modern genetic research. The classification of bio-sequences based on sequence information has been a key focus in bioinformatics research. At present, with the sequential completion of genome mapping from humans to various species, we have amassed a vast amount of sequence data, creating an urgent need for computer-assisted annotation of sequence functions. Although it is statistically evident that genetic sequences determine hereditary diseases, the mechanisms by which sequence variations contribute to diseases are intricately complex. It is difficult to address and interpret all these issues through one biological experiment; hence, multiple computer predictions are needed to guide the progression of wet lab exploration. In summary, the application of information science and machine learning to bio-sequence classification is a valuable tool for assisting researchers in comprehending and analysing bio-sequences. It serves as a key driving force for advancing research in the field of bioinformatics. In the field of bio-sequence classification, machine learning methods are broadly pursued using two strategies: feature extraction combined with traditional classification methods and direct sequence classification via deep learning techniques. For bio-sequences, relevant features are mainly characterized as frequency, physicochemical, structural, and evolutionary features.


MechRetro is a chemical-mechanism-driven graph learning framework for interpretable retrosynthesis prediction and pathway planning

arXiv.org Artificial Intelligence

Leveraging artificial intelligence for automatic retrosynthesis speeds up organic pathway planning in digital laboratories. However, existing deep learning approaches are unexplainable, like "black box" with few insights, notably limiting their applications in real retrosynthesis scenarios. Here, we propose MechRetro, a chemical-mechanism-driven graph learning framework for interpretable retrosynthetic prediction and pathway planning, which learns several retrosynthetic actions to simulate a reverse reaction via elaborate self-adaptive joint learning. By integrating chemical knowledge as prior information, we design a novel Graph Transformer architecture to adaptively learn discriminative and chemically meaningful molecule representations, highlighting the strong capacity in molecule feature representation learning. We demonstrate that MechRetro outperforms the state-of-the-art approaches for retrosynthetic prediction with a large margin on large-scale benchmark datasets. Extending MechRetro to the multi-step retrosynthesis analysis, we identify efficient synthetic routes via an interpretable reasoning mechanism, leading to a better understanding in the realm of knowledgeable synthetic chemists. We also showcase that MechRetro discovers a novel pathway for protokylol, along with energy scores for uncertainty assessment, broadening the applicability for practical scenarios. Overall, we expect MechRetro to provide meaningful insights for high-throughput automated organic synthesis in drug discovery.


Reference-Based Sequence Classification

arXiv.org Machine Learning

Sequence classification is an important data mining task in many real world applications. Over the past few decades, many sequence classification methods have been proposed from different aspects. In particular, the pattern-based method is one of the most important and widely studied sequence classification methods in the literature. In this paper, we present a reference-based sequence classification framework, which can unify existing pattern-based sequence classification methods under the same umbrella. More importantly, this framework can be used as a general platform for developing new sequence classification algorithms. By utilizing this framework as a tool, we propose new sequence classification algorithms that are quite different from existing solutions. Experimental results show that new methods developed under the proposed framework are capable of achieving comparable classification accuracy to those state-of-the-art sequence classification algorithms.


Instance-Based Classification through Hypothesis Testing

arXiv.org Machine Learning

Classification is a fundamental problem in machine learning and data mining. During the past decades, numerous classification methods have been presented based on different principles. However, most existing classifiers cast the classification problem as an optimization problem and do not address the issue of statistical significance. In this paper, we formulate the binary classification problem as a two-sample testing problem. More precisely, our classification model is a generic framework that is composed of two steps. In the first step, the distance between the test instance and each training instance is calculated to derive two distance sets. In the second step, the two-sample test is performed under the null hypothesis that the two sets of distances are drawn from the same cumulative distribution. After these two steps, we have two p-values for each test instance and the test instance is assigned to the class associated with the smaller p-value. Essentially, the presented classification method can be regarded as an instance-based classifier based on hypothesis testing. The experimental results on 40 real data sets show that our method is able to achieve the same level performance as the state-of-the-art classifiers and has significantly better performance than existing testing-based classifiers. Furthermore, we can handle outlying instances and control the false discovery rate of test instances assigned to each class under the same framework.