Goto

Collaborating Authors

 Translational Bioinformatics



Appendix ProteinShake: Building datasets and benchmarks for deep learning on protein structures

Neural Information Processing Systems

Table 3: Comparison of models trained with different representations of protein structure across various tasks, on a random data split . The optimal choice of representation depends on the task. Shown are mean and standard deviation across four runs with different seeds. Table 4: Comparison of models trained with different representations of protein structure across various tasks, on a sequence data split . Table 5: Comparison of models trained with different representations of protein structure across various tasks, on a structure data split .




Model Decides How to Tokenize: Adaptive DNA Sequence Tokenization with MxDNA

Neural Information Processing Systems

Foundation models have made significant strides in understanding the genomic language of DNA sequences. However, previous models typically adopt the tok-enization methods designed for natural language, which are unsuitable for DNA sequences due to their unique characteristics. In addition, the optimal approach to tokenize DNA remains largely under-explored, and may not be intuitively understood by humans even if discovered. To address these challenges, we introduce MxDNA, a novel framework where the model autonomously learns an effective DNA tokenization strategy through gradient decent.


Efficient and Effective Optimal Transport-Based Biclustering: Supplementary Material

Neural Information Processing Systems

Z that represents some transfer of mass between elements of w and v . The proof is the same for W . Proposition 2. Suppose that the target row and column representative distributions are the same, The the Kantorovich OT problem and whose rank is at most min(rank(Z), rank( W)) . Proof of proposition 2. From linear algebra, we have that Proof of proposition 3. We suppose that The optimal transport problem can be formulated and solved as the Earth Mover's Distance (EMD) We report the biclustering performance on the synthetic datasets in table 2. At least one of our models finds the perfect partition in all cases. The gene-expression matrices used are the Cumida Breast Cancer and Leukemia datasets. Their characteristics are shown in Table 3. Table 3: Characteristics of the gene expression datasets.