Yuan, Xinyu
Protein Structure Tokenization: Benchmarking and New Recipe
Yuan, Xinyu, Wang, Zichen, Collins, Marcus, Rangwala, Huzefa
Recent years have witnessed a surge in the development of protein structural tokenization methods, which chunk protein 3D structures into discrete or continuous representations. Structure tokenization enables the direct application of powerful techniques like language modeling for protein structures, and large multimodal models to integrate structures with protein sequences and functional texts. Despite the progress, the capabilities and limitations of these methods remain poorly understood due to the lack of a unified evaluation framework. We first introduce StructTokenBench, a framework that comprehensively evaluates the quality and efficiency of structure tokenizers, focusing on fine-grained local substructures rather than global structures, as typical in existing benchmarks. Our evaluations reveal that no single model dominates all benchmarking perspectives. Observations of codebook under-utilization led us to develop AminoAseed, a simple yet effective strategy that enhances codebook gradient updates and optimally balances codebook size and dimension for improved tokenizer utilization and quality. Compared to the leading model ESM3, our method achieves an average of 6.31% performance improvement across 24 supervised tasks, with sensitivity and utilization rates increased by 12.83% and 124.03%, respectively.
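The codebook under-utilization mentioned in this abstract can be illustrated with a toy vector-quantization example. This is a minimal sketch, assuming nearest-neighbor assignment to a fixed codebook and defining utilization as the fraction of codes used at least once. The function names and this definition are illustrative assumptions, not StructTokenBench's actual metric or the AminoAseed implementation.

```python
from typing import List, Sequence


def nearest_code(vector: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(vector, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def tokenize(embeddings: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each continuous embedding to a discrete token id."""
    return [nearest_code(vector, codebook) for vector in embeddings]


def utilization_rate(tokens: Sequence[int], codebook_size: int) -> float:
    """Fraction of codebook entries used at least once (assumed definition)."""
    return len(set(tokens)) / codebook_size


# Hypothetical 2-D codebook with four codes; two of them sit far from the data.
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [9.0, 9.0]]
embeddings = [[0.1, 0.0], [0.9, 1.2], [0.2, -0.1], [1.1, 0.8]]

tokens = tokenize(embeddings, codebook)
print(tokens, utilization_rate(tokens, len(codebook)))
```

In this toy setting only half of the codes are ever selected, which is the kind of under-utilization the abstract refers to.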
Learning-based Sketches for Frequency Estimation in Data Streams without Ground Truth
Yuan, Xinyu, Qiao, Yan, Li, Meng, Wei, Zhenchun, Feng, Cuiying
The frequency or volume estimation of unending data streams is a concern in many domains, starting with telecommunications but spreading to social networks, finance, and website engine. In network fields, for example, professionals want to keep track of the activity frequency to identify overall network health and potential anomalies or changes in behavior, which, however, is often challenging because the amount of information may be too large to store in an embedded device or to keep conveniently in fast storage [1]. As a consequence, sketch, which is a set of counters or bitmaps associated with hash functions, and a set of simple operations that record approximate information [2], has grown in popularity in the context of high-velocity data streams and limited computational
Learning-augmented streaming algorithms [10-15] is receiving significant attention due to the powerful potential of machine learning (ML) to relieve or eliminate the binding of data characteristics and the sketch design. Their typical workflow involves training a heavy hitter oracle, which receives a key and returns a prediction of whether it will be heavy or not, then inserts the most frequent keys into unique buckets and applies a sketch to the remaining keys. Although filtering heavy items has been proven to improve the overall sketch performance on heavy-tailed distribution [4, 10], these offline and supervised methods could hardly work in real-world applications.
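The oracle-filtered workflow described above (a heavy-hitter oracle, unique buckets for predicted-heavy keys, and a sketch for the rest) can be sketched roughly as follows. This is an illustration of that prior workflow under stated assumptions, not the method proposed in the paper: the sketch is assumed to be a Count-Min sketch, the oracle is a hypothetical hand-written predicate rather than a trained model, and the class names are invented for this example.

```python
import hashlib
from typing import Callable, Dict, List


class CountMinSketch:
    """Plain Count-Min sketch: `depth` rows of `width` counters."""

    def __init__(self, width: int, depth: int) -> None:
        self.width = width
        self.depth = depth
        self.table: List[List[int]] = [[0] * width for _ in range(depth)]

    def _bucket(self, key: str, row: int) -> int:
        # One salted hash per row stands in for independent hash functions.
        digest = hashlib.blake2b(
            key.encode("utf-8"), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def update(self, key: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._bucket(key, row)] += count

    def estimate(self, key: str) -> int:
        # Collisions only add counts, so the row minimum is the tightest estimate.
        return min(self.table[row][self._bucket(key, row)] for row in range(self.depth))


class OracleFilteredSketch:
    """Exact counters for predicted-heavy keys, Count-Min for everything else."""

    def __init__(self, is_heavy: Callable[[str], bool], width: int, depth: int) -> None:
        self.is_heavy = is_heavy          # hypothetical oracle: key -> heavy or not
        self.exact: Dict[str, int] = {}   # the "unique buckets"
        self.sketch = CountMinSketch(width, depth)

    def update(self, key: str, count: int = 1) -> None:
        if self.is_heavy(key):
            self.exact[key] = self.exact.get(key, 0) + count
        else:
            self.sketch.update(key, count)

    def estimate(self, key: str) -> int:
        if self.is_heavy(key):
            return self.exact.get(key, 0)
        return self.sketch.estimate(key)


# Toy stream: one flow dominates; the oracle is assumed to predict it as heavy.
predicted_heavy = {"10.0.0.1"}
learned = OracleFilteredSketch(lambda key: key in predicted_heavy, width=64, depth=4)
for key in ["10.0.0.1"] * 100 + ["10.0.0.2"] * 3 + ["10.0.0.3"]:
    learned.update(key)
print(learned.estimate("10.0.0.1"), learned.estimate("10.0.0.2"))
```

Because the oracle must be trained offline on labeled heavy keys, this design depends on ground truth being available, which is the limitation the passage raises.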
Diffusion Models Meet Network Management: Improving Traffic Matrix Analysis with Diffusion-based Approach
Yuan, Xinyu, Qiao, Yan, Wei, Zhenchun, Zhang, Zeyu, Li, Minyue, Zhao, Pei, Hu, Rongyao, Li, Wenjing
Because network operation and maintenance rely heavily on network traffic monitoring, traffic matrix analysis has been one of the most crucial issues for network management tasks. However, it is challenging to reliably obtain precise measurements in computer networks because of the high measurement cost and the unavoidable transmission loss. Although some methods proposed in recent years allow estimating network traffic from partial flow-level or link-level measurements, they often perform poorly for traffic matrix estimation today. Despite relying on strong assumptions such as low-rank structure and a prior distribution, existing techniques are usually task-specific and tend to perform significantly worse as modern network communication is extremely complicated and dynamic. To address this dilemma, this paper proposes a diffusion-based traffic matrix analysis framework named Diffusion-TM, which leverages problem-agnostic diffusion to notably elevate the estimation performance in both traffic distribution and accuracy. The novel framework not only takes advantage of the powerful generative ability of diffusion models to produce realistic network traffic, but also leverages the denoising process to unbiasedly estimate all end-to-end traffic in a plug-and-play manner under theoretical guarantee. Moreover, taking into account that compiling an intact traffic dataset is usually infeasible, we also propose a two-stage training scheme to make our framework insensitive to missing values in the dataset. Through extensive experiments on real-world datasets, we illustrate the effectiveness of Diffusion-TM on several tasks. The results also demonstrate that our method can obtain promising results even with only $5\%$ of the values known in the datasets.
Traffic Matrix Estimation based on Denoising Diffusion Probabilistic Model
Yuan, Xinyu, Qiao, Yan, Zhao, Pei, Hu, Rongyao, Zhang, Benchu
The traffic matrix estimation (TME) problem has been widely researched for decades. Recent progress in deep generative models offers new opportunities to tackle TME problems in a more advanced way. In this paper, we leverage the powerful ability of denoising diffusion probabilistic models (DDPMs) on distribution learning, and for the first time adopt DDPM to address the TME problem. To ensure good performance of DDPM on learning the distributions of TMs, we design a preprocessing module to reduce the dimensions of TMs while keeping the data variety of each OD flow. To improve the estimation accuracy, we parameterize the noise factors in DDPM and transform the TME problem into a gradient-descent optimization problem. Finally, we compare our method with the state-of-the-art TME methods using two real-world TM datasets; the experimental results strongly demonstrate the superiority of our method on both TM synthesis and TM estimation.
Diffusion-TS: Interpretable Diffusion for General Time Series Generation
Yuan, Xinyu, Qiao, Yan
Denoising diffusion probabilistic models (DDPMs) are becoming the leading paradigm for generative models. They have recently shown breakthroughs in audio synthesis, time series imputation and forecasting. In this paper, we propose Diffusion-TS, a novel diffusion-based framework that generates multivariate time series samples of high quality by using an encoder-decoder transformer with disentangled temporal representations, in which the decomposition technique guides Diffusion-TS to capture the semantic meaning of time series while transformers mine detailed sequential information from the noisy model input. Different from existing diffusion-based approaches, we train the model to directly reconstruct the sample instead of the noise in each diffusion step, combining a Fourier-based loss term. Diffusion-TS is expected to generate time series satisfying both interpretability and realness. In addition, it is shown that the proposed Diffusion-TS can be easily extended to conditional generation tasks, such as forecasting and imputation, without any model changes. This also motivates us to further explore the performance of Diffusion-TS under irregular settings. Finally, through qualitative and quantitative experiments, results show that Diffusion-TS achieves state-of-the-art results on various realistic analyses of time series.
Time series are ubiquitous in real-world problems, playing a crucial role in a wide variety of domains such as finance, medicine, biology, retail, and climate modeling (Lim & Zohren, 2021). However, lack of access to these dynamical data is a key hindrance to the development of machine learning solutions in some cases where data sharing may lead to privacy breaches (Alaa et al., 2021). Synthesizing realistic time series data is viewed as a promising solution and has received increasing attention driven by advances in deep learning.
With perceptual qualities superior to GANs while avoiding the optimization challenges of adversarial training, score-based diffusion models (Song et al., 2021; 2020), especially denoising diffusion probabilistic models (DDPMs) (Ho et al., 2020), have taken the world of image, video, and text generation (Ho et al., 2022; Li et al., 2022a; Dhariwal & Nichol, 2021; Harvey et al., 2022) by storm.
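The training objective described in the Diffusion-TS abstract (reconstructing the sample directly, combined with a Fourier-based loss term) can be illustrated with a small example. This is a minimal sketch for a single univariate series, assuming a mean-squared error in the time domain plus a weighted squared error between discrete Fourier transforms. The `fourier_weight` parameter and the exact weighting are illustrative assumptions, not the paper's stated formulation, and the denoising network itself is omitted.

```python
import cmath
from typing import List, Sequence


def dft(series: Sequence[float]) -> List[complex]:
    """Naive discrete Fourier transform, O(n^2), adequate for short examples."""
    n = len(series)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(series))
        for k in range(n)
    ]


def reconstruction_loss(target: Sequence[float],
                        predicted: Sequence[float],
                        fourier_weight: float = 0.1) -> float:
    """Time-domain MSE plus a weighted frequency-domain squared error.

    `target` is the clean sample x_0 and `predicted` is the model's direct
    estimate of x_0 at some diffusion step (here supplied by hand).
    """
    n = len(target)
    time_term = sum((a - b) ** 2 for a, b in zip(target, predicted)) / n
    freq_term = sum(
        abs(a - b) ** 2 for a, b in zip(dft(target), dft(predicted))
    ) / n
    return time_term + fourier_weight * freq_term


clean = [0.0, 1.0, 0.0, -1.0]      # one period of a sine-like signal
estimate = [0.0, 0.0, 0.0, 0.0]    # a poor reconstruction that misses the oscillation
print(reconstruction_loss(clean, estimate))
```

The frequency term penalizes errors in periodic components explicitly, which is one plausible reading of why a Fourier-based term is paired with sample reconstruction.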
A*Net: A Scalable Path-based Reasoning Approach for Knowledge Graphs
Zhu, Zhaocheng, Yuan, Xinyu, Galkin, Mikhail, Xhonneux, Sophie, Zhang, Ming, Gazeau, Maxime, Tang, Jian
Reasoning on large-scale knowledge graphs has been long dominated by embedding methods. While path-based methods possess the inductive capacity that embeddings lack, their scalability is limited by the exponential number of paths. Here we present A*Net, a scalable path-based method for knowledge graph reasoning. Inspired by the A* algorithm for shortest path problems, our A*Net learns a priority function to select important nodes and edges at each iteration, to reduce time and memory footprint for both training and inference. The ratio of selected nodes and edges can be specified to trade off between performance and efficiency. Experiments on both transductive and inductive knowledge graph reasoning benchmarks show that A*Net achieves competitive performance with existing state-of-the-art path-based methods, while merely visiting 10% nodes and 10% edges at each iteration. On a million-scale dataset ogbl-wikikg2, A*Net not only achieves a new state-of-the-art result, but also converges faster than embedding methods. A*Net is the first path-based method for knowledge graph reasoning at such scale.
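The selection step in this abstract (keep only a specified ratio of nodes and edges at each iteration, ranked by a priority function) can be sketched as below. This is a simplified illustration under stated assumptions: the priority scores are hand-written numbers, whereas A*Net learns its priority function, and edges are kept whenever their head node is selected rather than being ranked separately. All names are hypothetical.

```python
import math
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str, str]  # (head, relation, tail)


def select_top_ratio(priority: Dict[str, float], ratio: float) -> Set[str]:
    """Keep the top `ratio` fraction of nodes by priority (at least one node)."""
    keep = max(1, math.ceil(ratio * len(priority)))
    ranked = sorted(priority, key=priority.get, reverse=True)
    return set(ranked[:keep])


def edges_from_selected(edges: List[Edge], selected: Set[str]) -> List[Edge]:
    """Restrict message passing to edges that leave a selected node."""
    return [edge for edge in edges if edge[0] in selected]


# Hypothetical priorities for one iteration; higher means more relevant to the query.
priority = {"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.3}
edges = [("a", "r1", "b"), ("b", "r2", "c"), ("c", "r1", "d"), ("d", "r3", "a")]

selected = select_top_ratio(priority, ratio=0.5)
print(sorted(selected), edges_from_selected(edges, selected))
```

Lowering `ratio` shrinks the set of nodes and edges visited per iteration, which is the performance and efficiency trade-off the abstract describes.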
Towards Foundation Models for Knowledge Graph Reasoning
Galkin, Mikhail, Yuan, Xinyu, Mostafa, Hesham, Tang, Jian, Zhu, Zhaocheng
Foundation models in language and vision have the ability to run inference on any textual and visual inputs thanks to transferable representations such as a vocabulary of tokens in language. Knowledge graphs (KGs) have different entity and relation vocabularies that generally do not overlap. The key challenge of designing foundation models on KGs is to learn such transferable representations that enable inference on any graph with arbitrary entity and relation vocabularies. In this work, we make a step towards such foundation models and present ULTRA, an approach for learning universal and transferable graph representations. ULTRA builds relational representations as a function conditioned on their interactions. Such a conditioning strategy allows a pre-trained ULTRA model to inductively generalize to any unseen KG with any relation vocabulary and to be fine-tuned on any graph. Conducting link prediction experiments on 57 different KGs, we find that the zero-shot inductive inference performance of a single pre-trained ULTRA model on unseen graphs of various sizes is often on par with or better than strong baselines trained on specific graphs. Fine-tuning further boosts the performance.
ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts
Xu, Minghao, Yuan, Xinyu, Miret, Santiago, Tang, Jian
Current protein language models (PLMs) learn protein representations mainly based on their sequences, thereby well capturing co-evolutionary information, but they are unable to explicitly acquire protein functions, which is the end goal of protein representation learning. Fortunately, for many proteins, their textual property descriptions are available, where their various functions are also described. Motivated by this fact, we first build the ProtDescribe dataset to augment protein sequences with text descriptions of their functions and other important properties. Based on this dataset, we propose the ProtST framework to enhance Protein Sequence pre-training and understanding by biomedical Texts. During pre-training, we design three types of tasks, i.e., unimodal mask prediction, multimodal representation alignment and multimodal mask prediction, to enhance a PLM with protein property information with different granularities and, at the same time, preserve the PLM's original representation power. On downstream tasks, ProtST enables both supervised learning and zero-shot prediction. We verify the superiority of ProtST-induced PLMs over previous ones on diverse representation learning benchmarks. Under the zero-shot setting, we show the effectiveness of ProtST on zero-shot protein classification, and ProtST also enables functional protein retrieval from a large-scale database without any function annotation.