Collaborating Authors

 Leng, Dawei


CCMB: A Large-scale Chinese Cross-modal Benchmark

arXiv.org Artificial Intelligence

Vision-language pre-training (VLP) on large-scale datasets has shown superior performance on various downstream tasks. In contrast to the abundance of benchmarks built on English corpora, large-scale pre-training datasets and downstream datasets with Chinese corpora remain largely unexplored. In this work, we build a large-scale, high-quality Chinese Cross-Modal Benchmark named CCMB for the research community, which contains the currently largest public pre-training dataset, Zero, and five human-annotated fine-tuning datasets for downstream tasks. Zero contains 250 million images paired with 750 million text descriptions, and two of the five fine-tuning datasets are also currently the largest of their kind for Chinese cross-modal downstream tasks. Along with CCMB, we also develop a VLP framework named R2D2, which applies a pre-Ranking + Ranking strategy to learn powerful vision-language representations and a two-way distillation method (i.e., target-guided distillation and feature-guided distillation) to further enhance the learning capability. With Zero and the R2D2 VLP framework, we achieve state-of-the-art performance on twelve downstream datasets spanning five broad task categories: image-text retrieval, image-text matching, image captioning, text-to-image generation, and zero-shot image classification. The datasets, models, and code are available at https://github.com/yuxie11/R2D2
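The two-way distillation the abstract names can be sketched in a minimal form: target-guided distillation matches the student's predicted distribution to the teacher's soft targets, and feature-guided distillation matches intermediate features. The function names, loss forms, and temperature below are illustrative assumptions, not the R2D2 implementation:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-softened softmax along the last axis."""
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def target_guided_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions (assumed form)."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(np.sum(p * (np.log(p) - np.log(q))))

def feature_guided_loss(student_feat, teacher_feat):
    """Mean-squared error between intermediate features (assumed form)."""
    return float(np.mean((student_feat - teacher_feat) ** 2))

teacher_logits = np.array([[2.0, 0.5, -1.0]])
student_logits = np.array([[1.5, 0.7, -0.5]])
loss = target_guided_loss(student_logits, teacher_logits) + \
       feature_guided_loss(np.ones((1, 4)), np.zeros((1, 4)))
```

A perfectly matched student drives both terms to zero, which is the sense in which the teacher "guides" training.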


Bridge Diffusion Model: bridge non-English language-native text-to-image diffusion model with English communities

arXiv.org Artificial Intelligence

Text-to-image generation (TTI) technologies are advancing rapidly, especially in English-language communities. However, English-native TTI models inherently carry biases from English-centric training data, which creates a dilemma for the development of other language-native TTI models. One common choice is to fine-tune an English-native TTI model with translated samples from non-English communities, but this falls short of fully addressing the model bias problem. Alternatively, training a non-English language-native model from scratch can effectively resolve the English-world bias, but it diverges from the English TTI communities and thus can no longer benefit from the strides continuously being made there. To build a non-English language-native TTI model while keeping compatibility with the English TTI communities, we propose a novel model structure referred to as the "Bridge Diffusion Model" (BDM). The proposed BDM employs a backbone-branch network structure to learn non-English language semantics while keeping the latent space compatible with the English-native TTI backbone, in an end-to-end manner. The unique advantage of the proposed BDM is that it is not only adept at generating images that precisely depict non-English language semantics, but also compatible with various English-native TTI plugins, such as different checkpoints, LoRA, ControlNet, DreamBooth, and Textual Inversion. Moreover, BDM can seamlessly combine both non-English-native and English-native semantics within a single image, fostering cultural interaction. We verify our method by applying BDM to build a Chinese-native TTI model; the method itself is generic and applicable to any other language.
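The backbone-branch idea can be illustrated with a toy residual computation: a frozen English-native backbone produces latents, and a trainable language branch injects a residual that steers those latents toward non-English semantics while the latent space itself is unchanged. All names, shapes, and the residual form below are illustrative assumptions; the actual BDM operates inside a diffusion network:

```python
import numpy as np

def frozen_backbone(latent, english_emb):
    """Stand-in for the frozen English-native TTI backbone; its weights never update."""
    return latent + 0.1 * english_emb

def bdm_step(latent, english_emb, native_emb, W_branch):
    """Backbone output plus a residual from the trainable language branch.

    Only W_branch (the branch) would receive gradients during training, so the
    result stays in the backbone's latent space and English-native plugins
    that operate on that space remain applicable.
    """
    base = frozen_backbone(latent, english_emb)
    residual = native_emb @ W_branch  # hypothetical branch: a single linear map
    return base + residual

latent = np.zeros(4)
english_emb = np.ones(4)
native_emb = np.ones(8)          # non-English text embedding (assumed dim)
W_branch = np.full((8, 4), 0.05) # trainable branch weights (assumed shape)
out = bdm_step(latent, english_emb, native_emb, W_branch)
```

With the branch weights at zero, the output reduces exactly to the frozen backbone's output, which is the compatibility property the abstract emphasizes.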


Enhance Information Propagation for Graph Neural Network by Heterogeneous Aggregations

arXiv.org Artificial Intelligence

The success of deep learning in computer vision and natural language processing has recently triggered a flood of research on applying neural networks to graph data (Wu et al., 2020). A graph is a simple yet versatile data structure jointly described by a set of nodes and a set of edges. Aside from the image and text data we are familiar with, much real-world data is better described as graphs and thus processed by graph neural networks, such as social networks (Fan et al., 2019), financial fraud detection (Wang et al., 2020), knowledge graphs (Zhang et al., 2020), biological interaction networks (Higham et al., 2008), and small molecules in drug discovery (Hu et al., 2019), to name a few. Since the seminal works (Kipf and Welling, 2016; Hamilton et al., 2017), tens of different graph neural network variants have been proposed, emphasizing different graph properties and design options. GNN research can be roughly divided into two categories: spectral-based and spatial-based. Spectral-based GNNs approximate the CNN's convolution by defining a Fourier transform on the graph (Kipf and Welling, 2016), which is where the name graph convolution network comes from.
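The spectral graph convolution cited above (Kipf and Welling, 2016) reduces, after their first-order approximation, to the propagation rule H' = D^{-1/2}(A + I)D^{-1/2} H W. A minimal NumPy sketch of one such layer, with a hypothetical identity weight matrix for illustration:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution layer: H' = D^{-1/2} (A + I) D^{-1/2} H W."""
    A_hat = A + np.eye(A.shape[0])           # add self-loops
    d = A_hat.sum(axis=1)                    # degrees of the self-looped graph
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))   # symmetric normalization D^{-1/2}
    return D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W

# Tiny 3-node path graph: 0 - 1 - 2
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
H = np.eye(3)   # one-hot node features
W = np.eye(3)   # identity weights, so the output is the normalized adjacency
H_new = gcn_layer(A, H, W)
```

Each row of `H_new` is a degree-normalized average over a node's 1-hop neighborhood (including itself), which is the information-propagation step the paper's heterogeneous aggregations build on.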


Heterogeneous Graph based Deep Learning for Biomedical Network Link Prediction

arXiv.org Artificial Intelligence

Multi-scale biomedical knowledge networks are expanding with emerging experimental technologies. Link prediction is increasingly used, especially in bipartite biomedical networks. We propose a graph neural network (GNN) method, namely the Graph Pair based Link Prediction model (GPLP), for predicting biomedical network links based solely on their topological interaction information. In GPLP, 1-hop subgraphs extracted from the known network interaction matrix are learned to predict missing links. To evaluate our method, three heterogeneous biomedical networks were used: a Drug-Target Interaction network (DTI), a Compound-Protein Interaction network (CPI) from NIH Tox21, and a Compound-Virus Inhibition network (CVI). In 5-fold cross-validation, our proposed GPLP method significantly outperforms the state-of-the-art baselines. In addition, robustness is tested under different levels of network incompleteness. Our method has potential applications in other biomedical networks.
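The 1-hop subgraph extraction the abstract describes can be sketched for a bipartite interaction matrix: for a candidate pair (i, j), collect each endpoint's 1-hop neighborhood from the known interactions and take the induced sub-matrix. The function name and exact neighborhood definition are illustrative assumptions; the paper's procedure may differ in detail:

```python
import numpy as np

def one_hop_subgraph(M, i, j):
    """Extract the 1-hop enclosing subgraph of pair (i, j).

    M: binary interaction matrix (rows = e.g. drugs, cols = e.g. targets).
    Returns the induced sub-matrix plus the selected row and column indices.
    """
    rows = np.flatnonzero(M[:, j])   # row-nodes interacting with column j
    cols = np.flatnonzero(M[i, :])   # column-nodes interacting with row i
    rows = np.union1d(rows, [i])     # always include the candidate pair itself
    cols = np.union1d(cols, [j])
    return M[np.ix_(rows, cols)], rows, cols

M = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 0]])
sub, r_idx, c_idx = one_hop_subgraph(M, 0, 1)
```

Such pair-centered sub-matrices are what a GNN can then score to decide whether the (i, j) link is likely present, using topology alone.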