Yang, Hongxia
Contrastive Conditional Transport for Representation Learning
Zheng, Huangjie, Chen, Xu, Yao, Jiangchao, Yang, Hongxia, Li, Chunyuan, Zhang, Ya, Zhang, Hao, Tsang, Ivor, Zhou, Jingren, Zhou, Mingyuan
The classical contrastive loss (Oord et al., 2018; Poole et al., 2018) has achieved remarkable success in representation learning, benefiting downstream tasks in a variety of areas (Misra & Maaten, 2020; He et al., 2020; Chen et al., 2020a; Fang & Xie, 2020; Giorgi et al., 2020). The intuition behind the contrastive loss is that, given a query, its positive sample should be close while its negative samples should be far away in the representation space, for which the unit hypersphere is the most common assumption (Wang et al., 2017; Davidson et al., 2018). This learning scheme encourages the encoder to learn representations that are invariant to unnecessary details and uniformly distributed on the hypersphere, so as to maximally preserve relevant information (Hjelm et al., 2018; Tian et al., 2019; Bachman et al., 2019; Wang & Isola, 2020). A notable concern with the conventional contrastive loss is that the query's positive and negative samples are often uniformly sampled and treated equally in the comparison, which results in inefficient estimation and limits the performance of the learned representations (Saunshi et al., 2019b; Chuang et al., 2020). As illustrated in Figure 1, given a query, conventional contrastive learning (CL) methods usually take one positive sample at random to form the positive pair and treat all the other negative pairs equally, regardless of how informative a sample is to the query.
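To make the setup concrete, below is a minimal sketch of the conventional contrastive (InfoNCE-style) loss described above, in PyTorch. It assumes l2-normalized embeddings on the unit hypersphere and a temperature hyperparameter, and it treats all uniformly drawn negatives equally, which is exactly the limitation the paper targets; it is not the paper's conditional-transport objective.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query, positive, negatives, temperature=0.1):
    """Conventional contrastive loss: one positive per query, all negatives
    drawn uniformly and weighted equally (illustrative sketch).

    query:     (B, D) encoded queries
    positive:  (B, D) one positive key per query
    negatives: (B, K, D) K uniformly sampled negative keys per query
    """
    q = F.normalize(query, dim=-1)       # embeddings live on the unit hypersphere
    k_pos = F.normalize(positive, dim=-1)
    k_neg = F.normalize(negatives, dim=-1)

    pos_logit = (q * k_pos).sum(-1, keepdim=True) / temperature      # (B, 1)
    neg_logits = torch.einsum('bd,bkd->bk', q, k_neg) / temperature  # (B, K)

    logits = torch.cat([pos_logit, neg_logits], dim=1)               # (B, 1+K)
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)                           # positive sits at index 0
```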
Device-Cloud Collaborative Learning for Recommendation
Yao, Jiangchao, Wang, Feng, Jia, Kunyang, Han, Bo, Zhou, Jingren, Yang, Hongxia
With the rapid development of storage and computing power on mobile devices, it has become critical and popular to deploy models on devices to reduce onerous communication latency and to capture real-time features. While many works have explored facilitating on-device learning and inference, most of them focus on dealing with response delay or privacy protection. Little has been done to model the collaboration between device and cloud modeling so that both sides benefit jointly. To bridge this gap, we present one of the first attempts to study the Device-Cloud Collaborative Learning (DCCL) framework. Specifically, we propose a novel MetaPatch learning approach on the device side to efficiently achieve "thousands of people with thousands of models" given a centralized cloud model. Then, with billions of updated personalized device models, we propose a "model-over-models" distillation algorithm, namely MoMoDistill, to update the centralized cloud model. Our extensive experiments over a range of datasets with different settings demonstrate the effectiveness of such collaboration on both the cloud and device sides, especially its superiority in modeling long-tailed users.
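As a rough illustration of the "model-over-models" idea, the hedged sketch below distills an ensemble of personalized device models into a centralized cloud model with a standard soft-label distillation loss; the function and its hyperparameters are illustrative assumptions, not the paper's MetaPatch or MoMoDistill procedures.

```python
import torch
import torch.nn.functional as F

def distill_from_device_models(cloud_model, device_models, batch, temperature=2.0):
    """Generic 'models as teachers' distillation step (illustrative only):
    soften the predictions of personalized device models and fit the
    centralized cloud model to their average. MoMoDistill additionally
    exploits the device models' parameters and meta information, omitted here.
    """
    with torch.no_grad():
        teacher_probs = torch.stack(
            [F.softmax(m(batch) / temperature, dim=-1) for m in device_models]
        ).mean(dim=0)
    student_log_probs = F.log_softmax(cloud_model(batch) / temperature, dim=-1)
    # KL(teacher || student), scaled by T^2 as in standard knowledge distillation
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2
```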
Controllable Generation from Pre-trained Language Models via Inverse Prompting
Zou, Xu, Yin, Da, Zhong, Qingyang, Yang, Hongxia, Yang, Zhilin, Tang, Jie
Large-scale pre-trained language models have demonstrated strong capabilities of generating realistic text. However, it remains challenging to control the generation results. Previous approaches such as prompting are far from sufficient, which limits the usage of language models. To tackle this challenge, we propose an innovative method, inverse prompting, to better control text generation. The core idea of inverse prompting is to use the generated text to inversely predict the prompt during beam search, which enhances the relevance between the prompt and the generated text and provides better controllability. Empirically, we pre-train a large-scale Chinese language model to perform a systematic study using human evaluation on the tasks of open-domain poem generation and open-domain long-form question answering. Our results show that our proposed method substantially outperforms the baselines and that our generation quality is close to human performance on some of the tasks. Users can try our poem generation demo at https://pretrain.aminer.cn/apps/poetry.html and our QA demo at https://pretrain.aminer.cn/app/qa. For researchers, the code is provided at https://github.com/THUDM/InversePrompting.
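A minimal sketch of the inverse-prompting idea: re-rank beam-search candidates by how well the generated text predicts the original prompt. The helper `lm_log_likelihood` and the weighting factor `alpha` are assumptions for illustration, not the authors' exact scoring function.

```python
def inverse_prompting_score(candidate, prompt, lm_log_likelihood, alpha=1.0):
    """Score a beam-search candidate by combining the usual generation
    likelihood with how well the generated text 'inversely' predicts the prompt.

    lm_log_likelihood(context, target) is an assumed helper returning the
    language model's log-probability of `target` conditioned on `context`.
    """
    forward = lm_log_likelihood(prompt, candidate)   # standard generation score
    inverse = lm_log_likelihood(candidate, prompt)   # inverse-prompting term
    return forward + alpha * inverse

def rerank_beam(candidates, prompt, lm_log_likelihood, alpha=1.0):
    # Prefer candidates whose generated text best predicts the original prompt.
    return sorted(
        candidates,
        key=lambda c: inverse_prompting_score(c, prompt, lm_log_likelihood, alpha),
        reverse=True,
    )
```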
CogDL: An Extensive Toolkit for Deep Learning on Graphs
Cen, Yukuo, Hou, Zhenyu, Wang, Yan, Chen, Qibin, Luo, Yizhen, Yao, Xingcheng, Zeng, Aohan, Guo, Shiguang, Zhang, Peng, Dai, Guohao, Wang, Yu, Zhou, Chang, Yang, Hongxia, Tang, Jie
Graph representation learning aims to learn low-dimensional node embeddings for graphs. It is used in several real-world applications such as social network analysis and large-scale recommender systems. In this paper, we introduce CogDL, an extensive research toolkit for deep learning on graphs that allows researchers and developers to easily conduct experiments and build applications. It provides standard training and evaluation for the most important tasks in the graph domain, including node classification, link prediction, graph classification, and other graph tasks. For each task, it offers implementations of state-of-the-art models. The models in our toolkit are divided into two major parts: graph embedding methods and graph neural networks. Most of the graph embedding methods learn node-level or graph-level representations in an unsupervised way and preserve graph properties such as structural information, while graph neural networks capture node features and work in semi-supervised or self-supervised settings. All models implemented in our toolkit can easily reproduce the leaderboard results. Most models in CogDL are developed on top of PyTorch, and users can leverage the advantages of PyTorch to implement their own models. Furthermore, we demonstrate the effectiveness of CogDL for real-world applications in AMiner, which is a large academic database and system.
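For intuition on the graph neural network side, here is a minimal single-layer graph convolution in PyTorch; it only sketches the message passing such models perform and is not CogDL's own API.

```python
import torch

def gcn_layer(adj, features, weight):
    """One graph-convolution step: add self-loops, symmetrically normalize the
    adjacency, aggregate neighbor features, then apply a linear map and ReLU.
    Illustrative sketch of GNN message passing, not CogDL's interface."""
    a = adj + torch.eye(adj.size(0))                 # self-loops
    d_inv_sqrt = a.sum(1).pow(-0.5)                  # D^{-1/2}
    a_norm = d_inv_sqrt.unsqueeze(1) * a * d_inv_sqrt.unsqueeze(0)
    return torch.relu(a_norm @ features @ weight)
```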
Inductive Granger Causal Modeling for Multivariate Time Series
Chu, Yunfei, Wang, Xiaowei, Ma, Jianxin, Jia, Kunyang, Zhou, Jingren, Yang, Hongxia
Granger causal modeling is an emerging topic that can uncover the Granger causal relationships behind multivariate time series data. In many real-world systems, it is common to encounter a large amount of multivariate time series data collected from different individuals that share commonalities. However, there are ongoing concerns regarding Granger causality's applicability in such large-scale complex scenarios, presenting both challenges and opportunities for Granger causal structure reconstruction. Existing methods usually train a distinct model for each individual, suffering from inefficiency and over-fitting issues. To bridge this gap, we propose an Inductive GRanger cAusal modeling (InGRA) framework for inductive Granger causality learning and common causal structure detection on multivariate time series, which exploits the shared commonalities underlying the different individuals. In particular, we train one global model for individuals with different Granger causal structures through a novel attention mechanism, called prototypical Granger causal attention. The model can detect common causal structures for different individuals and infer Granger causal structures for newly arrived individuals. Extensive experiments, as well as an online A/B test on an e-commerce advertising platform, demonstrate the superior performance of InGRA.
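For background, the sketch below shows the classical regression-based Granger check that InGRA generalizes: a candidate series is deemed Granger-causal if its lags reduce the prediction error of the target's autoregression. The lag choice is an illustrative assumption; InGRA itself learns such relationships inductively with prototypical Granger causal attention rather than fitting per-pair regressions.

```python
import numpy as np

def granger_improvement(target, candidate, max_lag=3):
    """Relative drop in residual variance when the candidate's lags are added
    to an autoregression of the target (classical Granger-causality signal)."""
    T = len(target)
    y = target[max_lag:]
    own_lags = np.column_stack([target[max_lag - l: T - l] for l in range(1, max_lag + 1)])
    cand_lags = np.column_stack([candidate[max_lag - l: T - l] for l in range(1, max_lag + 1)])

    def residual_var(X):
        X = np.column_stack([np.ones(len(y)), X])          # intercept + lag features
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return np.var(y - X @ beta)

    restricted = residual_var(own_lags)                     # target's own history only
    full = residual_var(np.column_stack([own_lags, cand_lags]))  # plus candidate's history
    return (restricted - full) / restricted                 # > 0 suggests Granger influence
```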
DVE: Dynamic Variational Embeddings with Applications in Recommender Systems
Liu, Meimei, Yang, Hongxia
Embedding is a useful technique to project high-dimensional features into a low-dimensional space, and it has many successful applications including link prediction, node classification and natural language processing. Current approaches mainly focus on static data, which usually leads to unsatisfactory performance in applications involving large changes over time. How to dynamically characterize the variation of the embedded features is still largely unexplored. In this paper, we introduce a dynamic variational embedding (DVE) approach for sequence-aware data based on recent advances in recurrent neural networks. DVE can model a node's intrinsic nature and temporal variation explicitly and simultaneously, both of which are crucial for exploration. We further apply DVE to sequence-aware recommender systems, and develop an end-to-end neural architecture for link prediction.
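A hedged sketch of what a dynamic variational embedding might look like: a static embedding for a node's intrinsic nature plus a GRU-parameterized Gaussian latent for its temporal variation. The architecture details (dimensions, a single GRU layer, additive combination) are assumptions for illustration, not the paper's exact DVE.

```python
import torch
import torch.nn as nn

class DynamicVariationalEmbedding(nn.Module):
    """Static (intrinsic) node embedding + RNN-driven Gaussian latent that
    tracks temporal variation. Illustrative architecture only."""

    def __init__(self, num_nodes, dim):
        super().__init__()
        self.static = nn.Embedding(num_nodes, dim)   # intrinsic nature of each node
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.to_mu = nn.Linear(dim, dim)
        self.to_logvar = nn.Linear(dim, dim)

    def forward(self, node_ids, behavior_seq):
        # behavior_seq: (B, T, dim) embedded interaction sequence
        h, _ = self.rnn(behavior_seq)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
        return self.static(node_ids).unsqueeze(1) + z             # (B, T, dim) dynamic embedding
```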
Controllable Multi-Interest Framework for Recommendation
Cen, Yukuo, Zhang, Jianwei, Zou, Xu, Zhou, Chang, Yang, Hongxia, Tang, Jie
Recently, neural networks have been widely used in e-commerce recommender systems, owing to the rapid development of deep learning. We formalize the recommender system as a sequential recommendation problem, intending to predict the next items that a user might interact with. Recent works usually give an overall embedding from a user's behavior sequence. However, a unified user embedding cannot reflect the user's multiple interests over a period of time. In this paper, we propose a novel controllable multi-interest framework for sequential recommendation, called ComiRec. Our multi-interest module captures multiple interests from user behavior sequences, which can be exploited for retrieving candidate items from the large-scale item pool. These items are then fed into an aggregation module to obtain the overall recommendation. The aggregation module leverages a controllable factor to balance recommendation accuracy and diversity. We conduct experiments for sequential recommendation on two real-world datasets, Amazon and Taobao. Experimental results demonstrate that our framework achieves significant improvements over state-of-the-art models. Our framework has also been successfully deployed on the offline Alibaba distributed cloud platform.
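The controllable trade-off can be sketched as a greedy aggregation that scores each candidate by relevance plus a diversity bonus weighted by a factor `lam`; the scoring details below (category-based diversity) are illustrative assumptions, not the exact ComiRec aggregation module.

```python
import numpy as np

def controllable_aggregation(relevance, category, top_n, lam=0.5):
    """Greedily select top_n items, balancing accuracy and diversity.

    relevance: (N,) relevance score of each candidate item to the user
    category:  (N,) item category, used here as a simple diversity signal
    lam:       controllable factor; larger values favor diversity
    """
    selected = []
    candidates = set(range(len(relevance)))
    while len(selected) < top_n and candidates:
        def gain(i):
            # diversity gain = fraction of already-selected items in other categories
            div = np.mean([category[i] != category[j] for j in selected]) if selected else 0.0
            return relevance[i] + lam * div
        best = max(candidates, key=gain)
        selected.append(best)
        candidates.remove(best)
    return selected
```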
GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training
Qiu, Jiezhong, Chen, Qibin, Dong, Yuxiao, Zhang, Jing, Yang, Hongxia, Ding, Ming, Wang, Kuansan, Tang, Jie
Graph representation learning has emerged as a powerful technique for addressing real-world problems. Various downstream graph learning tasks have benefited from its recent developments, such as node classification, similarity search, and graph classification. However, prior art on graph representation learning focuses on domain-specific problems and trains a dedicated model for each graph dataset, which is usually non-transferable to out-of-domain data. Inspired by recent advances in pre-training from natural language processing and computer vision, we design Graph Contrastive Coding (GCC) -- a self-supervised graph neural network pre-training framework -- to capture the universal network topological properties across multiple networks. We design GCC's pre-training task as subgraph instance discrimination in and across networks and leverage contrastive learning to empower graph neural networks to learn intrinsic and transferable structural representations. We conduct extensive experiments on three graph learning tasks and ten graph datasets. The results show that GCC pre-trained on a collection of diverse datasets can achieve competitive or better performance compared to its task-specific, trained-from-scratch counterparts. This suggests that the pre-training and fine-tuning paradigm holds great potential for graph representation learning.
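Subgraph instance discrimination needs a way to draw subgraph instances around a node; below is a minimal random-walk-with-restart sampler of the kind commonly used for this purpose. The restart probability and walk length are illustrative, and GCC's further steps (subgraph anonymization, GNN encoding, and the contrastive loss itself) are omitted.

```python
import random

def rwr_subgraph(adj, start, walk_length=64, restart_prob=0.8):
    """Sample a subgraph instance around `start` via random walk with restart.

    adj: dict mapping node -> list of neighbor nodes
    Returns the set of visited nodes; two such samples from the same ego node
    can serve as a positive pair for instance discrimination.
    """
    visited = {start}
    current = start
    for _ in range(walk_length):
        if random.random() < restart_prob or not adj[current]:
            current = start                      # restart at the ego node
        else:
            current = random.choice(adj[current])
        visited.add(current)
    return visited
```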
Understanding Negative Sampling in Graph Representation Learning
Yang, Zhen, Ding, Ming, Zhou, Chang, Yang, Hongxia, Zhou, Jingren, Tang, Jie
Graph representation learning has been extensively studied in recent years. Despite its potential in generating continuous embeddings for various networks, both the effectiveness and the efficiency of inferring high-quality representations for a large corpus of nodes remain challenging. Sampling is a critical point in achieving these performance goals. Prior work usually focuses on sampling positive node pairs, while the strategy for negative sampling is left insufficiently explored. To bridge the gap, we systematically analyze the role of negative sampling from the perspectives of both objective and risk, theoretically demonstrating that negative sampling is as important as positive sampling in determining the optimization objective and the resulting variance. To the best of our knowledge, we are the first to derive the theory and quantify that the negative sampling distribution should be positively but sub-linearly correlated to the corresponding positive sampling distribution. With the guidance of the theory, we propose MCNS, which approximates the positive distribution with a self-contrast approximation and accelerates negative sampling with Metropolis-Hastings. We evaluate our method on 5 datasets that cover extensive downstream graph learning tasks, including link prediction, node classification and personalized recommendation, over a total of 19 experimental settings. These relatively comprehensive experimental results demonstrate its robustness and superiority.
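The prescription that negatives be drawn positively but sub-linearly with respect to the positive distribution can be sketched with a Metropolis-Hastings sampler whose stationary distribution is proportional to pos_score(v)**alpha. The uniform proposal and alpha=0.75 below are illustrative assumptions, not MCNS's exact proposal or self-contrast approximation.

```python
import random

def mh_negative_sampler(nodes, pos_score, alpha=0.75, burn_in=100):
    """Yield negative samples from a distribution proportional to
    pos_score(v) ** alpha using Metropolis-Hastings with a uniform proposal.

    pos_score: callable mapping a candidate node to its (approximate)
               positive-distribution score for the current query node
    """
    target = lambda v: pos_score(v) ** alpha
    current = random.choice(nodes)
    step = 0
    while True:
        proposal = random.choice(nodes)                         # symmetric uniform proposal
        accept = min(1.0, target(proposal) / max(target(current), 1e-12))
        if random.random() < accept:
            current = proposal
        step += 1
        if step > burn_in:
            yield current                                       # one negative per MH step after burn-in
```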
Contrastive Learning for Debiased Candidate Generation in Large-Scale Recommender Systems
Zhou, Chang, Ma, Jianxin, Zhang, Jianwei, Zhou, Jingren, Yang, Hongxia
Deep candidate generation (DCG), which narrows down the collection of relevant items from billions to hundreds via representation learning, is essential to large-scale recommender systems. Standard approaches approximate maximum likelihood estimation (MLE) through sampling for better scalability and address the problem of DCG in a way similar to language modeling. However, live recommender systems face severe unfairness of exposure with a vocabulary several orders of magnitude larger than that of natural language, implying that (1) MLE will preserve and even exacerbate the exposure bias in the long run in order to faithfully fit the observed samples, and (2) suboptimal sampling and inadequate use of item features can lead to inferior representations for the unfairly ignored items. In this paper, we introduce CLRec, a Contrastive Learning paradigm that has been successfully deployed in a real-world massive recommender system, to alleviate exposure bias in DCG. We theoretically prove that a popular choice of contrastive loss is equivalent to reducing the exposure bias via inverse propensity scoring, which provides a new perspective on the effectiveness of contrastive learning. We further employ a fixed-size queue to store item representations computed in previously processed batches, and use this queue as an effective sampler of negative examples. This queue-based design provides great efficiency in incorporating rich features of the thousands of negative items per batch, thanks to computation reuse. Extensive offline analyses and four-month online A/B tests in Mobile Taobao demonstrate substantial improvements, including a dramatic reduction in the Matthew effect.
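A minimal sketch of the queue-based negative sampler: item representations from previous batches are kept in a fixed-size FIFO queue and reused as negatives in the contrastive loss, so each batch sees many negatives at little extra cost. The queue size, temperature, and loss form below are illustrative assumptions rather than CLRec's production implementation.

```python
import torch
import torch.nn.functional as F

class QueueNegatives:
    """Fixed-size FIFO queue of item representations from previous batches,
    reused as negatives for a contrastive loss (illustrative sketch)."""

    def __init__(self, dim, size=2560):
        self.queue = F.normalize(torch.randn(size, dim), dim=-1)
        self.ptr = 0

    def loss(self, user_emb, item_emb, temperature=0.1):
        u = F.normalize(user_emb, dim=-1)                     # (B, D) user/query encodings
        i = F.normalize(item_emb, dim=-1)                     # (B, D) clicked-item encodings
        pos = (u * i).sum(-1, keepdim=True) / temperature     # positive logits
        neg = u @ self.queue.t() / temperature                # queued items as negatives
        logits = torch.cat([pos, neg], dim=1)
        labels = torch.zeros(len(u), dtype=torch.long, device=logits.device)
        return F.cross_entropy(logits, labels)

    @torch.no_grad()
    def enqueue(self, item_emb):
        # Overwrite the oldest entries with the current batch's item representations.
        b = item_emb.size(0)
        idx = (self.ptr + torch.arange(b)) % self.queue.size(0)
        self.queue[idx] = F.normalize(item_emb, dim=-1)
        self.ptr = (self.ptr + b) % self.queue.size(0)
```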