Learning useful representations without supervision remains a key challenge in machine learning. In this paper, we propose a simple yet powerful generative model that learns such discrete representations. Our model, the Vector Quantised-Variational AutoEncoder (VQ-VAE), differs from VAEs in two key ways: the encoder network outputs discrete, rather than continuous, codes; and the prior is learnt rather than static. In order to learn a discrete latent representation, we incorporate ideas from vector quantisation (VQ). Using the VQ method allows the model to circumvent issues of "posterior collapse" (where the latents are ignored when they are paired with a powerful autoregressive decoder), typically observed in the VAE framework.
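As a rough illustration of the vector quantisation idea mentioned above, the following sketch maps each continuous encoder output to the index of its nearest codebook vector under squared Euclidean distance. The function names (`squared_distance`, `quantise`), the toy codebook, and the toy encoder outputs are assumptions made for illustration; the sketch leaves out training entirely (codebook learning, gradient estimation, the learnt prior) and is not the authors' implementation.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantise(
    encoder_outputs: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> Tuple[List[int], List[List[float]]]:
    """Map each encoder output to its nearest codebook entry.

    Returns the discrete codes (codebook indices) and the quantised vectors.
    """
    codes: List[int] = []
    quantised: List[List[float]] = []
    for z in encoder_outputs:
        # Index of the codebook vector closest to z.
        idx = min(
            range(len(codebook)),
            key=lambda k: squared_distance(z, codebook[k]),
        )
        codes.append(idx)
        quantised.append(list(codebook[idx]))
    return codes, quantised


# Hypothetical toy codebook with three 2-D entries (illustrative values only).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
encoder_outputs = [[0.9, 1.2], [-0.1, 0.2], [-0.8, 1.7]]

codes, quantised = quantise(encoder_outputs, codebook)
print(codes)  # → [1, 0, 2]
```

The discrete codes are what a downstream prior would model; the quantised vectors are what a decoder would receive in place of the continuous encoder outputs.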
Recent successes in human action recognition with deep learning methods mostly adopt the supervised learning paradigm, which requires a significant amount of manually labeled data to achieve good performance. However, label collection is an expensive and time-consuming process. In this work, we propose an unsupervised learning framework that exploits unlabeled data to learn video representations. Unlike previous work on video representation learning, our unsupervised learning task is to predict 3D motion in multiple target views using the video representation from a source view. By learning to extrapolate cross-view motions, the representation can capture view-invariant motion dynamics that are discriminative for the action.
Artificial intelligence refers to a variety of software and hardware technologies that can be applied in numerous ways for different applications. The terms 'machine learning' and 'deep learning' are often used interchangeably in the media, but they are not the same thing. In machine learning, the machine builds up the knowledge to complete specific actions based on training data covering multiple datasets. There are many examples of machine learning in our daily lives. The performance of machine learning algorithms is directly related to the available information, which is referred to as 'representation'.
Self-supervised learning on graphs has recently drawn a lot of attention due to its independence from labels and its robustness in representation. Current studies on this topic mainly use static information such as graph structures but cannot adequately capture dynamic information such as the timestamps of edges. Realistic graphs are often dynamic, which means that interactions between nodes occur at specific times. This paper proposes a self-supervised dynamic graph representation learning framework (DySubC), which defines a temporal subgraph contrastive learning task to simultaneously learn the structural and evolutional features of a dynamic graph. Specifically, a novel temporal subgraph sampling strategy is first proposed, which takes each node of the dynamic graph as the central node and uses both neighborhood structures and edge timestamps to sample the corresponding temporal subgraph. After the nodes in each subgraph are encoded, the subgraph representation function is designed according to the influence of neighborhood nodes on the central node. Finally, structural and temporal contrastive losses are defined to maximize the mutual information between the node representation and the temporal subgraph representation. Experiments on five real-world datasets demonstrate that (1) DySubC performs better than the related baselines, including two graph contrastive learning models and four dynamic graph representation learning models, in the downstream link prediction task, and (2) the use of temporal information not only helps sample more effective subgraphs but also yields better representations through the temporal contrastive loss.
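To make the idea of sampling a subgraph from both neighborhood structure and edge timestamps more concrete, here is a minimal sketch. The abstract does not spell out the actual DySubC sampling rule, so the greedy "prefer the most recent edge" criterion below is purely an assumption, as are the names `build_adjacency` and `sample_temporal_subgraph` and the toy edge list.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Edge = Tuple[int, int, float]  # (source, target, timestamp)


def build_adjacency(edges: List[Edge]) -> Dict[int, List[Tuple[int, float]]]:
    """Undirected adjacency list that keeps each edge's timestamp."""
    adj: Dict[int, List[Tuple[int, float]]] = defaultdict(list)
    for u, v, t in edges:
        adj[u].append((v, t))
        adj[v].append((u, t))
    return dict(adj)


def sample_temporal_subgraph(
    adj: Dict[int, List[Tuple[int, float]]], center: int, size: int
) -> List[int]:
    """Greedily grow a node set around `center`.

    ASSUMPTION (not from the paper): at each step, add the unvisited node
    reached by the most recent edge leaving the current subgraph.
    """
    selected = [center]
    seen = {center}
    while len(selected) < size:
        # Edges from the current subgraph to nodes not yet selected.
        frontier = [
            (t, v)
            for u in selected
            for v, t in adj.get(u, ())
            if v not in seen
        ]
        if not frontier:
            break  # the connected component is exhausted
        _, nxt = max(frontier)  # latest timestamp; ties broken by node id
        selected.append(nxt)
        seen.add(nxt)
    return selected


# Hypothetical toy dynamic graph (illustrative timestamps only).
edges = [(0, 1, 1.0), (0, 2, 5.0), (2, 3, 4.0), (1, 4, 9.0), (3, 5, 2.0)]
adj = build_adjacency(edges)
subgraph_nodes = sample_temporal_subgraph(adj, center=0, size=4)
print(subgraph_nodes)  # → [0, 2, 3, 5]
```

In a contrastive setup of the kind described, a subgraph sampled this way would serve as the positive counterpart of its central node, with other subgraphs acting as negatives.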
Self-supervised learning has become an exciting direction in the AI community. Predicting What You Already Know Helps: Provable Self-Supervised Learning. For self-supervised learning, Rationality implies generalization, provably. Can Pretext-Based Self-Supervised Learning Be Boosted by Downstream Data? FAIR Self-Supervision Benchmark: various benchmark (and legacy) tasks for evaluating the quality of visual representations learned by various self-supervision approaches.