Fifty, Christopher
Restructuring Vector Quantization with the Rotation Trick
Fifty, Christopher, Junkins, Ronald G., Duan, Dennis, Iyengar, Aniketh, Liu, Jerry W., Amid, Ehsan, Thrun, Sebastian, Ré, Christopher
Vector Quantized Variational AutoEncoders (VQ-VAEs) are designed to compress a continuous input to a discrete latent space and reconstruct it with minimal distortion. They operate by maintaining a set of vectors--often referred to as the codebook--and quantizing each encoder output to the nearest vector in the codebook. However, as vector quantization is non-differentiable, the gradient to the encoder flows around the vector quantization layer rather than through it in a straight-through approximation. This approximation may be undesirable as all information from the vector quantization operation is lost. In this work, we propose a way to propagate gradients through the vector quantization layer of VQ-VAEs. We smoothly transform each encoder output into its corresponding codebook vector via a rotation and rescaling linear transformation that is treated as a constant during backpropagation. As a result, the relative magnitude and angle between encoder output and codebook vector become encoded into the gradient as it propagates through the vector quantization layer and back to the encoder. Across 11 different VQ-VAE training paradigms, we find this restructuring improves reconstruction metrics, codebook utilization, and quantization error.

Vector quantization (Gray, 1984) is an approach to discretize a continuous vector space. It defines a finite set of vectors--referred to as the codebook--and maps any vector in the continuous vector space to the closest vector in the codebook.
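A minimal PyTorch sketch of this idea, assuming encoder outputs e and their nearest codebook vectors q arrive as (batch, dim) tensors; the function name rotation_trick and the eps stabilizer are illustrative, not the paper's reference implementation:

    import torch

    def rotation_trick(e, q, eps=1e-6):
        """Map each encoder output e onto its nearest codebook vector q through a
        rotation and rescaling that are treated as constants during backpropagation,
        so the gradient to the encoder carries the relative angle and magnitude
        between e and q. e, q: (batch, dim); q is the nearest codebook entry."""
        e_norm = e.norm(dim=-1, keepdim=True) + eps
        q_norm = q.norm(dim=-1, keepdim=True) + eps
        e_hat = (e / e_norm).detach()
        q_hat = (q / q_norm).detach()
        r = e_hat + q_hat                                # degenerates when e ~ -q; eps keeps it finite
        r = r / (r.norm(dim=-1, keepdim=True) + eps)
        scale = (q_norm / e_norm).detach()
        # R e = e - 2 r (r^T e) + 2 q_hat (e_hat^T e). With R and scale held constant,
        # the forward value is (up to eps) q, while d(output)/d(e) = scale * R.
        Re = (e
              - 2 * r * (r * e).sum(dim=-1, keepdim=True)
              + 2 * q_hat * (e_hat * e).sum(dim=-1, keepdim=True))
        return scale * Re

No gradient flows to the codebook through this path; as in standard VQ-VAE training, codebook vectors would still be updated by a codebook or commitment loss (or an EMA rule).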
Context-Aware Meta-Learning
Fifty, Christopher, Duan, Dennis, Junkins, Ronald G., Amid, Ehsan, Leskovec, Jure, Ré, Christopher, Thrun, Sebastian
Large Language Models like ChatGPT demonstrate a remarkable capacity to learn new concepts during inference without any fine-tuning. However, visual models trained to detect new objects during inference have been unable to replicate this ability, and instead either perform poorly or require meta-training and/or fine-tuning on similar objects. In this work, we propose a meta-learning algorithm that emulates Large Language Models by learning new visual concepts during inference without fine-tuning. Our approach leverages a frozen pre-trained feature extractor, and analogous to in-context learning, recasts meta-learning as sequence modeling over datapoints with known labels and a test datapoint with an unknown label. On 8 out of 11 meta-learning benchmarks, our approach -- without meta-training or fine-tuning -- exceeds or matches the state-of-the-art algorithm, P>M>F, which is meta-trained on these benchmarks.
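A rough sketch of this recasting, assuming a frozen pre-trained extractor has already produced support and query embeddings; the class name SequenceMetaLearner, the simple label-embedding scheme, and all hyperparameters are illustrative placeholders rather than the paper's exact architecture:

    import torch
    import torch.nn as nn

    class SequenceMetaLearner(nn.Module):
        """Few-shot classification as sequence modeling: each support token
        concatenates a frozen image embedding with a label embedding; the query
        token uses a learned "unknown" label, and a Transformer encoder attends
        over the whole sequence to classify the query."""
        def __init__(self, feat_dim=512, n_way=5, label_dim=64, n_heads=8, n_layers=4):
            super().__init__()
            self.unknown = n_way                         # extra index reserved for the unknown label
            self.label_emb = nn.Embedding(n_way + 1, label_dim)
            d_model = feat_dim + label_dim               # must be divisible by n_heads
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, n_layers)
            self.head = nn.Linear(d_model, n_way)

        def forward(self, support_feats, support_labels, query_feat):
            # support_feats: (B, k, feat_dim); support_labels: (B, k) ints in [0, n_way);
            # query_feat: (B, feat_dim). All features come from the frozen extractor.
            sup = torch.cat([support_feats, self.label_emb(support_labels)], dim=-1)
            unk = self.label_emb(torch.full_like(support_labels[:, :1], self.unknown))
            qry = torch.cat([query_feat.unsqueeze(1), unk], dim=-1)
            seq = torch.cat([sup, qry], dim=1)           # (B, k + 1, d_model)
            return self.head(self.encoder(seq)[:, -1])   # class logits for the query datapoint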
In-Context Learning for Few-Shot Molecular Property Prediction
Fifty, Christopher, Leskovec, Jure, Thrun, Sebastian
In-context learning has become an important approach for few-shot learning in Large Language Models because of its ability to rapidly adapt to new tasks without fine-tuning model parameters. However, it is restricted to applications in natural language and inapplicable to other domains. In this paper, we adapt the concepts underpinning in-context learning to develop a new algorithm for few-shot molecular property prediction. Our approach learns to predict molecular properties from a context of (molecule, property measurement) pairs and rapidly adapts to new properties without fine-tuning. On the FS-Mol and BACE molecular property prediction benchmarks, we find this method surpasses the performance of recent meta-learning algorithms at small support sizes and is competitive with the best methods at large support sizes.

In-context learning describes an emergent property of large language models (LLMs) that enables them to solve new tasks from only a few demonstrations and without any gradient updates to the model parameters (Brown et al., 2020). This capacity to rapidly adapt to new tasks contrasts sharply with typical few-shot learning algorithms that use either gradient updates or distance computations to prototypical class centroids to adapt the pre-trained model to the few-shot learning objective. As a result, in-context learning has become a powerful approach for few-shot learning applications in natural language; however, it is inapplicable to other domains as it uses a language modeling objective to train the model. One such domain is molecular science, where few-shot learning is critical to drug discovery. After a biological target has been identified, finding small molecules that inhibit this target may lead to desirable outcomes. For example, inhibiting the protein 15-PGDH with a small molecule inhibitor leads to rejuvenation of aged skeletal muscle tissue in animal studies, effectively reverse-aging the cells (Palla et al., 2021).
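A hedged sketch of the analogous setup for property prediction, where each context token pairs a molecular embedding with its measured property and the query token carries a learned placeholder for the unknown measurement; the class name, dimensions, and simple concatenation scheme are assumptions for illustration, not the paper's model:

    import torch
    import torch.nn as nn

    class MoleculeContextModel(nn.Module):
        """Predict a property for a query molecule from a context of
        (molecule embedding, measured property) pairs, without gradient updates
        at adaptation time: the context is simply a longer input sequence."""
        def __init__(self, mol_dim=1024, d_model=256, n_heads=8, n_layers=4):
            super().__init__()
            self.proj = nn.Linear(mol_dim + 1, d_model)
            self.unknown = nn.Parameter(torch.zeros(1))      # placeholder for the query's measurement
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, n_layers)
            self.out = nn.Linear(d_model, 1)

        def forward(self, context_mols, context_labels, query_mol):
            # context_mols: (B, k, mol_dim); context_labels: (B, k); query_mol: (B, mol_dim)
            ctx = torch.cat([context_mols, context_labels.unsqueeze(-1)], dim=-1)
            unk = self.unknown.expand(query_mol.size(0), 1)
            qry = torch.cat([query_mol, unk], dim=-1).unsqueeze(1)
            seq = self.proj(torch.cat([ctx, qry], dim=1))    # (B, k + 1, d_model)
            return self.out(self.encoder(seq)[:, -1]).squeeze(-1)  # predicted property for the query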
Implicit Geometry and Interaction Embeddings Improve Few-Shot Molecular Property Prediction
Fifty, Christopher, Paggi, Joseph M., Amid, Ehsan, Leskovec, Jure, Dror, Ron
Few-shot learning is a promising approach to molecular property prediction as supervised data is often very limited. However, many important molecular properties depend on complex molecular characteristics -- such as the various 3D geometries a molecule may adopt or the types of chemical interactions it can form -- that are not explicitly encoded in the feature space and must be approximated from low amounts of data. Learning these characteristics can be difficult, especially for few-shot learning algorithms that are designed for fast adaptation to new tasks. In this work, we develop molecular embeddings that encode complex molecular characteristics to improve the performance of few-shot molecular property prediction. Our approach leverages large amounts of synthetic data, namely the results of molecular docking calculations, and a multi-task learning paradigm to structure the embedding space. On multiple molecular property prediction benchmarks, training from the embedding space substantially improves Multi-Task, MAML, and Prototypical Network few-shot learning performance. Our code is available at https://github.com/cfifty/IGNITE.
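One way to realize this multi-task pre-training, sketched under the assumption that molecules are featurized as fixed-length vectors and that docking calculations supply per-task regression targets with missing entries; the module and masked loss below are illustrative, not the released IGNITE code:

    import torch
    import torch.nn as nn

    class DockingMultiTaskEncoder(nn.Module):
        """Shared molecular encoder trained on many synthetic docking-score
        regression tasks, so that the embedding space reflects geometry and
        interaction information; the frozen embeddings then serve as inputs
        to few-shot learners (Multi-Task, MAML, Prototypical Networks)."""
        def __init__(self, in_dim, emb_dim, n_docking_tasks):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Linear(in_dim, emb_dim), nn.ReLU(), nn.Linear(emb_dim, emb_dim))
            self.heads = nn.ModuleList(nn.Linear(emb_dim, 1) for _ in range(n_docking_tasks))

        def forward(self, x):
            z = self.encoder(x)                                   # shared embedding
            return z, torch.cat([h(z) for h in self.heads], dim=-1)

    def multitask_loss(preds, targets, mask):
        # targets, mask: (B, n_tasks); mask marks which docking scores are available
        err = (preds - targets) ** 2 * mask
        return err.sum() / mask.sum().clamp(min=1)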
Measuring and Harnessing Transference in Multi-Task Learning
Fifty, Christopher, Amid, Ehsan, Zhao, Zhe, Yu, Tianhe, Anil, Rohan, Finn, Chelsea
Multi-task learning can leverage information learned by one task to benefit the training of other tasks. Despite this capacity, naïve formulations often degrade performance; in particular, identifying the tasks that would benefit from co-training remains a challenging design question. In this paper, we analyze the dynamics of information transfer, or transference, across tasks throughout training. Specifically, we develop a similarity measure that can quantify transference among tasks and use this quantity to both better understand the optimization dynamics of multi-task learning as well as improve overall learning performance. In the latter case, we propose two methods to leverage our transference metric. The first operates at a macro-level by selecting which tasks should train together while the second functions at a micro-level by determining how to combine task gradients at each training step. We find these methods can lead to significant improvement over prior work on three supervised multi-task learning benchmarks and one multi-task reinforcement learning paradigm.

Deciding if two or more objectives should be trained together in a multi-task model, as well as choosing how that model's parameters should be shared, is an inherently complex issue often left to human experts (Zhang & Yang, 2017). However, a human's understanding of similarity is motivated by their intuition and experience rather than a prescient knowledge of the underlying structures learned by a neural network.
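The lookahead idea behind the transference measure can be sketched as follows: update the parameters with one task's gradient and observe how another task's loss changes on the same batch. This helper is a simplified illustration (full-model update, plain SGD step), not the authors' implementation:

    import copy
    import torch

    def transference(model, loss_fns, batch, source, target, lr=0.1):
        """Relative change in the target task's loss after a one-step lookahead
        update using only the source task's gradient.
        loss_fns: dict task_name -> callable(model, batch) returning a scalar loss."""
        base = loss_fns[target](model, batch).item()
        lookahead = copy.deepcopy(model)                 # leave the real model untouched
        loss_src = loss_fns[source](lookahead, batch)
        lookahead.zero_grad()
        loss_src.backward()
        with torch.no_grad():
            for p in lookahead.parameters():
                if p.grad is not None:
                    p -= lr * p.grad                     # lookahead SGD step on the source gradient
        after = loss_fns[target](lookahead, batch).item()
        return 1.0 - after / base                        # > 0: the source update helps the target task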
Small Towers Make Big Differences
Wang, Yuyan, Zhao, Zhe, Dai, Bo, Fifty, Christopher, Lin, Dong, Hong, Lichan, Chi, Ed H.
Multi-task learning aims at solving multiple machine learning tasks at the same time. A good solution to a multi-task learning problem should be generalizable in addition to being Pareto optimal. In this paper, we provide insights into the trade-off between Pareto efficiency and generalization that arises from parameterization in multi-task deep learning models. As a multi-objective optimization problem, sufficient parameterization is needed to handle task conflicts in a constrained solution space; however, from a multi-task generalization perspective, over-parameterization undermines the benefit of learning a shared representation, which helps harder tasks or tasks with limited training examples. A delicate balance between multi-task generalization and multi-objective optimization is therefore needed to find a better trade-off between efficiency and generalization. To this end, we propose a method of under-parameterized self-auxiliaries for multi-task models to achieve the best of both worlds. It is task-agnostic and works with other multi-task learning algorithms. Empirical results show that small towers of under-parameterized self-auxiliaries can make big differences in improving Pareto efficiency in various multi-task applications.
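A minimal sketch of how under-parameterized self-auxiliaries might be attached to a standard shared-bottom multi-task model; the tower sizes, auxiliary loss weight, and module names are assumptions for illustration, not the paper's configuration:

    import torch
    import torch.nn as nn

    class MultiTaskWithSmallAuxTowers(nn.Module):
        """Each task keeps its usual tower and also gets an under-parameterized
        auxiliary tower on the shared representation; the auxiliary predictions
        are trained on the same task labels and act as a regularizer on the
        shared bottom rather than as extra capacity at inference time."""
        def __init__(self, in_dim, shared_dim, n_tasks, main_hidden=256, aux_hidden=16):
            super().__init__()
            self.shared = nn.Sequential(nn.Linear(in_dim, shared_dim), nn.ReLU())
            self.main = nn.ModuleList(
                nn.Sequential(nn.Linear(shared_dim, main_hidden), nn.ReLU(),
                              nn.Linear(main_hidden, 1)) for _ in range(n_tasks))
            self.aux = nn.ModuleList(                    # "small towers": far fewer parameters
                nn.Sequential(nn.Linear(shared_dim, aux_hidden), nn.ReLU(),
                              nn.Linear(aux_hidden, 1)) for _ in range(n_tasks))

        def forward(self, x):
            z = self.shared(x)
            return [m(z) for m in self.main], [a(z) for a in self.aux]

    def total_loss(main_out, aux_out, targets, criterion, aux_weight=0.1):
        # targets: list of per-task label tensors, one entry per task
        main = sum(criterion(o, t) for o, t in zip(main_out, targets))
        aux = sum(criterion(o, t) for o, t in zip(aux_out, targets))
        return main + aux_weight * aux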
Simplifying Graph Convolutional Networks
Wu, Felix, Zhang, Tianyi, de Souza Jr., Amauri Holanda, Fifty, Christopher, Yu, Tao, Weinberger, Kilian Q.
Graph Convolutional Networks (GCNs) and their variants have received significant attention and have become the de facto methods for learning graph representations. GCNs derive inspiration primarily from recent deep learning approaches, and as a result, may inherit unnecessary complexity and redundant computation. In this paper, we reduce this excess complexity by successively removing nonlinearities and collapsing weight matrices between consecutive layers. We theoretically analyze the resulting linear model and show that it corresponds to a fixed low-pass filter followed by a linear classifier. Notably, our experimental evaluation demonstrates that these simplifications do not negatively impact accuracy in many downstream applications. Moreover, the resulting model scales to larger datasets, is naturally interpretable, and yields up to two orders of magnitude speedup over FastGCN.
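The resulting linear model admits a very short implementation: precompute the K-step propagated features with the symmetrically normalized adjacency (with self-loops) and fit any linear classifier on top. A sketch with dense NumPy arrays (real graphs would use sparse matrices; the function name is illustrative):

    import numpy as np

    def sgc_features(adj, X, k=2):
        """Precompute S^k X, where S = D^{-1/2} (A + I) D^{-1/2} is the normalized
        adjacency with self-loops. adj: (n, n) array; X: (n, d) node features."""
        A = adj + np.eye(adj.shape[0])                        # add self-loops
        d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
        S = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]     # fixed low-pass filter
        for _ in range(k):
            X = S @ X                                         # k rounds of feature propagation
        return X                                              # feed to a logistic-regression classifier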