Cangea, Cătălina
Active Acquisition for Multimodal Temporal Data: A Challenging Decision-Making Task
Kossen, Jannik, Cangea, Cătălina, Vértes, Eszter, Jaegle, Andrew, Patraucean, Viorica, Ktena, Ira, Tomasev, Nenad, Belgrave, Danielle
We introduce a challenging decision-making task that we call active acquisition for multimodal temporal data (A2MT). In many real-world scenarios, input features are not readily available at test time and must instead be acquired at significant cost. With A2MT, we aim to learn agents that actively select which modalities of an input to acquire, trading off acquisition cost and predictive performance. A2MT extends a previous task called active feature acquisition to temporal decision making about high-dimensional inputs. We propose a method based on the Perceiver IO architecture to address A2MT in practice. Our agents are able to solve a novel synthetic scenario requiring practically relevant cross-modal reasoning skills. On two large-scale, real-world datasets, Kinetics-700 and AudioSet, our agents successfully learn cost-reactive acquisition behavior. However, an ablation reveals they are unable to learn adaptive acquisition strategies, emphasizing the difficulty of the task even for state-of-the-art models. Applications of A2MT may be impactful in domains like medicine, robotics, or finance, where modalities differ in acquisition cost and informativeness.
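A minimal sketch of the trade-off A2MT formalises: an episode is scored by penalising both the final prediction loss and the cost of every modality acquisition. The binary acquisition mask, cost vector, and additive reward shape are illustrative assumptions, not the paper's agent or implementation.

    import numpy as np

    def episode_return(pred_loss, acquisitions, costs):
        """acquisitions: (T, M) binary mask over timesteps and modalities;
        costs: (M,) per-modality acquisition cost."""
        return -pred_loss - (acquisitions * costs).sum()

    rng = np.random.default_rng(0)
    acq = rng.integers(0, 2, size=(16, 2))  # T=16 steps, M=2 modalities
    print(episode_return(0.4, acq, np.array([0.01, 0.05])))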
Message Passing Neural Processes
Day, Ben, Cangea, Cătălina, Jamasb, Arian R., Liò, Pietro
Neural Processes (NPs) are powerful and flexible models able to incorporate uncertainty when representing stochastic processes, while maintaining a linear time complexity. However, NPs produce a latent description by aggregating independent representations of context points and lack the ability to exploit relational information present in many datasets. This renders NPs ineffective in settings where the stochastic process is primarily governed by neighbourhood rules, such as cellular automata (CA), and limits performance for any task where relational information remains unused. We address this shortcoming by introducing Message Passing Neural Processes (MPNPs), the first class of NPs that explicitly makes use of relational structure within the model. Our evaluation shows that MPNPs thrive at lower sampling rates on existing benchmarks and on the newly proposed CA and Cora-Branched tasks. We further report strong generalisation over density-based CA rule-sets and significant gains in challenging arbitrary-labelling and few-shot learning setups.
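A toy sketch of the core MPNP change relative to vanilla NPs: context-point encodings are refined by neighbourhood message passing before the permutation-invariant aggregation, rather than being encoded independently. The single mean-aggregation round and all shapes are illustrative assumptions.

    import numpy as np

    def message_passing(h, adj):
        """One round of neighbourhood averaging: h is (N, D), adj is (N, N) binary."""
        deg = adj.sum(axis=1, keepdims=True).clip(min=1)
        return (adj @ h) / deg

    N, D = 5, 8
    h = np.random.randn(N, D)                  # independent context encodings
    adj = (np.random.rand(N, N) < 0.4).astype(float)
    np.fill_diagonal(adj, 1.0)                 # retain self-information
    r = message_passing(h, adj).mean(axis=0)   # aggregated latent description
    print(r.shape)                             # (8,)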
Generative Graph Perturbations for Scene Graph Prediction
Knyazev, Boris, de Vries, Harm, Cangea, Cătălina, Taylor, Graham W., Courville, Aaron, Belilovsky, Eugene
Inferring objects and their relationships from an image is useful in many applications at the intersection of vision and language. Due to a long tail data distribution, the task is challenging, with the inevitable appearance of zero-shot compositions of objects and relationships at test time. Current models often fail to properly understand a scene in such cases, as during training they only observe a tiny fraction of the distribution corresponding to the most frequent compositions. This motivates us to study whether increasing the diversity of the training distribution, by generating replacements for parts of real scene graphs, can lead to better generalization. We employ generative adversarial networks (GANs) conditioned on scene graphs to generate augmented visual features. To increase their diversity, we propose several strategies to perturb the conditioning. One of them is to use a language model, such as BERT, to synthesize plausible yet still unlikely scene graphs. By evaluating our model on Visual Genome, we obtain both positive and negative results. This prompts us to make several observations that can potentially lead to further improvements.
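A hypothetical sketch of the simplest perturbation strategy described above: randomly replacing entities in (subject, predicate, object) triplets to synthesise rare compositions. The category list and replacement rule are illustrative; the paper additionally conditions a GAN on the perturbed graphs and uses BERT to score plausibility.

    import random

    random.seed(0)
    CATEGORIES = ["dog", "horse", "surfboard", "pizza"]

    def perturb(triplets, p=0.5):
        """Randomly swap the object category of each triplet with probability p."""
        out = []
        for subj, pred, obj in triplets:
            if random.random() < p:
                obj = random.choice(CATEGORIES)
            out.append((subj, pred, obj))
        return out

    print(perturb([("man", "riding", "horse"), ("dog", "on", "surfboard")]))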
Wiki-CS: A Wikipedia-Based Benchmark for Graph Neural Networks
Mernyei, Péter, Cangea, Cătălina
We present Wiki-CS, a novel dataset derived from Wikipedia for benchmarking Graph Neural Networks. The dataset consists of nodes corresponding to Computer Science articles, with edges based on hyperlinks and 10 classes representing different branches of the field. We use the dataset to evaluate semi-supervised node classification and single-relation link prediction models. Our experiments show that these methods perform well on a new domain, with structural properties different from earlier benchmarks. The dataset is publicly available, along with the implementation of the data pipeline and the benchmark experiments, at https://github.com/pmernyei/wiki-cs-dataset.
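For readers who want to try the benchmark, Wiki-CS is also mirrored in PyTorch Geometric; assuming a recent torch_geometric installation, a minimal node-classification setup looks roughly like this (the one-layer GCN is a placeholder baseline, not one of the paper's models).

    import torch
    from torch_geometric.datasets import WikiCS
    from torch_geometric.nn import GCNConv

    dataset = WikiCS(root="data/wiki-cs")       # downloads the dataset on first use
    data = dataset[0]                           # a single graph with 10 classes

    conv = GCNConv(dataset.num_features, dataset.num_classes)
    logits = conv(data.x, data.edge_index)      # per-node class scores
    print(logits.shape, data.train_mask.shape)  # Wiki-CS ships multiple train splits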
VideoNavQA: Bridging the Gap between Visual and Embodied Question Answering
Cangea, Cătălina, Belilovsky, Eugene, Liò, Pietro, Courville, Aaron
Embodied Question Answering (EQA) is a recently proposed task, where an agent is placed in a rich 3D environment and must act based solely on its egocentric input to answer a given question. The desired outcome is that the agent learns to combine capabilities such as scene understanding, navigation and language understanding in order to perform complex reasoning in the visual world. However, initial advancements combining standard vision and language methods with imitation and reinforcement learning algorithms have shown EQA might be too complex and challenging for these techniques. In order to investigate the feasibility of EQA-type tasks, we build the VideoNavQA dataset that contains pairs of questions and videos generated in the House3D environment. The goal of this dataset is to assess question-answering performance from nearly-ideal navigation paths, while considering a much more complete variety of questions than current instantiations of the EQA task. We investigate several models, adapted from popular VQA methods, on this new benchmark. This establishes an initial understanding of how well VQA-style methods can perform within this novel EQA paradigm.
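A minimal sketch of the kind of VQA-style baseline adapted here: pre-extracted per-frame features are pooled over time, the question is encoded with an LSTM, and the two are fused by concatenation for answer classification. All dimensions, the vocabulary size, and the answer count are illustrative assumptions.

    import torch
    import torch.nn as nn

    class VideoQA(nn.Module):
        def __init__(self, vocab=1000, n_answers=70, d=256):
            super().__init__()
            self.frame_enc = nn.Linear(2048, d)      # pre-extracted CNN features
            self.q_emb = nn.Embedding(vocab, d)
            self.q_enc = nn.LSTM(d, d, batch_first=True)
            self.head = nn.Linear(2 * d, n_answers)

        def forward(self, frames, question):
            v = self.frame_enc(frames).mean(dim=1)   # temporal average pooling
            _, (h, _) = self.q_enc(self.q_emb(question))
            return self.head(torch.cat([v, h[-1]], dim=-1))

    model = VideoQA()
    print(model(torch.randn(2, 30, 2048), torch.randint(0, 1000, (2, 12))).shape)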
Spatio-Temporal Deep Graph Infomax
Opolka, Felix L., Solomon, Aaron, Cangea, Cătălina, Veličković, Petar, Liò, Pietro, Hjelm, R Devon
Deep InfoMax (DIM; Hjelm et al., 2019) is a recent approach for unsupervised representation learning that derives embeddings by maximizing the mutual information between the output of an encoder and local patches of the input. DIM builds on Mutual Information Neural Estimation (MINE; Belghazi et al., 2018), which formulates an estimate Î(X; Y) of the mutual information between random variables X and Y using neural networks. These estimates are obtained by training a classifier (a.k.a. the discriminator or statistics network) to distinguish between samples from the joint distribution and the product of marginals. DIM applies this approach to representation learning by training both the encoder and the discriminator to maximize the mutual information between the random variables corresponding to local input patches and the embeddings. Deep Graph Infomax (DGI) extends this representation learning technique to non-temporal graphs, finding node embeddings that maximize the mutual information between local patches of the graph and summaries of the entire graph. Here, we build on these methods and propose a representation learning technique for spatio-temporal graphs. Furthermore, unlike previous work, we evaluate our embeddings in the regression rather than the classification setting.
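The contrastive objective described above can be made concrete with a small sketch: a bilinear discriminator scores (node embedding, graph summary) pairs and is trained to separate real pairs from pairs built from corrupted (here, shuffled) node features. The encoder is elided, and the readout and corruption choices are simplified assumptions.

    import torch
    import torch.nn as nn

    N, D = 100, 64
    h_pos = torch.randn(N, D)                   # stand-in for encoder outputs
    h_neg = h_pos[torch.randperm(N)]            # corruption: shuffled features
    s = torch.sigmoid(h_pos.mean(dim=0))        # readout: whole-graph summary

    W = nn.Parameter(torch.randn(D, D) * 0.01)  # bilinear discriminator weights
    score = lambda h: (h @ W * s).sum(dim=-1)   # logit per (patch, summary) pair

    loss = nn.functional.binary_cross_entropy_with_logits(
        torch.cat([score(h_pos), score(h_neg)]),
        torch.cat([torch.ones(N), torch.zeros(N)]),
    )
    print(loss.item())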
Structure-Based Networks for Drug Validation
Cangea, Cătălina, Grauslys, Arturas, Liò, Pietro, Falciani, Francesco
Classifying chemicals according to putative modes of action (MOAs) is of paramount importance in the context of risk assessment. However, current methods are only able to handle a very small proportion of the existing chemicals. We address this issue by proposing an integrative deep learning architecture that learns a joint representation from molecular structures of drugs and their effects on human cells. Our choice of architecture is motivated by the significant influence of a drug's chemical structure on its MOA. We improve on the strong ability of a unimodal architecture (F1 score of 0.803) to classify drugs by their toxic MOAs (Verhaar scheme) by adding another learning stream that processes transcriptional responses of human cells affected by drugs. Our integrative model achieves an even higher classification performance on the LINCS L1000 dataset, reducing the error by 4.6%. We believe that our method can be used to extend the current Verhaar scheme and constitute a basis for fast drug validation and risk assessment.
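A toy sketch of the integrative architecture's shape: one stream encodes a molecular-structure representation (e.g., a fingerprint), another encodes the L1000 transcriptional response over the 978 landmark genes, and a joint head predicts the Verhaar class. Layer sizes and the five-way output are illustrative assumptions, not the paper's exact design.

    import torch
    import torch.nn as nn

    class IntegrativeMOA(nn.Module):
        def __init__(self, d_struct=1024, d_expr=978, n_classes=5, d=128):
            super().__init__()
            self.struct = nn.Sequential(nn.Linear(d_struct, d), nn.ReLU())
            self.expr = nn.Sequential(nn.Linear(d_expr, d), nn.ReLU())
            self.head = nn.Linear(2 * d, n_classes)  # joint representation -> MOA

        def forward(self, fp, expr):
            return self.head(torch.cat([self.struct(fp), self.expr(expr)], dim=-1))

    model = IntegrativeMOA()
    print(model(torch.randn(4, 1024), torch.randn(4, 978)).shape)  # (4, 5)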
Towards Sparse Hierarchical Graph Classifiers
Cangea, Cătălina, Veličković, Petar, Jovanović, Nikola, Kipf, Thomas, Liò, Pietro
Recent advances in representation learning on graphs, mainly leveraging graph convolutional networks, have brought a substantial improvement on many graph-based benchmark tasks. While novel approaches to learning node embeddings are highly suitable for node classification and link prediction, their application to graph classification (predicting a single label for the entire graph) remains mostly rudimentary, typically using a single global pooling step to aggregate node features or a hand-designed, fixed heuristic for hierarchical coarsening of the graph structure. An important step towards ameliorating this is differentiable graph coarsening---the ability to reduce the size of the graph in an adaptive, data-dependent manner within a graph neural network pipeline, analogous to image downsampling within CNNs. However, the previous prominent approach to pooling has quadratic memory requirements during training and is therefore not scalable to large graphs. Here we combine several recent advances in graph neural network design to demonstrate that competitive hierarchical graph classification results are possible without sacrificing sparsity. Our results are verified on several established graph classification benchmarks, and highlight an important direction for future research in graph-based neural networks.
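A sketch of the sparse pooling step this line of work relies on (in the style of Graph U-Net top-k pooling): nodes are scored with a learnable projection, only the top fraction is kept, and the retained features are gated by their scores, so no dense cluster-assignment matrix is ever materialised. The function below is a simplified illustration, not the paper's full pipeline.

    import torch

    def topk_pool(x, p, ratio=0.5):
        """x: (N, D) node features; p: (D,) learnable projection vector."""
        scores = (x @ p) / p.norm()              # per-node importance scores
        k = max(1, int(ratio * x.size(0)))
        idx = scores.topk(k).indices             # indices of retained nodes
        return x[idx] * torch.sigmoid(scores[idx]).unsqueeze(-1), idx

    x, p = torch.randn(10, 16), torch.randn(16)
    pooled, kept = topk_pool(x, p)
    print(pooled.shape, kept.tolist())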
XFlow: 1D-2D Cross-modal Deep Neural Networks for Audiovisual Classification
Cangea, Cătălina, Veličković, Petar, Liò, Pietro
We propose two multimodal deep learning architectures that allow for cross-modal dataflow (XFlow) between the feature extractors, thereby extracting more interpretable features and obtaining a better representation than through unimodal learning, for the same amount of training data. These models can usefully exploit correlations between audio and visual data, which have a different dimensionality and are therefore nontrivially exchangeable. Our work improves on existing multimodal deep learning methodologies in two essential ways: (1) it presents a novel method for performing cross-modal information exchange before features are fully learned from the individual modalities, and (2) it extends the previously proposed cross-connections, which only transfer information between streams that process compatible data. Both cross-modal architectures outperformed their baselines (by up to 7.5%) when evaluated on the AVletters dataset.
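A toy sketch of a cross-modal connection between streams of different dimensionality: the 2D visual feature map is flattened and projected into the 1D audio stream, and vice versa, so information is exchanged while features are still being extracted. Shapes and the residual exchange are illustrative assumptions, not the paper's exact layers.

    import torch
    import torch.nn as nn

    class CrossConnect(nn.Module):
        def __init__(self, c2d=32, h=8, w=8, c1d=64, t=50):
            super().__init__()
            self.to_1d = nn.Linear(c2d * h * w, c1d)   # 2D visual -> 1D audio
            self.to_2d = nn.Linear(c1d * t, c2d * h * w)

        def forward(self, vis, aud):
            a_extra = self.to_1d(vis.flatten(1)).unsqueeze(-1)  # (B, c1d, 1)
            v_extra = self.to_2d(aud.flatten(1)).view_as(vis)
            return vis + v_extra, aud + a_extra                 # residual exchange

    vis, aud = torch.randn(2, 32, 8, 8), torch.randn(2, 64, 50)
    v, a = CrossConnect()(vis, aud)
    print(v.shape, a.shape)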