Relational Self-Attention: What's Missing in Attention for Video Understanding (Supplementary Material)

Neural Information Processing Systems

For the bottlenecks including RSA layers, we randomly initialize weights using MSRA initialization [3] and set the gamma parameter of the last batch normalization layer to zero. We implement our model based on TSN in PyTorch under the BSD 2-Clause license. All of the benchmarks we use are datasets commonly used for academic purposes. Unless specified otherwise, the training and testing details are the same as those in Sec. 5.1. Since each RSA kernel generated by each query captures a distinct motion pattern, the model can learn diverse motion features (see Figure 3). In this experiment, we choose L = 8 as the default.
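The initialization scheme described above can be sketched in plain NumPy. This is a hedged illustration of MSRA (He) initialization and the zero-gamma trick, not the paper's code; the names `msra_init` and `gamma_last_bn` are ours, and the fan sizes are arbitrary:

```python
import numpy as np

def msra_init(fan_in, fan_out, rng=None):
    """MSRA/He initialization: weights ~ N(0, sqrt(2 / fan_in)), suited to ReLU nets."""
    rng = rng or np.random.default_rng(0)
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_out, fan_in))

# Zero-initializing the scale (gamma) of a block's last batch-norm layer makes
# each residual branch start out as (approximately) the identity, which tends
# to stabilize early training of deep residual networks.
gamma_last_bn = np.zeros(64)

W = msra_init(fan_in=256, fan_out=64)  # sample std close to sqrt(2/256)
```

In PyTorch the same effect is obtained with `torch.nn.init.kaiming_normal_` on the weights and zeroing the final BatchNorm's `weight` parameter.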



Funnel-Transformer: Supplementary Material

Neural Information Processing Systems

For easier derivation, we have introduced the notation q_i. Sequence-level prediction: this is essentially the case we consider in most of our experiments, where we want to obtain a vectorial representation of the input sequence, as in text classification. Finally, although we focus our discussion on NLP tasks in this paper, Funnel-Transformer could be applied to any task dealing with sequential data, such as time series and video stream analysis. B.1 Preprocessing & Tokenization: for all experiments conducted in this work, we simply adapt the "uncased" word piece model originally used by BERT [2], where the vocabulary size is about 30K. Specifically, we find that training can be unstable when the depth goes beyond 24 layers (in the case of B10-10-10H1024) at base scale, especially for the MLM objective.
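Sequence-level prediction as described above collapses a matrix of per-token hidden states into one vector before the classifier head. A minimal NumPy sketch of the two common pooling choices (the function name and `pooling` argument are illustrative, not Funnel-Transformer's actual API):

```python
import numpy as np

def sequence_representation(hidden_states, pooling="first"):
    """Collapse a (seq_len, d_model) matrix of token states into one vector.

    'first' mimics BERT-style [CLS]-token pooling; 'mean' averages all tokens.
    """
    if pooling == "first":
        return hidden_states[0]
    if pooling == "mean":
        return hidden_states.mean(axis=0)
    raise ValueError(f"unknown pooling: {pooling}")

H = np.arange(12.0).reshape(4, 3)          # 4 tokens, d_model = 3
v = sequence_representation(H, "mean")     # one d_model-sized vector
```

The resulting vector is then fed to a task-specific head, e.g. a linear layer for text classification.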



Collapsing Taylor Mode Automatic Differentiation

Dangel, Felix, Siebert, Tim, Zeinhofer, Marius, Walther, Andrea

arXiv.org Artificial Intelligence

Computing partial differential equation (PDE) operators via nested backpropagation is expensive, yet popular, and severely restricts their utility for scientific machine learning. Recent advances, like the forward Laplacian and randomizing Taylor mode automatic differentiation (AD), propose forward schemes to address this. We introduce an optimization technique for Taylor mode that 'collapses' derivatives by rewriting the computational graph, and demonstrate how to apply it to general linear PDE operators, and randomized Taylor mode. The modifications simply require propagating a sum up the computational graph, which could -- or should -- be done by a machine learning compiler, without exposing complexity to users. We implement our collapsing procedure and evaluate it on popular PDE operators, confirming it accelerates Taylor mode and outperforms nested backpropagation.
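The forward schemes mentioned in the abstract avoid nested backpropagation by propagating higher-order information alongside the value. A toy sketch of the forward-Laplacian idea (not the paper's implementation): each node carries a triple (value, gradient, Laplacian), and elementary operations update all three via the product and chain rules.

```python
import numpy as np

class Jet:
    """Forward 'Laplacian jet': value u, gradient g, Laplacian l.

    Propagating (u, g, l) through elementary ops yields Δf in a single
    forward pass, with no nested backpropagation.
    """
    def __init__(self, u, g, l):
        self.u, self.g, self.l = u, np.asarray(g, float), l

    def __add__(self, other):
        return Jet(self.u + other.u, self.g + other.g, self.l + other.l)

    def __mul__(self, other):
        # Product rule: Δ(ab) = a·Δb + b·Δa + 2 ∇a·∇b
        return Jet(self.u * other.u,
                   self.u * other.g + other.u * self.g,
                   self.u * other.l + other.u * self.l + 2.0 * (self.g @ other.g))

def sin(a):
    # Chain rule: Δ sin(a) = cos(a)·Δa − sin(a)·|∇a|²
    return Jet(np.sin(a.u), np.cos(a.u) * a.g,
               np.cos(a.u) * a.l - np.sin(a.u) * (a.g @ a.g))

def seed(x):
    """One jet per input coordinate: unit gradient, zero Laplacian."""
    n = len(x)
    return [Jet(x[i], np.eye(n)[i], 0.0) for i in range(n)]

x1, x2 = seed([0.3, 2.0])
f = sin(x1) + x2 * x2   # f(x) = sin(x1) + x2**2, so Δf = -sin(x1) + 2
```

Production systems such as `jax.experimental.jet` implement full Taylor-mode propagation; the collapsing technique in the paper rewrites this kind of computational graph so that only the sums needed for the PDE operator are carried forward.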


Autoconj: Recognizing and Exploiting Conjugacy Without a Domain-Specific Language

Matthew D. Hoffman

Neural Information Processing Systems

Deriving conditional and marginal distributions using conjugacy relationships can be time consuming and error prone. In this paper, we propose a strategy for automating such derivations. Unlike previous systems which focus on relationships between pairs of random variables, our system (which we call Autoconj) operates directly on Python functions that compute log-joint distribution functions. Autoconj provides support for conjugacy-exploiting algorithms in any Python-embedded PPL. This paves the way for accelerating development of novel inference algorithms and structure-exploiting modeling strategies.
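To illustrate the kind of derivation being automated: a Beta prior is conjugate to a Bernoulli likelihood, so the posterior is again a Beta with updated counts. The sketch below states that update by hand for a Python log-joint function; it is an illustration of the conjugacy relationship itself, not of Autoconj's API.

```python
import math

def log_joint(theta, data, a=2.0, b=2.0):
    """Log p(theta, data) for a Beta(a, b) prior and Bernoulli likelihood.

    Autoconj-style systems inspect exactly this kind of Python log-joint
    function to recognize conjugacy automatically.
    """
    lp = (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)
    lp += sum(x * math.log(theta) + (1 - x) * math.log(1 - theta) for x in data)
    return lp

def conjugate_posterior(data, a=2.0, b=2.0):
    """Beta-Bernoulli conjugacy: posterior is Beta(a + #heads, b + #tails)."""
    heads = sum(data)
    return a + heads, b + len(data) - heads

data = [1, 1, 0, 1]
a_post, b_post = conjugate_posterior(data)   # Beta(5.0, 3.0)
```

The log-joint above equals the log of the unnormalized Beta(a_post, b_post) density in theta, which is precisely the structure a conjugacy-recognizing system extracts.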