Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers

Neural Information Processing Systems

In video transformers, the time dimension is often treated in the same way as the two spatial dimensions. However, in a scene where objects or the camera may move, a physical point imaged at one location in frame t may be entirely unrelated to what is found at that location in frame t + k. These temporal correspondences should be modeled to facilitate learning about dynamic scenes. To this end, we propose a new drop-in block for video transformers--trajectory attention--that aggregates information along implicitly determined motion paths. We additionally propose a new method to address the quadratic dependence of computation and memory on the input size, which is particularly important for high resolution or long videos. While these ideas are useful in a range of settings, we apply them to the specific task of video action recognition with a transformer model and obtain state-of-the-art results on the Kinetics, Something-Something V2, and Epic-Kitchens datasets.


MV2Cyl: Reconstructing 3D Extrusion Cylinders from Multi-View Images

Neural Information Processing Systems

Extracting extrusion cylinders from raw 3D geometry has been extensively researched in computer vision, while the processing of 3D data through neural networks has remained a bottleneck. Since 3D scans are generally accompanied by multi-view images, 2D convolutional neural networks can exploit these images as a rich source of extrusion cylinder information. However, we observe that extracting and using only the surface information of the extrudes gives suboptimal results because of occlusion and the difficulty of surface segmentation. By combining the surface information with extracted base curve information, we obtain the best reconstruction results, with the highest accuracy in 2D sketch and extrude parameter estimation. Our experiments, which compare our method with previous work that takes a raw 3D point cloud as input, demonstrate the benefit of using multi-view images. Our project page can be found at https://mv2cyl.github.io.



A Numerically stable Multinomial Diffusion in log space

Neural Information Processing Systems

B.1 Language Modelling

For the language modelling experiments we use the standard text8 dataset with sequence length 256 and the enwik8 dataset with sequence length 320. The train/val/test splits are 90,000,000/5,000,000/5,000,000 characters for both text8 and enwik8, as is standard in the literature. The Multinomial Text Diffusion models are trained for 300 epochs, whereas the Argmax Flows are trained for 40 epochs, with the exception of the Argmax Coupling Flow on enwik8, which only needs to be trained for 20 epochs. Further details are presented in Tables 6 and 7. In addition, the code to reproduce the results will be made publicly available. There are no known ethics issues with these datasets at the time of writing.
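As an illustration of the data preparation described above, the following is a minimal sketch of a 90M/5M/5M split followed by cutting each part into fixed-length sequences. The function name `split_and_chunk` is hypothetical, and the choices of contiguous splits in train/val/test order and of non-overlapping chunks are assumptions; the text above states only the split sizes and the sequence lengths.

```python
def split_and_chunk(data, n_train=90_000_000, n_val=5_000_000,
                    n_test=5_000_000, seq_len=256):
    """Split a character sequence into contiguous train/val/test parts and
    cut each part into non-overlapping sequences of length seq_len.

    Assumption: splits are contiguous and taken in train/val/test order.
    """
    if len(data) < n_train + n_val + n_test:
        raise ValueError("data is shorter than the requested splits")
    bounds = [0, n_train, n_train + n_val, n_train + n_val + n_test]
    parts = [data[bounds[i]:bounds[i + 1]] for i in range(3)]
    # Any trailing remainder shorter than seq_len is dropped.
    return [
        [part[i:i + seq_len] for i in range(0, len(part) - seq_len + 1, seq_len)]
        for part in parts
    ]


# Toy usage: small sizes stand in for the 90M/5M/5M character splits.
train, val, test = split_and_chunk("a" * 100, n_train=90, n_val=5,
                                   n_test=5, seq_len=4)
```

For enwik8 the same sketch would be called with `seq_len=320`.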



Implicit Regularization Paths of Weighted Neural Representations

Neural Information Processing Systems

We study the implicit regularization effects induced by (observation) weighting of pretrained features. For weight and feature matrices of bounded operator norms that are infinitesimally free with respect to (normalized) trace functionals, we derive equivalence paths connecting different weighting matrices and ridge regularization levels. Specifically, we show that ridge estimators trained on weighted features along the same path are asymptotically equivalent when evaluated against test vectors of bounded norms. These paths can be interpreted as matching the effective degrees of freedom of ridge estimators fitted with weighted features. For the special case of subsampling without replacement, our results apply to independently sampled random features and kernel features and confirm recent conjectures (Conjectures 7 and 8) of the authors on the existence of such paths in [50]. We also present an additive risk decomposition for ensembles of weighted estimators and show that the risks are equivalent along the paths when the ensemble size goes to infinity. As a practical consequence of the path equivalences, we develop an efficient cross-validation method for tuning and apply it to subsampled pretrained representations across several models (e.g., ResNet-50) and datasets (e.g., CIFAR-100).
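To make the degrees-of-freedom interpretation concrete, the block below writes out the standard textbook definition of effective degrees of freedom for ridge regression on weighted features, together with a schematic matching condition. The symbols (feature matrix \Phi, weighting matrix W, ridge level \lambda, singular values s_i) are notation assumed here for illustration, and the exact normalization and path equation used in the paper may differ.

```latex
% Assumed notation: \Phi \in \mathbb{R}^{n \times p} is the feature matrix,
% W \in \mathbb{R}^{n \times n} a weighting matrix, \lambda > 0 the ridge level,
% and s_1, \dots, s_p the singular values of W\Phi (zero beyond its rank).
\[
  \mathrm{df}(W, \lambda)
  = \operatorname{tr}\!\left[
      \Phi^\top W^\top W \Phi
      \left( \Phi^\top W^\top W \Phi + n \lambda I_p \right)^{-1}
    \right]
  = \sum_{i=1}^{p} \frac{s_i^2}{s_i^2 + n \lambda}.
\]
% Schematic matching condition: two pairs lie on the same path when
\[
  \mathrm{df}(W_1, \lambda_1) = \mathrm{df}(W_2, \lambda_2).
\]
```

Under this definition, increasing \lambda lowers the degrees of freedom, which is how a weighting such as subsampling can trade off against explicit ridge regularization.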




Learning Human-like Representations to Enable Learning Human Values

Department of Computer Science, Princeton University

Neural Information Processing Systems

How can we build AI systems that can learn any set of individual human values both quickly and safely, avoiding causing harm or violating societal standards for acceptable behavior during the learning process? We explore the effects of representational alignment between humans and AI agents on learning human values. Making AI systems learn human-like representations of the world has many known benefits, including improving generalization, robustness to domain shifts, and few-shot learning performance. We demonstrate that this kind of representational alignment can also support safely learning and exploring human values in the context of personalization. We begin with a theoretical prediction, show that it applies to learning human morality judgments, then show that our results generalize to ten different aspects of human values - including ethics, honesty, and fairness - training AI agents on each set of values in a multi-armed bandit setting, where rewards reflect human value judgments over the chosen action. Using a set of textual action descriptions, we collect value judgments from humans, as well as similarity judgments from both humans and multiple language models, and demonstrate that representational alignment enables both safe exploration and improved generalization when learning human values.


Global Convergence of Online Optimization for Nonlinear Model Predictive Control

Neural Information Processing Systems

We study a real-time iteration (RTI) scheme for solving the online optimization problems that arise in nonlinear optimal control. The proposed RTI scheme modifies the existing RTI-based model predictive control (MPC) algorithm by selecting the stepsize of each Newton step at each sampling time using a differentiable exact augmented Lagrangian. The scheme adaptively selects the penalty parameters of the augmented Lagrangian on the fly, and these parameters are shown to stabilize after a certain period of time. We prove under generic assumptions that, by using stepsize selection instead of always taking a full Newton step (as most existing RTI schemes do), the scheme converges globally: for any initial point, the KKT residuals of the subproblems converge to zero. A key step is to show that the augmented Lagrangian keeps decreasing as the horizon moves forward. We demonstrate the global convergence behavior of the proposed RTI scheme in a numerical experiment.
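The role of stepsize selection can be illustrated on a toy problem. The sketch below is not the RTI scheme itself: it applies Newton's method to the scalar equation arctan(x) = 0 and, as a stand-in for the augmented Lagrangian, uses the squared residual 0.5 * f(x)^2 as the merit function in a backtracking (Armijo) line search. The function `newton` and all of its parameters are hypothetical names chosen for this example.

```python
import math


def newton(f, df, x0, damped, iters=50, tol=1e-10, c=1e-4):
    """Newton's method for f(x) = 0, optionally with a backtracking line
    search on the merit function 0.5 * f(x)**2 (an illustrative stand-in
    for an augmented Lagrangian merit function)."""
    merit = lambda z: 0.5 * f(z) ** 2
    x = x0
    for _ in range(iters):
        fx, d = f(x), df(x)
        if abs(fx) < tol or d == 0.0:
            break  # converged, or derivative underflowed after divergence
        step = -fx / d  # full Newton step
        alpha = 1.0
        if damped:
            # Armijo condition: along the Newton direction the directional
            # derivative of the merit function is -2 * merit(x).
            while (merit(x + alpha * step) > (1 - 2 * c * alpha) * merit(x)
                   and alpha > 1e-12):
                alpha *= 0.5
        x = x + alpha * step
    return x


f = math.atan
df = lambda x: 1.0 / (1.0 + x * x)

x_full = newton(f, df, x0=3.0, damped=False)   # full steps: iterates blow up
x_damped = newton(f, df, x0=3.0, damped=True)  # line search: converges to 0
```

From the starting point x0 = 3, full Newton steps overshoot further on every iteration, while the damped variant shortens the first step and then converges to the root at zero.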