spatio-temporal learning
Convolutional Tensor-Train LSTM for Spatio-Temporal Learning
Learning from spatio-temporal data has numerous applications such as human-behavior analysis, object tracking, video compression, and physics simulation. However, existing methods still perform poorly on challenging video tasks such as long-term forecasting, because these tasks require learning long-term spatio-temporal correlations in the video sequence. In this paper, we propose a higher-order convolutional LSTM model that can efficiently learn these correlations, along with a succinct representation of the history. This is accomplished through a novel tensor-train module that performs prediction by combining convolutional features across time. To make this feasible in terms of computation and memory requirements, we propose a novel convolutional tensor-train decomposition of the higher-order model. This decomposition reduces the model complexity by jointly approximating a sequence of convolutional kernels as a low-rank tensor-train factorization. As a result, our model outperforms existing approaches, including the baseline models, while using only a fraction of their parameters. Our results achieve state-of-the-art performance in a wide range of applications and datasets, including multi-step video prediction on the Moving-MNIST-2 and KTH action datasets as well as early activity recognition on the Something-Something V2 dataset.
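The idea of approximating a sequence of convolutional kernels jointly through a chain of small factors can be illustrated with a toy sketch. This is not the paper's formulation: it assumes single-channel 1-D signals, circular convolution, and a hypothetical ordering in which the first state passes through every core. All function names (`circ_conv`, `composite_kernels`, `combine_explicit`, `combine_chained`) are made up for illustration. The sketch only shows that chaining small shared cores is equivalent to applying one larger composite kernel per time step and summing the results.

```python
import random

def circ_conv(x, g):
    """Circular 1-D convolution of signal x with kernel g."""
    n = len(x)
    return [sum(g[k] * x[(i - k) % n] for k in range(len(g))) for i in range(n)]

def composite_kernels(cores, n):
    """Explicit kernel for step i: cores[i] * cores[i+1] * ... * cores[-1]."""
    kernels = []
    for i in range(len(cores)):
        w = [1.0] + [0.0] * (n - 1)  # identity element of circular convolution
        for g in cores[i:]:
            w = circ_conv(w, g)
        kernels.append(w)
    return kernels

def combine_explicit(states, cores):
    """Sum over steps of (composite kernel for that step) * state."""
    n = len(states[0])
    out = [0.0] * n
    for h, w in zip(states, composite_kernels(cores, n)):
        out = [o + c for o, c in zip(out, circ_conv(h, w))]
    return out

def combine_chained(states, cores):
    """Same result computed as a chain: v <- (v + h_i) * g_i."""
    v = [0.0] * len(states[0])
    for h, g in zip(states, cores):
        v = circ_conv([a + b for a, b in zip(v, h)], g)
    return v

# Usage: three past states of length 16, three cores of size 3 (assumed sizes).
random.seed(0)
states = [[random.gauss(0, 1) for _ in range(16)] for _ in range(3)]
cores = [[random.gauss(0, 1) for _ in range(3)] for _ in range(3)]
explicit = combine_explicit(states, cores)
chained = combine_chained(states, cores)
max_gap = max(abs(a - b) for a, b in zip(explicit, chained))
```

In this toy setting the chained form stores only the small cores, while the equivalent composite kernels have support that grows with the number of cores they pass through. The equivalence follows from the linearity and associativity of convolution.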
Shifted Chunk Transformer for Spatio-Temporal Representational Learning
Spatio-temporal representational learning has been widely adopted in various fields such as action recognition, video object segmentation, and action anticipation. Previous spatio-temporal representational learning approaches primarily employ ConvNets or sequential models, e.g., LSTM, to learn the intra-frame and inter-frame features. Recently, Transformer models have come to dominate natural language processing (NLP), image classification, and other areas. However, pure-Transformer-based spatio-temporal learning can be prohibitively costly in memory and computation when extracting fine-grained features from tiny patches. To tackle the training difficulty and enhance spatio-temporal learning, we construct a shifted chunk Transformer with pure self-attention blocks. Leveraging recent efficient Transformer designs in NLP, this shifted chunk Transformer can learn hierarchical spatio-temporal features from a local tiny patch to a global video clip. Our shifted self-attention can also effectively model complicated inter-frame variances. Furthermore, we build a Transformer-based clip encoder to model long-term temporal dependencies. We conduct thorough ablation studies to validate each component and hyper-parameter in our shifted chunk Transformer, and it outperforms previous state-of-the-art approaches on Kinetics-400, Kinetics-600, UCF101, and HMDB51.
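The memory and computation cost mentioned above can be made concrete with a small sketch. It is not the paper's shifted chunk design: it assumes a plain dot-product self-attention with no learned projections, applied independently inside fixed-size chunks of tokens, and the names `self_attention`, `chunked_self_attention`, and `num_attention_scores` are hypothetical. The point is only that restricting attention to chunks reduces the number of pairwise scores from quadratic in the token count to linear in it for a fixed chunk size.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Dot-product self-attention; tokens act as queries, keys, and values."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out

def chunked_self_attention(tokens, chunk_size):
    """Apply self-attention separately inside each consecutive chunk."""
    if len(tokens) % chunk_size != 0:
        raise ValueError("number of tokens must be divisible by chunk_size")
    out = []
    for start in range(0, len(tokens), chunk_size):
        out.extend(self_attention(tokens[start:start + chunk_size]))
    return out

def num_attention_scores(num_tokens, chunk_size):
    """Pairwise scores computed when attention is restricted to chunks."""
    return (num_tokens // chunk_size) * chunk_size ** 2

# Usage: 64 tokens, full attention versus chunks of 8 tokens (assumed sizes).
full_cost = num_attention_scores(64, 64)
chunk_cost = num_attention_scores(64, 8)
```

With these assumed sizes, full attention computes 4096 pairwise scores while chunked attention computes 512. A hierarchical model would still need some mechanism to exchange information across chunks, which this sketch leaves out.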
Review for NeurIPS paper: Convolutional Tensor-Train LSTM for Spatio-Temporal Learning
Weaknesses: The two major weaknesses are a lack of comparison to previous work by Yang et al. Although it does not rely on the same structure as this work (smooth evolution over time in video data vs tensor train), it does rely on a somewhat similar structure (i.e. Looking at these two side by side, I appreciate their difference; however, I think they're still too similar to not require a comparison. One could conceivably imagine that the same underlying structure is exploited by both approaches, which diminishes the novelty of the work. It remains to be seen whether this application of tensor train is orthogonal to the application of tensor train by Yang et al.
Review for NeurIPS paper: Convolutional Tensor-Train LSTM for Spatio-Temporal Learning
This paper develops a higher-Markov-order convolutional LSTM based on tensor-train decomposition, with applications to spatio-temporal activity analysis in videos. The reviews were mixed but marginally positive on average, and the scores increased slightly following the rebuttal and some discussion. There is a consensus that the approach is novel and interesting. The main criticism is that, despite the extensive experiments, it remains unclear whether it is the novel formulation itself that is producing the observed improvements, or the many other points that differ relative to the baselines. The advantages of using Markov order > 1 in this application also need to be clarified. Overall, the AC and SAC agreed that this was above threshold for NeurIPS.