Spatiotemporal Residual Networks for Video Action Recognition
Christoph Feichtenhofer, Axel Pinz, Richard Wildes
Neural Information Processing Systems
Two-stream Convolutional Networks (ConvNets) have shown strong performance for human action recognition in videos. Recently, Residual Networks (ResNets) have arisen as a new technique to train extremely deep architectures. In this paper, we introduce spatiotemporal ResNets as a combination of these two approaches. First, we inject residual connections between the appearance and motion pathways of a two-stream architecture to allow spatiotemporal interaction between the two streams. Second, we transform pretrained image ConvNets into spatiotemporal networks by equipping these with learnable convolutional filters that are initialized as temporal residual connections and operate on adjacent feature maps in time.
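The two ideas in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the tensor layout (time, channels, height, width), the three-tap temporal kernel, the identity-at-the-centre-tap initialization, and the additive motion-to-appearance injection are all assumptions chosen for illustration. It shows only that a temporal filter initialized this way leaves pretrained per-frame features unchanged until training alters it, and that a cross-stream residual connection can be written as an element-wise sum.

```python
import numpy as np

def identity_temporal_kernel(channels, taps=3):
    """Hypothetical init: centre tap is the identity, other taps are zero."""
    w = np.zeros((taps, channels, channels))
    w[taps // 2] = np.eye(channels)
    return w

def temporal_conv(x, w):
    """Convolve feature maps over time only.

    x: (T, C, H, W) feature maps for T adjacent frames.
    w: (taps, C_out, C_in) temporal kernel; zero padding keeps T fixed.
    """
    taps = w.shape[0]
    pad = taps // 2
    xp = np.pad(x, ((pad, pad), (0, 0), (0, 0), (0, 0)))
    T = x.shape[0]
    out = np.zeros((T, w.shape[1]) + x.shape[2:])
    for k in range(taps):
        # Tap k mixes channels of the frame at temporal offset k - pad.
        out += np.einsum("oc,tchw->tohw", w[k], xp[k:k + T])
    return out

def cross_stream_residual(appearance, motion):
    """Assumed additive injection of motion features into the appearance path."""
    return appearance + motion

rng = np.random.default_rng(0)
appearance = rng.standard_normal((5, 4, 7, 7))  # T=5 frames, C=4 channels
motion = rng.standard_normal((5, 4, 7, 7))

w = identity_temporal_kernel(channels=4)
fused = cross_stream_residual(appearance, motion)
out = temporal_conv(fused, w)
```

With the identity initialization, `out` equals `fused`, so the spatiotemporal network starts from the behaviour of the image network; learning non-zero values in the outer taps is what lets it mix information across adjacent frames.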
Feb-14-2020, 13:58:05 GMT