Recognition of Visually Perceived Compositional Human Actions by Multiple Spatio-Temporal Scales Recurrent Neural Networks

Lee, Haanvid, Jung, Minju, Tani, Jun

arXiv.org Artificial Intelligence 

Abstract--The current paper proposes a novel neural network model for recognizing visually perceived human actions. The proposed multiple spatiotemporal scales recurrent neural network (MSTRNN) model is derived by introducing multiple timescale recurrent dynamics to the conventional convolutional neural network model. One of the essential characteristics of the MSTRNN is that its architecture imposes both spatial and temporal constraints simultaneously on the neural activity, and these constraints vary in scale among different layers. As suggested by the principle of upward and downward causation, it is assumed that the network can develop meaningful structures such as a functional hierarchy by taking advantage of such constraints during the course of learning. To evaluate the characteristics of the model, the current study uses three types of human action video datasets consisting of different types of primitive actions and different levels of compositionality among them. The performance of the MSTRNN in testing with these datasets is compared with that of other representative deep learning models used in the field. The analysis of the internal representation obtained through learning with the datasets clarifies what sorts of functional hierarchy can be developed by extracting the essential compositionality underlying the datasets.

Recently, a convolutional neural network (CNN) [1], inspired by the mammalian visual cortex, showed remarkably better object image recognition performance than conventional vision recognition schemes that employ elaborately hand-coded visual features. A CNN trained with 1 million visual images from ImageNet [2] was able to classify hundreds of object images with an error rate of 6.67% [3], and demonstrated near-human performance [4]. However, CNNs are less effective in handling video image patterns than static images. To address this shortcoming, a number of action recognition models have been developed.

H. Lee is with the Department of Electrical Engineering, Korea Institute of Science and Technology, Daejeon 305-701, Republic of Korea, email: (haanvidlee@gmail.com). M. Jung is with the Department of Electrical Engineering, Korea Institute of Science and Technology, Daejeon 305-701, Republic of Korea, email: (minju5436@gmail.com).
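
The abstract's notion of "multiple timescale recurrent dynamics" can be illustrated with a minimal sketch. This is not the authors' implementation, and the text above gives no equations. The sketch assumes the leaky-integrator update commonly used in multiple timescale recurrent networks, where a time constant `tau` sets how quickly a unit's internal state follows its input. The names `leaky_step`, `run_unit`, `w_in`, and `w_rec`, and all numeric values, are hypothetical choices for illustration. The convolutional (spatial) part of the MSTRNN is omitted entirely.

```python
import math

def leaky_step(u_prev, x, tau, w_in=1.0, w_rec=0.5):
    """One leaky-integrator update (assumed form, not taken from the paper).

    The unit keeps a fraction (1 - 1/tau) of its previous internal state and
    mixes in 1/tau of the new drive, so a larger tau means slower change.
    """
    drive = w_in * x + w_rec * math.tanh(u_prev)  # input plus recurrent feedback
    return (1.0 - 1.0 / tau) * u_prev + drive / tau

def run_unit(inputs, tau):
    """Feed a scalar input sequence through one unit and return its activations."""
    u = 0.0
    activations = []
    for x in inputs:
        u = leaky_step(u, x, tau)
        activations.append(math.tanh(u))
    return activations

# A step input: off for 5 frames, then on for 5 frames.
frames = [0.0] * 5 + [1.0] * 5
fast = run_unit(frames, tau=2.0)    # hypothetical small time constant (fast unit)
slow = run_unit(frames, tau=10.0)   # hypothetical large time constant (slow unit)

for t, (f, s) in enumerate(zip(fast, slow)):
    print(f"frame {t}: fast={f:.3f} slow={s:.3f}")
```

Under these assumptions, the unit with the small time constant responds to the step almost immediately, while the unit with the large time constant integrates over many frames. Assigning different time constants to different layers is one way a network could impose temporal constraints that vary across layers, as the abstract describes.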
