Is end-to-end learning enough for fitness activity recognition?
Antoine Mercier, Guillaume Berger, Sunny Panchal, Florian Letsch, Cornelius Boehm, Nahua Kang, Ingo Bax, Roland Memisevic
arXiv.org Artificial Intelligence
End-to-end learning has taken hold of many computer vision tasks, in particular those involving still images, with task-specific optimization yielding very strong performance. Nevertheless, human-centric action recognition is still largely dominated by hand-crafted pipelines, in which only individual components are replaced by neural networks that typically operate on individual frames. As a testbed to study the relevance of such pipelines, we present a new fully annotated video dataset of fitness activities. Recognition in this domain depends almost exclusively on human poses and their temporal dynamics, so pose-based solutions should perform well. We show that, with this labelled data, end-to-end learning on raw pixels can compete with state-of-the-art action recognition pipelines based on pose estimation. We also show that end-to-end learning can support temporally fine-grained tasks such as real-time repetition counting.
May-14-2023
- Genre:
- Research Report (0.40)
- Technology:
- Information Technology > Artificial Intelligence > Vision (1.00)