Learning multimodal representations for sample-efficient recognition of human actions