Style-transfer based Speech and Audio-visual Scene Understanding for Robot Action Sequence Acquisition from Videos