Correspondence of high-dimensional emotion structures elicited by video clips between humans and multimodal LLMs