Can Large Multimodal Models Uncover Deep Semantics Behind Images?