Soft-Prompting with Graph-of-Thought for Multi-modal Representation Learning