Autonomous Workflow for Multimodal Fine-Grained Training Assistants Towards Mixed Reality