Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM