Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage