Beyond Words: Multimodal LLM Knows When to Speak