Multimodal Fusion with LLMs for Engagement Prediction in Natural Conversation