Voice Activity Projection Model with Multimodal Encoders