Voice Activity Projection Model with Multimodal Encoders
