Efficient Multimodal Neural Networks for Trigger-less Voice Assistants