Attention Is Not Always the Answer: Optimizing Voice Activity Detection with Simple Feature Fusion