Investigating and Enhancing Vision-Audio Capability in Omnimodal Large Language Models