When Vision Models Meet Parameter Efficient Look-Aside Adapters Without Large-Scale Audio Pretraining