Exploring Efficient-Tuned Learning Audio Representation Method from BriVL