AVCap: Leveraging Audio-Visual Features as Text Tokens for Captioning