Robust Latent Representation Tuning for Image-text Classification