Seeing Through Words, Speaking Through Pixels: Deep Representational Alignment Between Vision and Language Models