Caption-Driven Explorations: Aligning Image and Text Embeddings through Human-Inspired Foveated Vision