Bridging the Gap in Vision Language Models in Identifying Unsafe Concepts Across Modalities