When does perceptual alignment benefit vision representations?

Neural Information Processing Systems

Humans judge perceptual similarity according to diverse visual attributes, including scene layout, subject location, and camera pose. Existing vision models understand a wide range of semantic abstractions but improperly weigh these attributes and thus make inferences misaligned with human perception. While vision representations have previously benefited from human preference alignment in contexts like image generation, the utility of perceptually aligned representations in more general-purpose settings remains unclear. Here, we investigate how aligning vision model representations to human perceptual judgments impacts their usability in standard computer vision tasks. We finetune state-of-the-art models on a dataset of human similarity judgments for synthetic image triplets and evaluate them across diverse computer vision tasks.