
Neural Information Processing Systems 

Humans judge perceptual similarity according to diverse visual attributes, including scene layout, subject location, and camera pose. Existing vision models understand a wide range of semantic abstractions but weight these attributes improperly and thus make inferences misaligned with human perception. While vision representations have previously benefited from alignment in contexts like image generation, the utility of perceptually aligned representations in general-purpose settings remains unclear. Here, we investigate how aligning vision representations to human perceptual judgments impacts their usability across diverse vision tasks. We finetune state-of-the-art models on human similarity judgments for image triplets and evaluate them across standard benchmarks. We find that perceptual alignment yields representations that improve upon the original backbones across many tasks, including counting, segmentation, depth estimation, instance retrieval, and retrieval-augmented generation, while deteriorating performance on natural classification. Performance is largely preserved on other tasks, including specialized out-of-distribution domains such as medical imaging and 3D environment frames. Our results suggest that injecting an inductive bias about human perceptual knowledge into vision models can contribute to better representations.
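The finetuning signal described above comes from human judgments on image triplets: given a reference image and two alternatives, annotators pick which alternative is more similar to the reference. One common way to turn such judgments into a training objective is a hinge loss over embedding similarities. The sketch below is a minimal, hypothetical illustration of that idea (the function name, cosine-similarity choice, and margin value are assumptions, not the paper's exact formulation), operating on precomputed embedding vectors with NumPy:

```python
import numpy as np

def triplet_alignment_loss(ref, a, b, human_choice, margin=0.05):
    """Hinge loss encouraging an embedding to agree with a human judgment
    of which image (a or b) is more similar to the reference.

    ref, a, b: 1-D embedding vectors for the three images in the triplet.
    human_choice: 0 if annotators picked `a` as more similar to ref, 1 for `b`.
    margin: hypothetical margin by which the preferred pair's similarity
            should exceed the other pair's.
    """
    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    sim_a = cos(ref, a)
    sim_b = cos(ref, b)
    # Signed gap: positive when the model's similarities agree with the human.
    gap = (sim_a - sim_b) if human_choice == 0 else (sim_b - sim_a)
    # Zero loss once the model agrees by at least the margin.
    return max(0.0, margin - gap)
```

In an actual finetuning setup, the embeddings would be produced by the backbone being trained and this loss backpropagated through it; here the function only illustrates how a discrete human choice becomes a differentiable-style penalty on relative similarities.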
