Investigating Fine- and Coarse-grained Structural Correspondences Between Deep Neural Networks and Human Object Image Similarity Judgments Using Unsupervised Alignment
Soh Takahashi, Masaru Sasaki, Ken Takeda, Masafumi Oizumi
arXiv.org Artificial Intelligence
Summary: Introduces an unsupervised alignment method to assess how human-like the object representations of DNNs are. CLIP models show the highest fine-grained matching (20% top-1 match rate), underscoring the role of linguistic cues in refining representations. Image-only self-supervised models lack fine-grained matching with human representations; instead, they capture coarse category structures, hinting at prelinguistic correspondences.

Abstract
The learning mechanisms by which humans acquire internal representations of objects are not fully understood. Deep neural networks (DNNs) have emerged as a useful tool for investigating this question, as they develop internal representations similar to those of humans as a byproduct of optimizing their objective functions. While previous studies have shown that models trained with various learning paradigms (supervised, self-supervised, and CLIP) acquire human-like representations, it remains unclear whether their similarity to human representations holds primarily at a coarse category level or extends to finer details. Here, we employ an unsupervised alignment method based on Gromov-Wasserstein optimal transport to compare human and model object representations at both fine-grained and coarse-grained levels. Unlike conventional representational similarity analysis, this method estimates optimal fine-grained mappings between the representation of each object in humans and in models. We use this unsupervised alignment to assess the extent to which the representation of each object in humans is correctly mapped to the representation of the same object in models.
Using human similarity judgments of 1,854 objects from the THINGS dataset, we find that models trained with CLIP consistently achieve strong fine- and coarse-grained matching with human object representations.
Dec-2-2025