A.1 Effect of UNet layers
A.2 Effect of dimensionality reduction
A.3 Effect of fusion strategy
A.4 Effect of captioner and timestep

Neural Information Processing Systems 

To further understand the contribution of each component of our method and the impact of various design choices, we conduct a series of ablation studies on the SPair-71k dataset [7], sampling 20 pairs for each category. For every setting, we report PCK@κ (κ = 0.01, 0.05, 0.10) for both the Stable Diffusion (SD) and Fuse-ViT-B/14 methods.

We analyze how features extracted at different layers of the U-Net affect accuracy, specifically layers 2, 5, and 8, for both methods. The results in Tab. 1 suggest that layer 5 alone already provides strong performance for both the SD and the fused features, while combining all three layers further improves the overall performance of the fused features.
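For reference, the PCK@κ metric used throughout these ablations can be sketched as follows. This is a minimal illustration and not our evaluation code: the function name `pck`, its arguments, and the toy keypoints are hypothetical, and we assume the common convention in which a predicted keypoint counts as correct when it lies within κ · max(h, w) of the ground truth, with (h, w) the size of the target object's bounding box. Other normalisations (e.g. by image size) and other ways of averaging over pairs or categories are also in use.

```python
import math


def pck(pred, gt, bbox_hw, kappas=(0.01, 0.05, 0.10)):
    """Fraction of predicted keypoints within kappa * max(h, w) of the ground truth.

    pred, gt : lists of (x, y) keypoints in the target image, in matching order.
    bbox_hw  : (height, width) of the target object's bounding box
               (assumed normalisation; not specified in the text above).
    Returns a dict mapping each kappa to the PCK value for this image pair.
    """
    if len(pred) != len(gt) or not gt:
        raise ValueError("pred and gt must be non-empty and of equal length")
    scale = max(bbox_hw)
    # Euclidean distance between each predicted and ground-truth keypoint.
    dists = [math.dist(p, g) for p, g in zip(pred, gt)]
    return {k: sum(d <= k * scale for d in dists) / len(dists) for k in kappas}


# Toy example (hypothetical data): three keypoints, 100 x 200 bounding box.
pred = [(10.0, 10.0), (53.0, 54.0), (115.0, 80.0)]
gt = [(10.0, 11.0), (50.0, 50.0), (100.0, 80.0)]
scores = pck(pred, gt, bbox_hw=(100, 200))
```

In this toy case the three keypoint errors are 1, 5, and 15 pixels against thresholds of 2, 10, and 20 pixels, so the score rises from one correct keypoint out of three at κ = 0.01 to all three at κ = 0.10. The strict κ = 0.01 threshold therefore mainly reflects fine localisation accuracy, while κ = 0.10 tolerates coarser matches.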