What explains the success of cross-modal fine-tuning with ORCA?