Multi-modal Representations for Fine-grained Multi-label Critical View of Safety Recognition