Cross-Modal Safety Alignment: Is textual unlearning all you need?