Towards Multi-modal Transformers in Federated Learning