Resource-Efficient Federated Multimodal Learning via Layer-wise and Progressive Training