Fine-grained and Explainable Factuality Evaluation for Multimodal Summarization