Self-Explainable Affordance Learning with Embodied Caption