Refined Semantic Enhancement towards Frequency Diffusion for Video Captioning