Towards Generating Diverse Audio Captions via Adversarial Training