Enhance audio generation controllability through representation similarity regularization