Language Model Based Text-to-Audio Generation: Anti-Causally Aligned Collaborative Residual Transformers