Text-Speech Language Models with Improved Cross-Modal Transfer by Aligning Abstraction Levels
