Prompting Large Language Models with Audio for General-Purpose Speech Summarization
In this work, we introduce a framework for speech summarization that leverages the processing and reasoning capabilities of large language models (LLMs). We propose an end-to-end system that combines an instruction-tuned LLM with an audio encoder that converts speech into token representations that the LLM can interpret. Using a dataset with paired speech-text data, the overall system is trained to generate consistent responses to prompts with the same semantic information regardless of the input modality. The resulting framework allows the LLM to process speech inputs in the same way as text, enabling speech summarization by simply prompting the LLM. Unlike prior approaches, our method is able to summarize spoken content from any arbitrary domain, and it can produce summaries in different styles.

Our model is trained using the concept of modality invariance: the idea that, given certain semantic information in a prompt, the LLM should provide the same response regardless of the prompt's modality [12]. Specifically, we use an ASR dataset with paired speech-text data; while keeping the LLM weights frozen, we train the audio encoder to convert speech inputs into token representations that the LLM can interpret. Then, the end-to-end system is guided to produce the same output as when text is the input, using a next-token prediction loss. We additionally incorporate knowledge distillation, using the response from the corresponding text input as the teacher and applying feature and logit distillation losses to guide the model toward more consistent responses from speech inputs.
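To make the training recipe concrete, the sketch below illustrates this modality-invariance objective in a PyTorch-style setup: a frozen stand-in LLM, a trainable audio encoder that maps acoustic features into the LLM's embedding space, a next-token prediction loss on the target response, and feature/logit distillation from the text-prompted pass. The module names, dimensions, and loss weighting are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of modality-invariance training.
import torch
import torch.nn as nn
import torch.nn.functional as F

D_MODEL, VOCAB = 512, 32000   # assumed LLM width and vocabulary size


class FrozenLLM(nn.Module):
    """Stand-in for an instruction-tuned decoder-only LLM (weights frozen)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(D_MODEL, VOCAB)
        for p in self.parameters():
            p.requires_grad_(False)

    def forward(self, inputs_embeds):
        hidden = self.body(inputs_embeds)        # (B, T, D)
        return hidden, self.lm_head(hidden)      # hidden features, logits


class AudioEncoder(nn.Module):
    """Trainable encoder: acoustic features -> token-like embeddings."""
    def __init__(self, feat_dim=80):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv1d(feat_dim, D_MODEL, kernel_size=4, stride=4),  # downsample
            nn.GELU(),
        )

    def forward(self, feats):                    # (B, T_audio, feat_dim)
        return self.proj(feats.transpose(1, 2)).transpose(1, 2)     # (B, T', D)


def training_step(llm, audio_enc, speech_feats, text_ids, target_ids, alpha=0.5):
    """One step of modality-invariance training (illustrative)."""
    n = target_ids.size(1)

    # Teacher pass: the paired text transcript is the prompt; no gradients.
    with torch.no_grad():
        text_prompt = llm.embed(text_ids)
        t_feats, t_logits = llm(torch.cat([text_prompt, llm.embed(target_ids)], 1))

    # Student pass: audio-derived embeddings replace the text prompt;
    # gradients flow only into the audio encoder (LLM weights are frozen).
    audio_prompt = audio_enc(speech_feats)
    s_feats, s_logits = llm(torch.cat([audio_prompt, llm.embed(target_ids)], 1))

    # Logits at position t predict token t+1, so the n target tokens are
    # predicted from the n positions immediately before them.
    s_pred, t_pred = s_logits[:, -n - 1:-1], t_logits[:, -n - 1:-1]

    # 1) Next-token prediction on the text-conditioned target response.
    ntp = F.cross_entropy(s_pred.reshape(-1, VOCAB), target_ids.reshape(-1))

    # 2) Logit distillation: KL between student and teacher distributions.
    kd_logit = F.kl_div(F.log_softmax(s_pred, -1),
                        F.softmax(t_pred, -1), reduction="batchmean")

    # 3) Feature distillation: match hidden states over the response span.
    kd_feat = F.mse_loss(s_feats[:, -n:], t_feats[:, -n:])

    return ntp + alpha * (kd_logit + kd_feat)
```

Because the LLM stays frozen, only the audio encoder receives gradients, so the LLM's text-prompted behavior is preserved while speech-derived embeddings are pulled toward the representations that elicit it.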
arXiv.org Artificial Intelligence
Jun-9-2024