llasa
Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis
Ye, Zhen, Zhu, Xinfa, Chan, Chi-Min, Wang, Xinsheng, Tan, Xu, Lei, Jiahe, Peng, Yi, Liu, Haohe, Jin, Yizhu, DAI, Zheqi, Lin, Hongzhan, Chen, Jianyi, Du, Xingjian, Xue, Liumeng, Chen, Yunlin, Li, Zhifei, Xie, Lei, Kong, Qiuqiang, Guo, Yike, Xue, Wei
Recent advances in text-based large language models (LLMs), particularly in the GPT series and the o1 model, have demonstrated the effectiveness of scaling both training-time and inference-time compute. However, current state-of-the-art TTS systems leveraging LLMs are often multi-stage, requiring separate models (e.g., diffusion models after the LLM), complicating the decision of whether to scale a particular model during training or testing. This work makes the following contributions: First, we explore the scaling of train-time and inference-time compute for speech synthesis. Second, we propose a simple framework, Llasa, for speech synthesis that employs a single-layer vector quantizer (VQ) codec and a single Transformer architecture to fully align with standard LLMs such as Llama. Our experiments reveal that scaling train-time compute for Llasa consistently improves the naturalness of synthesized speech and enables the generation of more complex and accurate prosody patterns. Furthermore, from the perspective of scaling inference-time compute, we employ speech understanding models as verifiers during the search, finding that scaling inference-time compute shifts the sampling modes toward the preferences of specific verifiers, thereby improving emotional expressiveness, timbre consistency, and content accuracy. In addition, we have made the checkpoints and training code for our TTS models (1B, 3B, 8B) and codec model publicly available.
- Asia > Maldives (0.04)
- Asia > China > Hong Kong (0.04)
- South America > Brazil > Rio de Janeiro > Rio de Janeiro (0.04)
- Personal (1.00)
- Research Report > New Finding (0.46)
- Leisure & Entertainment (1.00)
- Media (0.93)
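The Llasa abstract above describes using speech understanding models as verifiers during search at inference time. One simple form such a search can take is best-of-N sampling, sketched below. This is a minimal illustration, not the authors' implementation: `best_of_n`, `toy_generate`, and `toy_verifier` are hypothetical names, the generator is a stand-in that drops words from a target sentence rather than sampling speech tokens, and the verifier is a stand-in for a content-accuracy check rather than a real speech understanding model.

```python
import random
from typing import Callable, List, Tuple


def best_of_n(
    generate: Callable[[random.Random], str],
    verifier: Callable[[str], float],
    n: int,
    seed: int = 0,
) -> Tuple[str, float]:
    """Draw n candidates and return the one the verifier scores highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    candidates: List[str] = [generate(rng) for _ in range(n)]
    scored = [(verifier(c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


# Hypothetical target text the synthesized output should match.
TARGET = "scaling inference time compute"


def toy_generate(rng: random.Random) -> str:
    # Stand-in for sampling from a TTS model: randomly drop words.
    words = [w for w in TARGET.split() if rng.random() > 0.3]
    return " ".join(words)


def toy_verifier(candidate: str) -> float:
    # Stand-in for a content-accuracy verifier: fraction of target words kept.
    target_words = TARGET.split()
    kept = set(candidate.split())
    return sum(w in kept for w in target_words) / len(target_words)


if __name__ == "__main__":
    for n in (1, 4, 16):
        text, score = best_of_n(toy_generate, toy_verifier, n=n, seed=7)
        print(n, round(score, 2), repr(text))
```

With a fixed seed, a larger `n` can only match or improve the verifier score, because the smaller candidate set is a prefix of the larger one. This mirrors, in a toy setting, the abstract's observation that more inference-time compute shifts the output toward whatever the chosen verifier prefers.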
LLaSA: Large Language and Structured Data Assistant
Xu, Yao, He, Shizhu, Xiangrong, Zeng, Chen, Jiabei, Liu, Guang, Wang, Bingning, Zhao, Jun, Liu, Kang
Structured data, such as tables, graphs, and databases, play a critical role in many NLP tasks such as question answering and dialogue systems. Recently, inspired by Vision-Language Models, Graph Neural Networks (GNNs) have been introduced as an additional modality into the input of Large Language Models (LLMs) to improve their performance on Structured Knowledge Grounding (SKG) tasks. However, those GNN-enhanced LLMs have the following limitations: (1) They employ diverse GNNs to model varying types of structured data, rendering them unable to uniformly process various forms of structured data. (2) The pretraining of GNNs is coupled with specific LLMs, which prevents GNNs from fully aligning with the textual space and limits their adaptability to other LLMs. To address these issues, we propose Large Language and Structured Data Assistant (LLaSA), a general framework for enhancing LLMs' ability to handle structured data. Specifically, we represent various types of structured data in a unified hypergraph format, and use self-supervised learning to pretrain a hypergraph encoder and a G-Former that compresses encoded hypergraph representations with cross-attention. The compressed hypergraph representations are appended to the serialized inputs during the training and inference stages of LLMs. Experimental results on multiple SKG tasks show that our pretrained hypergraph encoder can adapt to various LLMs and enhance their ability to process different types of structured data. Besides, LLaSA, with LoRA fine-tuning, outperforms the previous SOTA method that uses full-parameter tuning.
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- Asia > China > Beijing > Beijing (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
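The structured-data LLaSA abstract above mentions representing different kinds of structured data in a unified hypergraph format, but does not spell out the construction. As a rough illustration only, the sketch below encodes a table by treating each cell as a node and each row and each column as a hyperedge over its cells. The function name `table_to_hypergraph` and the `row:`/`col:` edge labels are assumptions made for this example, not the paper's scheme.

```python
from typing import Dict, List, Tuple


def table_to_hypergraph(
    header: List[str], rows: List[List[str]]
) -> Tuple[List[str], Dict[str, List[int]]]:
    """Encode a table as (nodes, hyperedges).

    Assumed encoding: nodes are cell values in row-major order; each
    hyperedge maps a label to the indices of the nodes it connects.
    """
    nodes: List[str] = []
    hyperedges: Dict[str, List[int]] = {}
    for r, row in enumerate(rows):
        if len(row) != len(header):
            raise ValueError(f"row {r} has {len(row)} cells, expected {len(header)}")
        for c, cell in enumerate(row):
            node_id = len(nodes)
            nodes.append(cell)
            # A cell belongs to exactly one row hyperedge and one column hyperedge.
            hyperedges.setdefault(f"row:{r}", []).append(node_id)
            hyperedges.setdefault(f"col:{header[c]}", []).append(node_id)
    return nodes, hyperedges


if __name__ == "__main__":
    nodes, edges = table_to_hypergraph(
        ["city", "country"],
        [["Paris", "France"], ["Lima", "Peru"]],
    )
    print(nodes)
    print(edges)
```

Because a hyperedge can join any number of nodes, the same node/hyperedge container could in principle also hold graph triples or database records, which is the kind of uniformity the abstract is after.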
LLaSA: Large Language and E-Commerce Shopping Assistant
Zhang, Shuo, Peng, Boci, Zhao, Xinping, Hu, Boren, Zhu, Yun, Zeng, Yanjia, Hu, Xuming
E-commerce platforms have evolved rapidly due to their widespread popularity and convenience. Developing an e-commerce shopping assistant for customers is crucial to helping them quickly find desired products and to recommending precisely what they need. However, most previous shopping assistants face two main problems: (1) task-specificity, which necessitates the development of different models for various tasks, thereby increasing development costs and limiting effectiveness; and (2) poor generalization, where the trained model performs inadequately on up-to-date products. To resolve these issues, we employ Large Language Models (LLMs) to construct an omnipotent assistant, leveraging their adeptness at handling multiple tasks and their superior generalization capability. Nonetheless, LLMs lack inherent knowledge of e-commerce concepts. To address this, we create an instruction dataset comprising 65,000 samples and diverse tasks, termed EshopInstruct. Through instruction tuning on our dataset, the assistant, named LLaSA, demonstrates the potential to function as an omnipotent assistant. Additionally, we propose various inference optimization strategies to enhance performance with limited inference resources. In the Amazon KDD Cup 2024 Challenge, our proposed method, LLaSA, achieved an overall ranking of 3rd place on ShopBench, which includes 57 tasks and approximately 20,000 questions, and we secured top-5 rankings in each track, especially in track4, where we achieved the best performance among all student teams. Our extensive practice demonstrates that LLMs have great potential to be competent e-commerce shopping assistants.
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.05)
- Asia > China > Guangdong Province > Guangzhou (0.05)
- Asia > China > Zhejiang Province > Hangzhou (0.05)
LLaSA: Large Multimodal Agent for Human Activity Analysis Through Wearable Sensors
Imran, Sheikh Asif, Khan, Mohammad Nur Hossain, Biswas, Subrata, Islam, Bashima
Integrating inertial measurement units (IMUs) with large language models (LLMs) advances multimodal AI by enhancing human activity understanding. We introduce SensorCaps, a dataset of 26,288 IMU-derived activity narrations, and OpenSQA, an instruction-following dataset with 257,562 question-answer pairs. Combining LIMU-BERT and Llama, we develop LLaSA, a Large Multimodal Agent capable of interpreting and responding to activity and motion analysis queries. Our evaluation demonstrates LLaSA's effectiveness in activity classification and question answering, highlighting its potential in healthcare, sports science, and human-computer interaction. These contributions advance sensor-aware language models and open new research avenues. Our code repository and datasets can be found at https://github.com/BASHLab/LLaSA.