Yu, Dianzhi
Recent Advances of Multimodal Continual Learning: A Comprehensive Survey
Yu, Dianzhi, Zhang, Xinni, Chen, Yankai, Liu, Aiwei, Zhang, Yifei, Yu, Philip S., King, Irwin
Continual learning (CL) aims to empower machine learning models to learn continually from new data, while building upon previously acquired knowledge without forgetting. As machine learning models have evolved from small to large pre-trained architectures, and from supporting unimodal to multimodal data, multimodal continual learning (MMCL) methods have recently emerged. The primary challenge of MMCL is that it goes beyond a simple stacking of unimodal CL methods, as such straightforward approaches often yield unsatisfactory performance. In this work, we present the first comprehensive survey on MMCL. We provide essential background knowledge and MMCL settings, as well as a structured taxonomy of MMCL methods. We categorize existing MMCL methods into four categories, i.e., regularization-based, architecture-based, replay-based, and prompt-based methods, explaining their methodologies and highlighting their key innovations. Additionally, to prompt further research in this field, we summarize open MMCL datasets and benchmarks, and discuss several promising future directions for investigation and development. We have also created a GitHub repository for indexing relevant MMCL papers and open resources available at https://github.com/LucyDYu/Awesome-Multimodal-Continual-Learning.
Recent Advances in Speech Language Models: A Survey
Cui, Wenqian, Yu, Dianzhi, Jiao, Xiaoqi, Meng, Ziqiao, Zhang, Guangyan, Wang, Qichao, Guo, Yiwen, King, Irwin
Large Language Models (LLMs) have recently garnered significant attention, primarily for their capabilities in text-based interactions. However, natural human interaction often relies on speech, necessitating a shift towards voice-based models. A straightforward approach to achieve this involves a pipeline of ``Automatic Speech Recognition (ASR) + LLM + Text-to-Speech (TTS)", where input speech is transcribed to text, processed by an LLM, and then converted back to speech. Despite being straightforward, this method suffers from inherent limitations, such as information loss during modality conversion and error accumulation across the three stages. To address these issues, Speech Language Models (SpeechLMs) -- end-to-end models that generate speech without converting from text -- have emerged as a promising alternative. This survey paper provides the first comprehensive overview of recent methodologies for constructing SpeechLMs, detailing the key components of their architecture and the various training recipes integral to their development. Additionally, we systematically survey the various capabilities of SpeechLMs, categorize the evaluation metrics for SpeechLMs, and discuss the challenges and future research directions in this rapidly evolving field.
An Entropy-based Text Watermarking Detection Method
Lu, Yijian, Liu, Aiwei, Yu, Dianzhi, Li, Jingjing, King, Irwin
Text watermarking algorithms for large language models (LLMs) can effectively identify machine-generated texts by embedding and detecting hidden features in the text. Although the current text watermarking algorithms perform well in most high-entropy scenarios, its performance in low-entropy scenarios still needs to be improved. In this work, we opine that the influence of token entropy should be fully considered in the watermark detection process, $i.e.$, the weight of each token during watermark detection should be customized according to its entropy, rather than setting the weights of all tokens to the same value as in previous methods. Specifically, we propose \textbf{E}ntropy-based Text \textbf{W}atermarking \textbf{D}etection (\textbf{EWD}) that gives higher-entropy tokens higher influence weights during watermark detection, so as to better reflect the degree of watermarking. Furthermore, the proposed detection process is training-free and fully automated. From the experiments, we demonstrate that our EWD can achieve better detection performance in low-entropy scenarios, and our method is also general and can be applied to texts with different entropy distributions. Our code and data is available\footnote{\url{https://github.com/luyijian3/EWD}}. Additionally, our algorithm could be accessed through MarkLLM \cite{pan2024markllm}\footnote{\url{https://github.com/THU-BPM/MarkLLM}}.