Length is a Curse and a Blessing for Document-level Semantics
Xiao, Chenghao, Li, Yizhi, Hudson, G Thomas, Lin, Chenghua, Moubayed, Noura Al
In recent years, contrastive learning (CL) has been extensively utilized to recover sentence- and document-level encoding capability from pre-trained language models. In this work, we question the length generalizability of CL-based models, i.e., their vulnerability to length-induced semantic shift. We verify not only that length vulnerability is a significant yet overlooked research gap, but also that unsupervised CL methods can be devised solely from the semantic signal provided by document length. We first derive the theoretical foundations underlying length attacks, showing that elongating a document intensifies the high intra-document similarity already brought about by CL. Moreover, we find that the isotropy promised by CL is highly dependent on the length range of the text exposed during training. Inspired by these findings, we introduce a simple yet universal document representation learning framework, LA(SER)$^{3}$: length-agnostic self-reference for semantically robust sentence representation learning, which achieves state-of-the-art unsupervised performance on the standard information retrieval benchmark.
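To make the length attack concrete, here is a minimal sketch of the vulnerability described above: embedding the same content at two lengths and checking how the similarity shifts. The sentence-transformers package and the all-MiniLM-L6-v2 checkpoint are illustrative assumptions; this probes the phenomenon and is not the LA(SER)$^{3}$ method itself.

```python
# A hedged illustration of a length attack: the same content, naively elongated
# by repetition, would keep its similarity profile if the encoder were truly
# length-agnostic. Model choice is an assumption, not the paper's setup.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

short_doc = "The committee approved the new budget proposal."
elongated_doc = " ".join([short_doc] * 16)  # naive elongation by repetition
unrelated_doc = "The cat chased a butterfly across the garden."

emb = model.encode([short_doc, elongated_doc, unrelated_doc], convert_to_tensor=True)
print("same content, short vs elongated:", util.cos_sim(emb[0], emb[1]).item())
print("elongated vs unrelated:          ", util.cos_sim(emb[1], emb[2]).item())
```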
Audio Contrastive based Fine-tuning
Wang, Yang, Liang, Qibin, Xiao, Chenghao, Li, Yizhi, Moubayed, Noura Al, Lin, Chenghua
Audio classification plays a crucial role in speech and sound processing tasks, with a wide range of applications. Striking the right balance between fitting the model to the training data (avoiding overfitting) and enabling it to generalise well to new domains remains a challenge. Leveraging the transferability of contrastive learning, we introduce Audio Contrastive-based Fine-tuning (AudioConFit), an efficient approach characterised by robust generalisability. Empirical experiments on a variety of audio classification tasks demonstrate the effectiveness and robustness of our approach, which achieves state-of-the-art results in various settings.
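The abstract does not spell out the contrastive objective, so the following is a minimal sketch of the kind of loss such contrastive fine-tuning could build on: an NT-Xent loss over two augmented views of each audio clip. The loss formulation, batch shapes, and temperature are assumptions rather than AudioConFit's actual recipe.

```python
# Sketch of an NT-Xent contrastive loss over paired views of audio embeddings;
# an assumed stand-in for the contrastive fine-tuning objective, not the
# paper's exact formulation.
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """NT-Xent loss over a batch of paired views (two augmentations per clip)."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)  # (2B, D), unit-norm embeddings
    sim = z @ z.t() / tau                        # temperature-scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))            # exclude self-similarity
    B = z1.size(0)
    # Each view's positive is its counterpart in the other half of the batch.
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)])
    return F.cross_entropy(sim, targets)

# Example: embeddings from two augmented views of the same audio batch.
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
print(nt_xent(z1, z2).item())
```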
On the Effectiveness of Speech Self-supervised Learning for Music
Ma, Yinghao, Yuan, Ruibin, Li, Yizhi, Zhang, Ge, Chen, Xingran, Yin, Hanzhi, Lin, Chenghua, Benetos, Emmanouil, Ragni, Anton, Gyenge, Norbert, Liu, Ruibo, Xia, Gus, Dannenberg, Roger, Guo, Yike, Fu, Jie
Self-supervised learning (SSL) has shown promising results in various speech and natural language processing applications. However, its efficacy in music information retrieval (MIR) remains largely unexplored. Previous SSL models pre-trained on music recordings have been mostly closed-source, while recent speech models such as wav2vec2.0 have shown promise for music modelling. Nevertheless, research exploring the effectiveness of applying speech SSL models to music recordings has been limited. We explore the music adaptation of SSL with two distinctive speech-related models, data2vec1.0 and HuBERT, which we refer to as music2vec and musicHuBERT, respectively. We train 12 SSL models with 95M parameters under various pre-training configurations and systematically evaluate their performance on 13 different MIR tasks. Our findings suggest that training with music data can generally improve performance on MIR tasks, even when models are trained using paradigms designed for speech. However, we identify limitations of such existing speech-oriented designs, especially in modelling polyphonic information. Based on the experimental results, we also give empirical suggestions for designing future musical SSL strategies and paradigms.
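For readers unfamiliar with how such evaluations are typically run, below is a sketch of a common MIR probing protocol consistent with the setup described above: freeze the SSL encoder, mean-pool its frame-level features, and train a lightweight classifier per task. The HuBERT checkpoint and the logistic-regression probe are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of a linear-probe evaluation for a frozen speech SSL model on a MIR
# task; checkpoint and probe are assumptions for illustration.
import torch
from transformers import AutoModel
from sklearn.linear_model import LogisticRegression

encoder = AutoModel.from_pretrained("facebook/hubert-base-ls960").eval()

def embed(waveforms: torch.Tensor) -> torch.Tensor:
    """Mean-pool the last hidden states of the frozen SSL encoder."""
    with torch.no_grad():
        hidden = encoder(input_values=waveforms).last_hidden_state  # (B, T, D)
    return hidden.mean(dim=1)                                       # (B, D)

# Dummy 16 kHz one-second clips and labels stand in for a real MIR dataset.
X = embed(torch.randn(8, 16000)).numpy()
y = [0, 1] * 4
probe = LogisticRegression(max_iter=1000).fit(X, y)
print(probe.score(X, y))
```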
MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training
Li, Yizhi, Yuan, Ruibin, Zhang, Ge, Ma, Yinghao, Chen, Xingran, Yin, Hanzhi, Lin, Chenghua, Ragni, Anton, Benetos, Emmanouil, Gyenge, Norbert, Dannenberg, Roger, Liu, Ruibo, Chen, Wenhu, Xia, Gus, Shi, Yemin, Huang, Wenhao, Guo, Yike, Fu, Jie
Self-supervised learning (SSL) has recently emerged as a promising paradigm for training generalisable models on large-scale data in the fields of vision, text, and speech. Although SSL has been proven effective in speech and audio, its application to music audio has yet to be thoroughly explored. This is primarily due to the distinctive challenges associated with modelling musical knowledge, particularly the tonal and pitched characteristics of music. To address this research gap, we propose an acoustic Music undERstanding model with large-scale self-supervised Training (MERT), which incorporates teacher models to provide pseudo labels for masked language modelling (MLM)-style acoustic pre-training. In our exploration, we identified a superior combination of teacher models that outperforms conventional speech and audio approaches: an acoustic teacher based on a Residual Vector Quantization Variational AutoEncoder (RVQ-VAE) and a musical teacher based on the Constant-Q Transform (CQT). These teachers effectively guide our student model, a BERT-style transformer encoder, to better model music audio. In addition, we introduce an in-batch noise mixture augmentation to enhance representation robustness. Furthermore, we explore a wide range of settings to overcome the instability of acoustic language model pre-training, which allows our designed paradigm to scale from 95M to 330M parameters. Experimental results indicate that our model generalises and performs well on 14 music understanding tasks, attaining state-of-the-art (SOTA) overall scores. The code and models are online: https://github.com/yizhilll/MERT.
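As a concrete illustration of the musical teacher described above, the sketch below computes CQT frame targets from a waveform. The sample rate, hop length, and use of librosa are assumptions; MERT's actual teacher pipeline may differ.

```python
# Sketch of CQT-based teacher targets: log-frequency bins aligned with musical
# pitch, one target vector per frame. Parameters are illustrative assumptions.
import numpy as np
import librosa

sr = 24000
y = np.random.randn(sr).astype(np.float32)  # stand-in for a 1 s music clip

cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=512, n_bins=84, bins_per_octave=12))
targets = np.log1p(cqt).T  # (frames, 84) frame-level targets for the student

print(targets.shape)
```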
Interactive Natural Language Processing
Wang, Zekun, Zhang, Ge, Yang, Kexin, Shi, Ning, Zhou, Wangchunshu, Hao, Shaochun, Xiong, Guangzheng, Li, Yizhi, Sim, Mong Yuan, Chen, Xiuying, Zhu, Qingqing, Yang, Zhenzhu, Nik, Adam, Liu, Qi, Lin, Chenghua, Wang, Shi, Liu, Ruibo, Chen, Wenhu, Xu, Ke, Liu, Dayiheng, Guo, Yike, Fu, Jie
Interactive Natural Language Processing (iNLP) has emerged as a novel paradigm within the field of NLP, aimed at addressing limitations in existing frameworks while aligning with the ultimate goals of artificial intelligence. This paradigm considers language models as agents capable of observing, acting, and receiving feedback iteratively from external entities. Specifically, language models in this context can: (1) interact with humans for better understanding and addressing user needs, personalizing responses, aligning with human values, and improving the overall user experience; (2) interact with knowledge bases for enriching language representations with factual knowledge, enhancing the contextual relevance of responses, and dynamically leveraging external information to generate more accurate and informed responses; (3) interact with models and tools for effectively decomposing and addressing complex tasks, leveraging specialized expertise for specific subtasks, and fostering the simulation of social behaviors; and (4) interact with environments for learning grounded representations of language, and effectively tackling embodied tasks such as reasoning, planning, and decision-making in response to environmental observations. This paper offers a comprehensive survey of iNLP, starting by proposing a unified definition and framework of the concept. We then provide a systematic classification of iNLP, dissecting its various components, including interactive objects, interaction interfaces, and interaction methods. We proceed to delve into the evaluation methodologies used in the field, explore its diverse applications, scrutinize its ethical and safety issues, and discuss prospective research directions. This survey serves as an entry point for researchers who are interested in this rapidly evolving area and offers a broad view of the current landscape and future trajectory of iNLP.
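The observe-act-feedback cycle at the heart of this definition can be made concrete with a toy loop; the agent, environment, and stopping rule below are entirely hypothetical and stand in for any of the four interaction types surveyed.

```python
# A minimal, hypothetical sketch of the iNLP interaction loop: the model acts
# on an observation and an external entity (human, KB, tool, or environment)
# returns feedback.
from typing import Callable

def interactive_loop(agent: Callable[[str], str],
                     environment: Callable[[str], str],
                     observation: str,
                     max_turns: int = 4) -> str:
    """Alternate between the model acting and an external entity responding."""
    for _ in range(max_turns):
        action = agent(observation)        # model acts on what it observed
        if action == "DONE":
            break
        observation = environment(action)  # external feedback becomes the next observation
    return observation

# Toy run: an 'environment' that echoes, an 'agent' that stops after one act.
print(interactive_loop(lambda obs: "DONE" if "echo" in obs else "query",
                       lambda act: f"echo:{act}", "start"))
```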
Chinese Open Instruction Generalist: A Preliminary Release
Zhang, Ge, Shi, Yemin, Liu, Ruibo, Yuan, Ruibin, Li, Yizhi, Dong, Siwei, Shu, Yu, Li, Zhaoqun, Wang, Zekun, Lin, Chenghua, Huang, Wenhao, Fu, Jie
Pre-trained large-scale language models (LLMs) have shown revolutionary performance in many downstream tasks (Guo et al., 2023; Wei et al., 2021). One crucial ability of LLMs is called instruction following. That is, models can complete the tasks described by instructions given as input. This ability is based on a specialized training stage called instruction tuning. Compared to unlabeled data used for pre-training, the data for instruction tuning is typically more goal-oriented, and it should explicitly demonstrate how a response follows its corresponding instruction with a given input. There are many instruction tuning datasets in English. For example, the FLAN collection (Longpre et al., 2023) contains 15M examples covering 1836 tasks, and OPT-IML (Iyer et al., 2022b) claims to have 18M examples for more than 2000 tasks (although it is still not publicly available). In contrast, existing data resources for Chinese instruction tuning are either small in scale or have questionable quality. For example, Ziang Leng and Li (2023) directly translate English instruction tuning data into Chinese, but do not consider mitigating translation errors or potential cultural gaps, e.g.
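For illustration, an instruction tuning record of the kind described above pairs an instruction and an optional input with a response that explicitly follows them; the field names below follow a common instruction/input/output convention and are an assumption, not COIG's exact schema.

```python
# An illustrative instruction-tuning record; field names are a common
# convention assumed here, not necessarily the dataset's schema.
example = {
    "instruction": "Translate the sentence below into English.",
    "input": "今天天气很好。",
    "output": "The weather is nice today.",
}
print(example["instruction"], "->", example["output"])
```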
CORGI-PM: A Chinese Corpus For Gender Bias Probing and Mitigation
Zhang, Ge, Li, Yizhi, Wu, Yaoyao, Zhang, Linyuan, Lin, Chenghua, Geng, Jiayi, Wang, Shi, Fu, Jie
As natural language processing (NLP) for gender bias becomes a significant interdisciplinary topic, prevalent data-driven techniques such as large-scale language models suffer from data inadequacy and biased corpora, especially for languages with insufficient resources such as Chinese. To this end, we propose CORGI-PM, a Chinese cOrpus foR Gender bIas Probing and Mitigation, which contains 32.9k sentences with high-quality labels derived by following an annotation scheme specifically developed for gender bias in the Chinese context. Moreover, we address three challenges for automatic textual gender bias mitigation, which require models to detect, classify, and mitigate textual gender bias. We also conduct experiments with state-of-the-art language models to provide baselines. To the best of our knowledge, CORGI-PM is the first sentence-level Chinese corpus for gender bias probing and mitigation.
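As a sketch of how the detection challenge could be framed, the snippet below casts sentence-level bias detection as binary classification with a Chinese pre-trained LM; the checkpoint and label set are illustrative assumptions, not the paper's baselines.

```python
# Sketch of the detection subtask as sentence-level binary classification;
# checkpoint and labels are assumptions for illustration.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("bert-base-chinese")
clf = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=2)  # assumed labels: biased / unbiased

# "This job is not suitable for girls." -- a stereotyped example sentence.
batch = tok(["这份工作不适合女生。"], return_tensors="pt", padding=True)
with torch.no_grad():
    logits = clf(**batch).logits
print(logits.softmax(dim=-1))  # head is untrained: scores are meaningless until fine-tuned
```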
MAP-Music2Vec: A Simple and Effective Baseline for Self-Supervised Music Audio Representation Learning
Li, Yizhi, Yuan, Ruibin, Zhang, Ge, Ma, Yinghao, Lin, Chenghua, Chen, Xingran, Ragni, Anton, Yin, Hanzhi, Hu, Zhijie, He, Haoyu, Benetos, Emmanouil, Gyenge, Norbert, Liu, Ruibo, Fu, Jie
The deep learning community has witnessed exponentially growing interest in self-supervised learning (SSL). However, how to build a framework for learning useful representations of raw music waveforms in a self-supervised manner remains unexplored. In this work, we design Music2Vec, a framework exploring different SSL algorithmic components and tricks for music audio recordings. Our model achieves results comparable to the state-of-the-art (SOTA) music SSL model Jukebox, despite being significantly smaller, with less than 2% of the latter's parameters. The model will be released on Hugging Face: https://huggingface.co/m-a-p/music2vec-v1
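A minimal sketch of extracting representations from the released checkpoint follows; that it loads through the generic Hugging Face AutoModel path, and the 16 kHz input rate, are assumptions based on its data2vec backbone.

```python
# Sketch of feature extraction from the released checkpoint; the loading path
# and input sample rate are assumptions, not documented guarantees.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("m-a-p/music2vec-v1").eval()
waveform = torch.randn(1, 16000)  # stand-in for one second of 16 kHz audio

with torch.no_grad():
    features = model(input_values=waveform).last_hidden_state  # (1, frames, dim)
print(features.shape)
```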
HERB: Measuring Hierarchical Regional Bias in Pre-trained Language Models
Li, Yizhi, Zhang, Ge, Yang, Bohao, Lin, Chenghua, Wang, Shi, Ragni, Anton, Fu, Jie
Fairness has become a trending topic in natural language processing (NLP), addressing biases targeting certain social groups such as genders and religions. However, regional bias in language models (LMs), a long-standing global discrimination problem, remains unexplored. This paper bridges the gap by analysing the regional bias learned by pre-trained language models that are broadly used in NLP tasks. In addition to verifying the existence of regional bias in LMs, we find that the biases on regional groups can be strongly influenced by the geographical clustering of the groups. We accordingly propose a HiErarchical Regional Bias evaluation method (HERB) that utilises information from sub-region clusters to quantify the bias in pre-trained LMs. Experiments show that our hierarchical metric can effectively evaluate regional bias with respect to comprehensive topics and measure the potential regional bias that can propagate to downstream tasks. Our code is available at https://github.com/Bernard-Yang/HERB.
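To give a feel for hierarchical aggregation over sub-region clusters, here is an illustrative (not the paper's exact) scoring function: a region's score combines the spread across its sub-clusters with the bias carried recursively inside them.

```python
# Illustrative hierarchical bias aggregation over nested region clusters;
# the combination rule is an assumption, not HERB's published formula.
from statistics import pstdev, mean

def hierarchical_bias(node) -> float:
    """Leaves carry per-region bias scores; internal nodes are cluster dicts."""
    if isinstance(node, (int, float)):
        return float(node)
    child_scores = [hierarchical_bias(child) for child in node.values()]
    # Spread across sub-clusters plus the average bias carried inside them.
    return pstdev(child_scores) + mean(child_scores)

regions = {"north": {"a": 0.1, "b": 0.3}, "south": {"c": 0.7, "d": 0.9}}
print(round(hierarchical_bias(regions), 3))
```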