Ling, Zixuan
How Should We Build A Benchmark? Revisiting 274 Code-Related Benchmarks For LLMs
Cao, Jialun, Chan, Yuk-Kit, Ling, Zixuan, Wang, Wenxuan, Li, Shuqing, Liu, Mingwei, Qiao, Ruixi, Han, Yuting, Wang, Chaozheng, Yu, Boxi, He, Pinjia, Wang, Shuai, Zheng, Zibin, Lyu, Michael R., Cheung, Shing-Chi
Various benchmarks have been proposed to assess the performance of large language models (LLMs) in different coding scenarios. We refer to them as code-related benchmarks. However, there are no systematic guidelines on how such a benchmark should be developed to ensure its quality, reliability, and reproducibility. We propose How2Bench, a 55-criterion checklist that serves as a set of guidelines to govern the development of code-related benchmarks comprehensively. Using How2Bench, we profiled 274 benchmarks released within the past decade and found concerning issues. Nearly 70% of the benchmarks took no measures to assure data quality; over 10% were not open-sourced at all or were only partially open-sourced. Many highly cited benchmarks have loopholes, including duplicated samples, incorrect reference code, tests, or prompts, and unremoved sensitive or confidential information. Finally, we conducted a human study involving 49 participants, which revealed significant gaps in awareness of the importance of data quality, reproducibility, and transparency.
Advancing Parameter Efficiency in Fine-tuning via Representation Editing
Wu, Muling, Liu, Wenhao, Wang, Xiaohua, Li, Tianlong, Lv, Changze, Ling, Zixuan, Zhu, Jianhao, Zhang, Cenyuan, Zheng, Xiaoqing, Huang, Xuanjing
Parameter-Efficient Fine-Tuning (PEFT) techniques have drawn significant attention for their ability to yield competitive results while updating only a small portion of the adjustable parameters. However, existing PEFT methods pose challenges in hyperparameter selection, such as choosing the rank for LoRA or Adapter, or specifying the length of soft prompts. To address these challenges, we propose a novel fine-tuning approach for neural models, named Representation EDiting (RED), which modifies the representations generated at some layers through the application of scaling and biasing operations. Whereas existing PEFT methods still exhibit over-parameterization that could potentially undermine the generalization ability acquired from pre-training, RED reduces the number of trainable parameters by a factor of 25,700 compared to full-parameter fine-tuning and by a factor of 32 relative to LoRA. Remarkably, RED achieves results comparable or superior to both full-parameter fine-tuning and other PEFT methods. Extensive experiments across various model architectures and scales, including RoBERTa, GPT-2, T5, and LLaMA-2, demonstrate the effectiveness and efficiency of RED, positioning it as a promising PEFT strategy for large-scale neural models.
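As a rough illustration of the scaling-and-biasing idea described in the abstract, the following PyTorch sketch wraps a frozen layer and trains only a per-dimension scale vector and bias vector. The class and variable names are hypothetical and the construction is an assumption about the general technique, not the paper's exact implementation.

```python
# Minimal sketch of representation editing via scaling and biasing
# (hypothetical names; not the paper's exact code): the backbone layer
# is frozen, and only two small vectors per edited layer are trained.
import torch
import torch.nn as nn

class EditedLayer(nn.Module):
    def __init__(self, frozen_layer: nn.Module, hidden_size: int):
        super().__init__()
        self.frozen_layer = frozen_layer
        for p in self.frozen_layer.parameters():
            p.requires_grad = False                            # backbone stays frozen
        self.scale = nn.Parameter(torch.ones(hidden_size))     # elementwise scaling
        self.bias = nn.Parameter(torch.zeros(hidden_size))     # elementwise biasing

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.frozen_layer(x)                               # original representation
        return h * self.scale + self.bias                      # edited representation

# Usage: wrap one feed-forward block of a transformer layer (hidden size 768).
layer = EditedLayer(nn.Linear(768, 768), hidden_size=768)
out = layer(torch.randn(2, 16, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2 * 768 = 1536 trainable parameters per edited layer
```

Because each edited layer contributes only two vectors of hidden-size length, the trainable-parameter count stays far below that of full fine-tuning, which is the source of the large reduction factors reported in the abstract.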
Decoding Continuous Character-based Language from Non-invasive Brain Recordings
Zhang, Cenyuan, Zheng, Xiaoqing, Yin, Ruicheng, Geng, Shujie, Xu, Jianhan, Gao, Xuan, Lv, Changze, Ling, Zixuan, Huang, Xuanjing, Cao, Miao, Feng, Jianfeng
Over the past decade, advancements in brain-computer interfaces have demonstrated the feasibility of decoding various forms of communication, such as speech sounds [80, 81], hand gestures [79, 82], articulatory movements [77, 78], and other signals [76] from intracranial recordings. Despite their efficacy, the requirement for invasive brain surgery limits the applicability of these decoding methods to patients with severe impediments in speech or communication due to neurodegenerative diseases, strokes, or traumatic brain injuries. In contrast, non-invasive recordings, particularly those employing functional magnetic resonance imaging (fMRI) [72, 74], magnetoencephalography (MEG) and electroencephalography (EEG) [73], have demonstrated the ability to record rich linguistic information, and decoding natural language from such non-invasive recordings holds the potential for broader applications in both restorative interventions and augmentative technologies. Previous efforts to decode natural language from non-invasive recordings have primarily focused on recognizing letters, words, or fragments within a predetermined set of possibilities [66-69, 72, 73]. A recent breakthrough has demonstrated the feasibility of decoding continuous language from non-invasive recordings of native English speakers [65].
Tailoring Personality Traits in Large Language Models via Unsupervisedly-Built Personalized Lexicons
Li, Tianlong, Dou, Shihan, Lv, Changze, Liu, Wenhao, Xu, Jianhan, Wu, Muling, Ling, Zixuan, Zheng, Xiaoqing, Huang, Xuanjing
Personality plays a pivotal role in shaping human expression patterns, so regulating the personality of large language models (LLMs) holds significant potential for enhancing their user experience. Previous methods either relied on fine-tuning LLMs on specific corpora or required manually crafted prompts to elicit specific personalities from LLMs. However, the former approach is inefficient and costly, while the latter cannot precisely manipulate personality traits at a fine-grained level. To address these challenges, we employ novel Unsupervisedly-Built Personalized Lexicons (UBPL) in a pluggable manner during the decoding phase of LLMs to manipulate their personality traits. UBPL is a lexicon built through an unsupervised approach from a situational judgment test dataset (SJTs4LLM). Users can utilize UBPL to adjust the probability vectors of predicted words in the decoding phase of LLMs, thus influencing the personality expression of LLMs. Extensive experimentation demonstrates the remarkable effectiveness and pluggability of our method for fine-grained manipulation of LLMs' personality.
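A minimal sketch of how a lexicon could be plugged into the decoding phase to adjust the probability vectors of predicted words, as described in the abstract. The function name, bias weight, and toy token ids below are illustrative assumptions, not the paper's exact UBPL procedure.

```python
# Minimal sketch of lexicon-guided decoding (illustrative assumption, not the
# paper's exact method): logits of lexicon tokens are boosted before sampling.
import torch

def apply_lexicon_bias(logits: torch.Tensor,
                       lexicon_token_ids: list,
                       weight: float = 2.0) -> torch.Tensor:
    """Add a fixed bonus to lexicon-token logits, then renormalize."""
    biased = logits.clone()
    biased[..., lexicon_token_ids] += weight    # boost personality-bearing words
    return torch.log_softmax(biased, dim=-1)    # back to a valid log-probability vector

# Usage with toy values: vocabulary of 10 tokens, lexicon = tokens {3, 7}.
logits = torch.randn(1, 10)
log_probs = apply_lexicon_bias(logits, lexicon_token_ids=[3, 7], weight=2.0)
next_token = torch.argmax(log_probs, dim=-1)
print(next_token)
```

The same adjustment can be applied at every decoding step without touching model weights, which is what makes this kind of lexicon intervention pluggable.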
Aligning Large Language Models with Human Preferences through Representation Engineering
Liu, Wenhao, Wang, Xiaohua, Wu, Muling, Li, Tianlong, Lv, Changze, Ling, Zixuan, Zhu, Jianhao, Zhang, Cenyuan, Zheng, Xiaoqing, Huang, Xuanjing
Aligning large language models (LLMs) with human preferences is crucial for enhancing their utility in terms of helpfulness, truthfulness, safety, harmlessness, and interestingness. Existing methods for achieving this alignment often involve employing reinforcement learning from human feedback (RLHF) to fine-tune LLMs based on human labels assessing the relative quality of model responses. Nevertheless, RLHF is susceptible to instability during fine-tuning and presents challenges in implementation. Drawing inspiration from the emerging field of representation engineering (RepE), this study aims to identify relevant representations for high-level human preferences embedded in patterns of activity within an LLM and to achieve precise control of model behavior by transforming its representations. This novel approach, denoted as Representation Alignment from Human Feedback (RAHF), proves to be effective, computationally efficient, and easy to implement. Extensive experiments demonstrate the efficacy of RAHF in not only capturing but also manipulating representations to align with a broad spectrum of human preferences or values, rather than being confined to a singular concept or function (e.g., honesty or bias). RAHF's versatility in accommodating diverse human preferences shows its potential for advancing LLM performance.
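The sketch below illustrates one common representation-engineering recipe for "transforming representations": extract a preference direction from the difference of mean activations on preferred versus dispreferred responses, then shift hidden states along it at inference. All names, the difference-of-means construction, and the additive shift are assumptions for illustration, not the paper's exact RAHF procedure.

```python
# Minimal sketch of activation steering in the spirit of representation
# engineering (illustrative assumptions, not the exact RAHF method).
import torch

def preference_direction(h_preferred: torch.Tensor,
                         h_dispreferred: torch.Tensor) -> torch.Tensor:
    """Inputs: (num_examples, hidden_size) hidden states collected at one layer."""
    direction = h_preferred.mean(dim=0) - h_dispreferred.mean(dim=0)
    return direction / direction.norm()          # unit-norm preference direction

def steer(hidden_states: torch.Tensor,
          direction: torch.Tensor,
          alpha: float = 4.0) -> torch.Tensor:
    """Shift every token representation along the preference direction."""
    return hidden_states + alpha * direction

# Usage with random stand-ins for collected activations (hidden size 768).
h_good, h_bad = torch.randn(64, 768), torch.randn(64, 768)
v = preference_direction(h_good, h_bad)
steered = steer(torch.randn(2, 16, 768), v)      # (batch, seq, hidden)
print(steered.shape)
```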
SpikeBERT: A Language Spikformer Learned from BERT with Knowledge Distillation
Lv, Changze, Li, Tianlong, Xu, Jianhan, Gu, Chenxi, Ling, Zixuan, Zhang, Cenyuan, Zheng, Xiaoqing, Huang, Xuanjing
Spiking neural networks (SNNs) offer a promising avenue for implementing deep neural networks in a more energy-efficient way. However, the network architectures of existing SNNs for language tasks are still simplistic and relatively shallow, and deep architectures have not been fully explored, resulting in a significant performance gap compared to mainstream transformer-based networks such as BERT. To this end, we improve a recently proposed spiking Transformer (i.e., Spikformer) to make it capable of processing language tasks, and we propose a two-stage knowledge distillation method for training it: pre-training by distilling knowledge from BERT with a large collection of unlabelled texts, followed by fine-tuning on task-specific instances via knowledge distillation again from a BERT fine-tuned on the same training examples. Through extensive experimentation, we show that the models trained with our method, named SpikeBERT, outperform state-of-the-art SNNs and even achieve results comparable to BERT on text classification tasks for both English and Chinese, with much lower energy consumption.

Modern artificial neural networks (ANNs) have been highly successful in a wide range of natural language processing (NLP) and computer vision (CV) tasks. However, training and deploying state-of-the-art ANN models requires a great deal of computational energy, leading to a consistent increase in energy consumption per model over the past decade. The energy consumption of large language models during inference, such as ChatGPT (OpenAI, 2022) and GPT-4 (OpenAI, 2023), is unfathomable. In recent years, spiking neural networks (SNNs), arguably known as the third generation of neural networks (Maass, 1997), have attracted a lot of attention due to their high biological plausibility, event-driven property, and low energy consumption (Roy et al., 2019). Like biological neurons, SNNs use discrete spikes to process and transmit information.
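To make the knowledge-distillation objective concrete, the following sketch shows a generic distillation loss combining hidden-state alignment with soft-label matching against a BERT teacher. The specific loss terms, temperature, and weighting are assumptions for illustration, not necessarily those used to train SpikeBERT.

```python
# Minimal sketch of a teacher-student distillation loss (illustrative
# assumptions, not SpikeBERT's exact objective): MSE on hidden states
# plus temperature-scaled KL divergence on the teacher's logits.
import torch
import torch.nn.functional as F

def distillation_loss(student_hidden, teacher_hidden,
                      student_logits, teacher_logits,
                      temperature: float = 2.0, alpha: float = 0.5):
    feat = F.mse_loss(student_hidden, teacher_hidden)            # representation alignment
    soft = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                    F.softmax(teacher_logits / temperature, dim=-1),
                    reduction="batchmean") * temperature ** 2    # soft-label matching
    return alpha * feat + (1 - alpha) * soft

# Usage with toy tensors: batch of 4, hidden size 768, 2 classes.
loss = distillation_loss(torch.randn(4, 768), torch.randn(4, 768),
                         torch.randn(4, 2), torch.randn(4, 2))
print(loss.item())
```

In a two-stage setup of the kind described above, a loss of this form would first be driven by a pre-trained BERT teacher on unlabelled text and then by a task-fine-tuned BERT teacher on labelled task instances.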