English language model
Methodology of Adapting Large English Language Models for Specific Cultural Contexts
Zhang, Wenjing, Xiao, Siqi, Lei, Xuejiao, Wang, Ning, Zhang, Huazheng, An, Meijuan, Yang, Bikun, Liu, Zhaoxiang, Wang, Kai, Lian, Shiguo
The rapid growth of large language models (LLMs) has emerged as a prominent trend in the field of artificial intelligence. However, current state-of-the-art LLMs are predominantly based on English and encounter limitations when applied directly to tasks in specific cultural domains, owing to deficiencies in domain-specific knowledge and misunderstandings caused by differences in cultural values. To address this challenge, our paper proposes a rapid adaptation method for large models in specific cultural contexts, which leverages instruction tuning based on specific cultural knowledge and safety-values data. Taking Chinese as the specific cultural context and LLaMA3-8B as the experimental English LLM, the evaluation results demonstrate that the adapted LLM significantly enhances its capabilities in domain-specific knowledge and adaptability to safety values, while maintaining its original expertise advantages.
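The abstract does not specify how the cultural-knowledge and safety-values data are formatted for instruction tuning. A minimal sketch, assuming an Alpaca-style prompt template (not confirmed by the paper), of how such instruction/response pairs might be rendered into training strings:

```python
# Sketch of instruction-tuning data preparation. The template and the
# example records below are assumptions for illustration, not the
# paper's actual format or data.

PROMPT_TEMPLATE = (
    "Below is an instruction describing a task.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{response}"
)

def build_training_examples(records):
    """Render (instruction, response) pairs into training strings."""
    return [
        PROMPT_TEMPLATE.format(instruction=r["instruction"],
                               response=r["response"])
        for r in records
    ]

# Hypothetical records mixing cultural knowledge with a general task,
# mirroring the paper's goal of adding domain knowledge while keeping
# the model's original expertise.
records = [
    {"instruction": "解释春节的文化意义。",
     "response": "春节是中国最重要的传统节日之一。"},
    {"instruction": "Summarize what attention does in a transformer.",
     "response": "Attention weighs interactions between tokens."},
]

examples = build_training_examples(records)
print(len(examples))  # 2
```

The rendered strings would then be tokenized and fed to a standard supervised fine-tuning loop; the choice of template mainly needs to stay consistent between training and inference.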
Identifying and Adapting Transformer-Components Responsible for Gender Bias in an English Language Model
Chintam, Abhijith, Beloch, Rahel, Zuidema, Willem, Hanna, Michael, van der Wal, Oskar
Language models (LMs) exhibit and amplify many types of undesirable biases learned from the training data, including gender bias. However, we lack tools for effectively and efficiently changing this behavior without hurting general language modeling performance. In this paper, we study three methods for identifying causal relations between LM components and particular outputs: causal mediation analysis, automated circuit discovery, and our novel, efficient method called DiffMask+, based on differential masking. We apply these methods to GPT-2 small and the problem of gender bias, and use the discovered sets of components to perform parameter-efficient fine-tuning for bias mitigation. Our results show significant overlap in the identified components (despite huge differences in the computational requirements of the methods) as well as success in mitigating gender bias, with less damage to general language modeling than full-model fine-tuning. However, our work also underscores the difficulty of defining and measuring bias, and the sensitivity of causal discovery procedures to dataset choice. We hope our work draws more attention to dataset development and leads to more effective mitigation strategies for other types of bias.
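The core idea of the fine-tuning step, training only the components that a causal-discovery method flagged, can be sketched in PyTorch. This is not the paper's code: the toy model and the set of "identified" components below are placeholders standing in for GPT-2 small and the output of a method like DiffMask+ or circuit discovery.

```python
import torch.nn as nn

# Toy stand-in for a transformer: named components whose parameters we
# can freeze or unfreeze individually. In the paper's setting these
# would be specific attention heads / MLPs of GPT-2 small.
model = nn.ModuleDict({
    "attn_head_0": nn.Linear(16, 16),
    "attn_head_1": nn.Linear(16, 16),
    "mlp": nn.Linear(16, 16),
})

# Hypothetical output of a causal-discovery method: the components
# found responsible for the biased behavior.
identified = {"attn_head_1"}

# Parameter-efficient fine-tuning: freeze everything, then unfreeze
# only the identified components before training on debiasing data.
for name, module in model.items():
    for p in module.parameters():
        p.requires_grad = name in identified

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"{trainable}/{total} parameters trainable")  # 272/816
```

An optimizer would then be built over `filter(lambda p: p.requires_grad, model.parameters())`, so gradient updates touch only the flagged components, which is what limits damage to general language modeling.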
Getting Started with 5 Essential Natural Language Processing Libraries - KDnuggets
Let's say that you have an understanding of how to tackle natural language processing tasks. Let's also say that you have decided, more specifically, on the type of approach you will employ in attempting to solve your task. You still need to put your plan into action, computationally, and there is a good chance you will be looking to leverage an existing NLP library to help you do so. Assuming you are programming in Python (I can't help you if not), there is quite a landscape of options to choose from. While this article is not an endorsement of any particular collection of such solutions, it serves as an overview of a curated list of 5 popular libraries you may turn to when working on your problems.
Mozilla updates DeepSpeech with an English language model that runs 'faster than real time'
DeepSpeech, an open source speech-to-text engine maintained by Mozilla's Machine Learning Group, this morning received an update (to version 0.6) that incorporates one of the fastest open source speech recognition models to date. In a blog post, senior research engineer Reuben Morais lays out what's new and enhanced, as well as other spotlight features coming down the pipeline. The latest version of DeepSpeech adds support for TensorFlow Lite, a version of Google's TensorFlow machine learning framework that's optimized for compute-constrained mobile and embedded devices. This support has reduced DeepSpeech's package size from 98MB to 3.7MB and its built-in English model size -- which has a 7.5% word error rate on a popular benchmark and which was trained on 5,516 hours of transcribed audio from WAMU (NPR), LibriSpeech, Fisher, Switchboard, and Mozilla's Common Voice English data sets -- from 188MB to 47MB. Plus, it has cut down DeepSpeech's memory consumption by 22 times and boosted its startup speed by over 500 times.
Getting Started with spaCy for Natural Language Processing
In a series of previous posts, we have looked at some general ideas related to textual data science tasks, be they natural language processing, text mining, or something different yet closely related. In the most recent of these posts, we covered a text data preprocessing walkthrough using Python and NLTK. While we did not go any further than data preprocessing with NLTK, the toolkit could, theoretically, be used for further analysis tasks. While NLTK is a great natural language... well, toolkit (hence the name), it is not optimized for building production systems.
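As a starting point for the production-oriented alternative the post introduces, here is a minimal spaCy sketch. It uses a blank English pipeline, which provides tokenization without downloading a trained model such as `en_core_web_sm` (loading a trained model is what would add tagging, parsing, and named entity recognition).

```python
import spacy

# A blank English pipeline: tokenizer only, no trained components,
# so no separate model download is required.
nlp = spacy.blank("en")

doc = nlp("spaCy is built for production NLP systems.")
tokens = [token.text for token in doc]
print(tokens)
```

Swapping `spacy.blank("en")` for `spacy.load("en_core_web_sm")` keeps the rest of the code unchanged while enabling the full annotation pipeline, which is part of what makes spaCy convenient to move from preprocessing into real analysis.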