 sememe


SEE: Sememe Entanglement Encoding for Transformer-based Models Compression

arXiv.org Artificial Intelligence

Transformer-based large language models exhibit groundbreaking capabilities, but their storage and computational costs are prohibitively high, limiting their application in resource-constrained scenarios. An effective approach is to eliminate redundant model parameters and computational costs while incorporating efficient expert-derived knowledge structures to achieve a balance between compression and performance. Therefore, we propose the Sememe Entanglement Encoding (SEE) algorithm. Guided by expert prior knowledge, the model is compressed through the low-rank approximation idea. In Entanglement Embedding, basic semantic units such as sememes are represented as low-dimensional vectors, and then reconstructed into high-dimensional word embeddings through the combination of generalized quantum entanglement. We adapt the Sememe Entanglement Encoding algorithm to transformer-based models of different magnitudes. Experimental results indicate that our approach achieves stable performance while compressing model parameters and computational costs.
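As a rough illustration of how low-dimensional vectors might be reconstructed into a higher-dimensional embedding, the sketch below sums Kronecker (tensor) products of small vectors. This is only an assumption about what "the combination of generalized quantum entanglement" could look like; the abstract does not specify the construction, and the function names, rank, and values here are hypothetical.

```python
def kron(u, v):
    """Kronecker (tensor) product of two vectors, flattened to a list."""
    return [a * b for a in u for b in v]


def reconstruct_embedding(factors):
    """Sum, over rank terms, the Kronecker product of each term's small vectors.

    factors: list of rank terms; each term is a list of low-dimensional vectors.
    With r terms of n vectors of size q, r * n * q stored numbers yield
    an embedding of size q ** n.
    """
    total = None
    for term in factors:
        vec = term[0]
        for small in term[1:]:
            vec = kron(vec, small)
        total = vec if total is None else [x + y for x, y in zip(total, vec)]
    return total


# Hypothetical factors: rank 2, two vectors of size 2 per term (toy values).
factors = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.5, 0.0], [1.0, 1.0]],
]
embedding = reconstruct_embedding(factors)
print(embedding)  # [3.5, 4.5, 6.0, 8.0]
```

At this toy size nothing is saved, but the stored parameter count grows linearly in the number of small vectors while the reconstructed dimension grows exponentially, which is where a scheme of this kind would get its compression.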


SememeLM: A Sememe Knowledge Enhanced Method for Long-tail Relation Representation

arXiv.org Artificial Intelligence

Recognizing relations between two words is a fundamental task with broad applications. Unlike extracting relations from text, identifying relations among words without their contexts is difficult. It is even more difficult for long-tail relations because of inadequate semantic features. Existing approaches based on language models (LMs) utilize the rich knowledge of LMs to enhance the semantic features of relations. However, they capture common relations while overlooking less frequent but meaningful ones, since the knowledge of LMs relies heavily on training data, which often represents common relations. Long-tail relations, on the other hand, are often uncommon in training data. Using external knowledge to enrich LMs is interesting but not trivial, because collecting a corpus containing long-tail relations is hardly feasible. In this paper, we propose a sememe knowledge enhanced method (SememeLM) to enhance the representation of long-tail relations, in which sememes can break the contextual constraints between words. Firstly, we present a sememe relation graph and propose a graph encoding method. Moreover, since an external knowledge base may contain massive irrelevant knowledge, noise is introduced. We propose a consistency alignment module, which aligns the introduced knowledge with LMs, reduces the noise, and integrates the knowledge into the language model. Finally, we conduct experiments on word analogy datasets, which evaluate the ability to distinguish subtle differences between relation representations, including long-tail relations. Extensive experiments show that our approach outperforms some state-of-the-art methods.


SememeASR: Boosting Performance of End-to-End Speech Recognition against Domain and Long-Tailed Data Shift with Sememe Semantic Knowledge

arXiv.org Artificial Intelligence

Recently, excellent progress has been made in speech recognition. However, purely data-driven approaches have struggled with domain mismatch and long-tailed data. Considering that knowledge-driven approaches can help data-driven approaches alleviate their flaws, we introduce sememe-based semantic knowledge information to speech recognition (SememeASR). A sememe, according to the linguistic definition, is the minimum semantic unit in a language and is able to represent the implicit semantic information behind each word very well. Our experiments show that the introduction of sememe information can improve the effectiveness of speech recognition. In addition, our further experiments show that sememe knowledge can improve the model's recognition of long-tailed data and enhance the model's domain generalization ability.


TKDP: Threefold Knowledge-enriched Deep Prompt Tuning for Few-shot Named Entity Recognition

arXiv.org Artificial Intelligence

Few-shot named entity recognition (NER) exploits limited annotated instances to identify named mentions. Effectively transferring the internal or external resources thus becomes the key to few-shot NER. While the existing prompt tuning methods have shown remarkable few-shot performances, they still fail to make full use of knowledge. In this work, we investigate the integration of rich knowledge to prompt tuning for stronger few-shot NER. We propose incorporating the deep prompt tuning framework with threefold knowledge (namely TKDP), including the internal 1) context knowledge and the external 2) label knowledge & 3) sememe knowledge. TKDP encodes the three feature sources and incorporates them into the soft prompt embeddings, which are further injected into an existing pre-trained language model to facilitate predictions. On five benchmark datasets, our knowledge-enriched model boosts by at most 11.53% F1 over the raw deep prompt method, and significantly outperforms 8 strong-performing baseline systems in 5-/10-/20-shot settings, showing great potential in few-shot NER. Our TKDP can be broadly adapted to other few-shot tasks without effort.


The Analysis about Building Cross-lingual Sememe Knowledge Base Based on Deep Clustering Network

arXiv.org Artificial Intelligence

A sememe is defined as the minimum semantic unit of human languages. Sememe knowledge bases (KBs), which contain words annotated with sememes, have been successfully applied to many NLP tasks, and we believe that by learning the smallest unit of meaning, computers can more easily understand human language. However, existing sememe KBs are built only by manual annotation. Human annotations carry personal biases in understanding, the meaning of vocabulary changes constantly over time, and manual methods are not always practical. To address the issue, we propose an unsupervised method based on a deep clustering network (DCN) to build a sememe KB; the method can be used to build a KB for any language. We first learn distributed representations of multilingual words, use MUSE to align them in a single vector space, learn the multi-layer meaning of each word through the self-attention mechanism, and use a DCN to cluster sememe features. Finally, we complete the prediction using only a 10-dimensional sememe space in English. We find that the low-dimensional space can still retain the main features of the sememes.


Automatic Construction of Sememe Knowledge Bases via Dictionaries

arXiv.org Artificial Intelligence

A sememe is defined as the minimum semantic unit in linguistics. Sememe knowledge bases (SKBs), which comprise words annotated with sememes, enable sememes to be applied to natural language processing. So far a large body of research has showcased the unique advantages and effectiveness of SKBs in various tasks. However, most languages have no SKBs, and manual construction of SKBs is time-consuming and labor-intensive. To tackle this challenge, we propose a simple and fully automatic method of building an SKB via an existing dictionary. We use this method to build an English SKB and a French SKB, and conduct comprehensive evaluations from both intrinsic and extrinsic perspectives. Experimental results demonstrate that the automatically built English SKB is even superior to HowNet, the most widely used SKB, which took decades to build manually. Both the English and French SKBs also bring obvious performance enhancement in multiple downstream tasks. All the code and data of this paper (except the copyrighted dictionaries) can be obtained at https://github.com/thunlp/DictSKB.


LET: Linguistic Knowledge Enhanced Graph Transformer for Chinese Short Text Matching

arXiv.org Artificial Intelligence

Chinese short text matching is a fundamental task in natural language processing. Existing approaches usually take Chinese characters or words as input tokens. They have two limitations: 1) Some Chinese words are polysemous, and semantic information is not fully utilized. 2) Some models suffer potential issues caused by word segmentation. Here we introduce HowNet as an external knowledge base and propose a Linguistic knowledge Enhanced graph Transformer (LET) to deal with word ambiguity. Additionally, we adopt the word lattice graph as input to maintain multi-granularity information. Our model is also complementary to pre-trained language models. Experimental results on two Chinese datasets show that our models outperform various typical text matching approaches. Ablation study also indicates that both semantic information and multi-granularity information are important for text matching modeling.


Multi-channel Reverse Dictionary Model

arXiv.org Artificial Intelligence

A reverse dictionary takes the description of a target word as input and outputs the target word together with other words that match the description. Inspired by the description-to-word inference process of humans, we propose the multi-channel reverse dictionary model, which can mitigate the two problems simultaneously. Our model comprises a sentence encoder and multiple predictors. The predictors are expected to identify different characteristics of the target word from the input query. We evaluate our model on English and Chinese datasets including both dictionary definitions and human-written descriptions. Experimental results show that our model achieves the state-of-the-art performance, and even outperforms the most popular commercial reverse dictionary system on the human-written description dataset. We also conduct quantitative analyses and a case study to demonstrate the effectiveness and robustness of our model. All the code and data of this work can be obtained on https://github.com/thunlp/MultiRD.

Introduction

A regular (forward) dictionary maps words to definitions while a reverse dictionary (Sierra 2000) does the opposite and maps descriptions to corresponding words. In Figure 1, for example, a regular dictionary tells you that "expressway" is "a wide road that allows traffic to travel fast", and when you input "a road where cars go very quickly without stopping" to a reverse dictionary, it might return "expressway" together with other semantically similar words like "freeway".

Reverse dictionaries have great practical value.
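The description-to-word mapping can be illustrated with a toy lookup that ranks dictionary entries by word overlap with the query. This is not the multi-channel neural model described above; it is a minimal sketch with a hypothetical three-entry dictionary (only the "expressway" definition and the query come from the text; the other definitions and the stopword list are invented for illustration).

```python
# Hypothetical stopword list, chosen only for this toy example.
STOPWORDS = {"a", "that", "to", "where", "is", "and", "the", "very", "without", "for"}

# Toy dictionary; only the "expressway" definition is taken from the text.
DEFINITIONS = {
    "expressway": "a wide road that allows traffic to travel fast",
    "freeway": "a road for fast moving cars",
    "kitchen": "a room where food is prepared and cooked",
}


def content_words(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def reverse_lookup(description, definitions=DEFINITIONS):
    """Return all words, best match first, ranked by overlap with the description."""
    query = content_words(description)
    scores = {w: jaccard(query, content_words(d)) for w, d in definitions.items()}
    return sorted(scores, key=lambda w: -scores[w])


ranked = reverse_lookup("a road where cars go very quickly without stopping")
print(ranked)
```

Pure word overlap fails as soon as the query paraphrases the definition ("quickly" versus "fast" do not match here), which is the kind of variability in input queries that learned sentence encoders are meant to handle.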


Towards Building a Multilingual Sememe Knowledge Base: Predicting Sememes for BabelNet Synsets

arXiv.org Artificial Intelligence

A sememe is defined as the minimum semantic unit of human languages. Sememe knowledge bases (KBs), which contain words annotated with sememes, have been successfully applied to many NLP tasks. However, existing sememe KBs are built on only a few languages, which hinders their widespread utilization. To address the issue, we propose to build a unified sememe KB for multiple languages based on BabelNet, a multilingual encyclopedic dictionary. We first build a dataset serving as the seed of the multilingual sememe KB. It contains manually annotated sememes for over 15 thousand synsets (the entries of BabelNet). Then, we present a novel task of automatic sememe prediction for synsets, aiming to expand the seed dataset into a usable KB. We also propose two simple and effective models, which exploit different information of synsets. Finally, we conduct quantitative and qualitative analyses to explore important factors and difficulties in the task. All the source code and data of this work can be obtained on https://github.com/thunlp/BabelNet-Sememe-Prediction.


Open the Boxes of Words: Incorporating Sememes into Textual Adversarial Attack

arXiv.org Artificial Intelligence

Adversarial attacks are carried out to reveal the vulnerability of deep neural networks. Word substitution is a class of effective textual adversarial attack methods, which has been extensively explored. However, all existing studies utilize word embeddings or thesauruses to find substitutes. In this paper, we incorporate sememes, the minimum semantic units, into adversarial attack. We propose an efficient sememe-based word substitution strategy and integrate it into a genetic attack algorithm. In experiments, we employ our attack method to attack LSTM and BERT on both Chinese and English sentiment analysis as well as natural language inference benchmark datasets. Experimental results demonstrate that our model achieves better attack success rates and fewer modifications than the baseline methods based on word embeddings or synonyms. Furthermore, we find that our attack model can bring more robustness enhancement to the target model with adversarial training.
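One way to picture a sememe-based substitution strategy is to treat two words as interchangeable when they have a sense annotated with the same set of sememes. The sketch below assumes that rule; the abstract does not spell out the exact criterion, and the sememe annotations here are invented for illustration rather than taken from HowNet.

```python
# Hypothetical annotations: each word maps to a list of senses,
# and each sense is a frozenset of sememes.
SEMEME_KB = {
    "film": [frozenset({"shows", "entertainment"})],
    "movie": [frozenset({"shows", "entertainment"})],
    "picture": [frozenset({"image"}), frozenset({"shows", "entertainment"})],
    "book": [frozenset({"publications"})],
}


def substitutes(word, kb=SEMEME_KB):
    """Return words sharing at least one sense with an identical sememe set."""
    senses = set(kb.get(word, []))
    return sorted(
        other
        for other, other_senses in kb.items()
        if other != word and senses & set(other_senses)
    )


print(substitutes("film"))  # ['movie', 'picture']
print(substitutes("book"))  # []
```

In an attack setting, a candidate list like this would be the search space for each position in the input, with a search procedure (a genetic algorithm in the abstract above) choosing which substitutions to apply.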