hownet
Co-Driven Recognition of Semantic Consistency via the Fusion of Transformer and HowNet Sememes Knowledge
Chen, Fan, Huang, Yan, Zhang, Xinfang, Luo, Kang, Zhu, Jinxuan, He, Ruixian
Semantic consistency recognition aims to detect and judge whether the semantics of two text sentences are consistent with each other. However, the existing methods usually encounter the challenges of synonyms, polysemy and difficulty to understand long text. To solve the above problems, this paper proposes a co-driven semantic consistency recognition method based on the fusion of Transformer and HowNet sememes knowledge. Multi-level encoding of internal sentence structures via data-driven is carried out firstly by Transformer, sememes knowledge base HowNet is introduced for knowledge-driven to model the semantic knowledge association among sentence pairs. Then, interactive attention calculation is carried out utilizing soft-attention and fusion the knowledge with sememes matrix. Finally, bidirectional long short-term memory network (BiL-STM) is exploited to encode the conceptual semantic information and infer the semantic consistency. Experiments are conducted on two financial text matching datasets (BQ, AFQMC) and a cross-lingual adversarial dataset (PAWSX) for paraphrase identification. Compared with lightweight models including DSSM, MwAN, DRCN, and pre-training models such as ERNIE etc., the proposed model can not only improve the accuracy of semantic consistency recognition effectively (by 2.19%, 5.57% and 6.51% compared with the DSSM, MWAN and DRCN models on the BQ dataset), but also reduce the number of model parameters (to about 16M). In addition, driven by the HowNet sememes knowledge, the proposed method is promising to adapt to scenarios with long text.
Roof-Transformer: Divided and Joined Understanding with Knowledge Enhancement
Liao, Wei-Lin, Su, Cheng-En, Ma, Wei-Yun
Recent work on enhancing BERT-based language representation models with knowledge graphs (KGs) and knowledge bases (KBs) has yielded promising results on multiple NLP tasks. State-of-the-art approaches typically integrate the original input sentences with KG triples and feed the combined representation into a BERT model. However, as the sequence length of a BERT model is limited, such a framework supports little knowledge other than the original input sentences and is thus forced to discard some knowledge. This problem is especially severe for downstream tasks for which the input is a long paragraph or even a document, such as QA or reading comprehension tasks. We address this problem with Roof-Transformer, a model with two underlying BERTs and a fusion layer on top. One underlying BERT encodes the knowledge resources and the other one encodes the original input sentences, and the fusion layer integrates the two resultant encodings. Experimental results on a QA task and the GLUE benchmark attest the effectiveness of the proposed model.
Automatic Construction of Sememe Knowledge Bases via Dictionaries
Qi, Fanchao, Chen, Yangyi, Wang, Fengyu, Liu, Zhiyuan, Chen, Xiao, Sun, Maosong
A sememe is defined as the minimum semantic unit in linguistics. Sememe knowledge bases (SKBs), which comprise words annotated with sememes, enable sememes to be applied to natural language processing. So far a large body of research has showcased the unique advantages and effectiveness of SKBs in various tasks. However, most languages have no SKBs, and manual construction of SKBs is time-consuming and labor-intensive. To tackle this challenge, we propose a simple and fully automatic method of building an SKB via an existing dictionary. We use this method to build an English SKB and a French SKB, and conduct comprehensive evaluations from both intrinsic and extrinsic perspectives. Experimental results demonstrate that the automatically built English SKB is even superior to HowNet, the most widely used SKB that takes decades to build manually. And both the English and French SKBs can bring obvious performance enhancement in multiple downstream tasks. All the code and data of this paper (except the copyrighted dictionaries) can be obtained at https://github.com/thunlp/DictSKB.
LET: Linguistic Knowledge Enhanced Graph Transformer for Chinese Short Text Matching
Lyu, Boer, Chen, Lu, Zhu, Su, Yu, Kai
Chinese short text matching is a fundamental task in natural language processing. Existing approaches usually take Chinese characters or words as input tokens. They have two limitations: 1) Some Chinese words are polysemous, and semantic information is not fully utilized. 2) Some models suffer potential issues caused by word segmentation. Here we introduce HowNet as an external knowledge base and propose a Linguistic knowledge Enhanced graph Transformer (LET) to deal with word ambiguity. Additionally, we adopt the word lattice graph as input to maintain multi-granularity information. Our model is also complementary to pre-trained language models. Experimental results on two Chinese datasets show that our models outperform various typical text matching approaches. Ablation study also indicates that both semantic information and multi-granularity information are important for text matching modeling.
Towards Building a Multilingual Sememe Knowledge Base: Predicting Sememes for BabelNet Synsets
Qi, Fanchao, Chang, Liang, Sun, Maosong, Ouyang, Sicong, Liu, Zhiyuan
A sememe is defined as the minimum semantic unit of human languages. Sememe knowledge bases (KBs), which contain words annotated with sememes, have been successfully applied to many NLP tasks. However, existing sememe KBs are built on only a few languages, which hinders their widespread utilization. To address the issue, we propose to build a unified sememe KB for multiple languages based on BabelNet, a multilingual encyclopedic dictionary. We first build a dataset serving as the seed of the multilingual sememe KB. It manually annotates sememes for over $15$ thousand synsets (the entries of BabelNet). Then, we present a novel task of automatic sememe prediction for synsets, aiming to expand the seed dataset into a usable KB. We also propose two simple and effective models, which exploit different information of synsets. Finally, we conduct quantitative and qualitative analyses to explore important factors and difficulties in the task. All the source code and data of this work can be obtained on https://github.com/thunlp/BabelNet-Sememe-Prediction.
Incorporating Chinese Characters of Words for Lexical Sememe Prediction
Jin, Huiming, Zhu, Hao, Liu, Zhiyuan, Xie, Ruobing, Sun, Maosong, Lin, Fen, Lin, Leyu
Sememes are minimum semantic units of concepts in human languages, such that each word sense is composed of one or multiple sememes. Words are usually manually annotated with their sememes by linguists, and form linguistic common-sense knowledge bases widely used in various NLP tasks. Recently, the lexical sememe prediction task has been introduced. It consists of automatically recommending sememes for words, which is expected to improve annotation efficiency and consistency. However, existing methods of lexical sememe prediction typically rely on the external context of words to represent the meaning, which usually fails to deal with low-frequency and out-of-vocabulary words. To address this issue for Chinese, we propose a novel framework to take advantage of both internal character information and external context information of words. We experiment on HowNet, a Chinese sememe knowledge base, and demonstrate that our framework outperforms state-of-the-art baselines by a large margin, and maintains a robust performance even for low-frequency words.
Chinese LIWC Lexicon Expansion via Hierarchical Classification of Word Embeddings with Sememe Attention
Zeng, Xiangkai (Beihang University) | Yang, Cheng (Tsinghua University) | Tu, Cunchao (Tsinghua University) | Liu, Zhiyuan (Tsinghua University) | Sun, Maosong (Tsinghua University)
Linguistic Inquiry and Word Count (LIWC) is a word counting software tool which has been used for quantitative text analysis in many fields. Due to its success and popularity, the core lexicon has been translated into Chinese and many other languages. However, the lexicon only contains several thousand of words, which is deficient compared with the number of common words in Chinese. Current approaches often require manually expanding the lexicon, but it often takes too much time and requires linguistic experts to extend the lexicon. To address this issue, we propose to expand the LIWC lexicon automatically. Specifically, we consider it as a hierarchical classification problem and utilize the Sequence-to-Sequence model to classify words in the lexicon. Moreover, we use the sememe information with the attention mechanism to capture the exact meanings of a word, so that we can expand a more precise and comprehensive lexicon. The experimental results show that our model has a better understanding of word meanings with the help of sememes and achieves significant and consistent improvements compared with the state-of-the-art methods. The source code of this paper can be obtained from https://github.com/thunlp/Auto_CLIWC.