 Grammars & Parsing


Dependency Parsing as MRC-based Span-Span Prediction

arXiv.org Artificial Intelligence

Higher-order methods for dependency parsing can partially, but not fully, address the issue that edges in a dependency tree should be constructed at the text span/subtree level rather than the word level. This shortcoming can cause the tree rooted at a certain word to cover an incorrect span even though the word is correctly linked to its head. In this paper, we propose a new method for dependency parsing to address this issue. The proposed method constructs dependency trees by directly modeling span-span (in other words, subtree-subtree) relations. It consists of two modules: the text span proposal module, which proposes candidate text spans, each of which represents a subtree in the dependency tree denoted by (root, start, end); and the span linking module, which constructs links between proposed spans. We use the machine reading comprehension (MRC) framework as the backbone to formalize the span linking module in an MRC setup, where one span is used as a query to extract the text span/subtree it should be linked to. The proposed method comes with the following merits: (1) it addresses the fundamental problem that edges in a dependency tree should be constructed between subtrees; (2) the MRC framework allows the method to retrieve spans missed in the span proposal stage, which leads to higher recall for eligible spans. Extensive experiments on the PTB, CTB and Universal Dependencies (UD) benchmarks demonstrate the effectiveness of the proposed method. We achieve new SOTA performance on the PTB and UD benchmarks, and performance competitive with previous SOTA models on the CTB dataset. Code is available at https://github.com/ShannonAI/mrc-for-dependency-parsing.
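To make the (root, start, end) notation concrete, here is a minimal sketch that derives the subtree span of every word from a gold head array. It is not the authors' span proposal module, which predicts spans with a neural model; the function name `subtree_spans`, the 1-based indexing convention, and the toy sentence are assumptions made for illustration.

```python
from typing import List, Tuple


def subtree_spans(heads: List[int]) -> List[Tuple[int, int, int]]:
    """Return (root, start, end) for the subtree rooted at each word.

    heads[i] is the 1-based position of the head of word i + 1; 0 marks the
    sentence root. All positions in the result are 1-based and inclusive.
    Assumes the heads form a well-formed tree (no cycles).
    """
    n = len(heads)
    start = list(range(1, n + 1))
    end = list(range(1, n + 1))
    for word in range(1, n + 1):
        # Walk from this word up to the root, widening every ancestor's span.
        node = heads[word - 1]
        while node != 0:
            start[node - 1] = min(start[node - 1], word)
            end[node - 1] = max(end[node - 1], word)
            node = heads[node - 1]
    return [(i + 1, start[i], end[i]) for i in range(n)]


# Toy sentence (hypothetical): "She reads old books"
# "reads" is the root; "She" and "books" attach to "reads"; "old" to "books".
heads = [2, 0, 4, 2]
spans = subtree_spans(heads)
# spans == [(1, 1, 1), (2, 1, 4), (3, 3, 3), (4, 3, 4)]
```

In this toy tree, the span (4, 3, 4) for "old books" is the unit that would be linked to the span (2, 1, 4) rooted at "reads". For projective trees every (start, end) range covers exactly the words of the subtree; for non-projective trees the range may include words outside it.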


Doing Natural Language Processing in A Natural Way: An NLP toolkit based on object-oriented knowledge base and multi-level grammar base

arXiv.org Artificial Intelligence

We introduce an NLP toolkit based on an object-oriented knowledge base and a multi-level grammar base. The toolkit focuses on semantic parsing and can also discover new knowledge and grammar automatically. Newly discovered knowledge and grammar are checked by a human and then used to update the knowledge base and grammar base. This process can be iterated many times to improve the toolkit continuously.


Neural Text Generation with Part-of-Speech Guided Softmax

arXiv.org Artificial Intelligence

Neural text generation models are likely to suffer from the low-diversity problem. Various decoding strategies and training-based methods have been proposed to promote diversity only by exploiting contextual features, but rarely do they consider incorporating syntactic structure clues. In this work, we propose using linguistic annotation, i.e., part-of-speech (POS), to guide the text generation. In detail, we introduce POS Guided Softmax (POSG-Softmax) to explicitly model two posterior probabilities: (i) next-POS, and (ii) next-token from the vocabulary of the target POS. A POS guided sampling strategy is further proposed to address the low-diversity problem by enriching the diversity of POS. Extensive experiments and human evaluations demonstrate that, compared with existing state-of-the-art methods, our proposed methods can generate more diverse text while maintaining comparable quality.
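The two-step factorization described above (first a POS tag, then a token within that tag's vocabulary) can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: it assumes every token belongs to exactly one POS tag, and the names `pos_guided_distribution`, `pos_scores`, `token_scores`, and `token_pos` are hypothetical.

```python
import math
from typing import Dict, List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pos_guided_distribution(
    pos_scores: Dict[str, float],
    token_scores: Dict[str, float],
    token_pos: Dict[str, str],
) -> Dict[str, float]:
    """Compute p(token) = p(tag) * p(token | tag) under a one-tag-per-token assumption."""
    tags = sorted(pos_scores)
    p_pos = dict(zip(tags, softmax([pos_scores[t] for t in tags])))
    result: Dict[str, float] = {}
    for tag in tags:
        members = sorted(tok for tok, t in token_pos.items() if t == tag)
        if not members:
            # A tag with no tokens would leak probability mass; the toy
            # example below gives every tag at least one token.
            continue
        # Softmax restricted to the vocabulary of this tag.
        p_tok = softmax([token_scores[tok] for tok in members])
        for tok, p in zip(members, p_tok):
            result[tok] = p_pos[tag] * p
    return result


pos_scores = {"NOUN": 1.0, "VERB": 0.0}
token_scores = {"cat": 2.0, "dog": 1.0, "runs": 0.5}
token_pos = {"cat": "NOUN", "dog": "NOUN", "runs": "VERB"}
dist = pos_guided_distribution(pos_scores, token_scores, token_pos)
```

Because "runs" is the only VERB, its probability equals p(VERB). Sampling the tag first and the token second is what lets a decoding strategy vary the POS sequence directly.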


PingAn-VCGroup's Solution for ICDAR 2021 Competition on Scientific Literature Parsing Task B: Table Recognition to HTML

arXiv.org Artificial Intelligence

Task B of the ICDAR 2021 competition on scientific literature parsing is to reconstruct a table image as HTML code. In this competition, the PubTabNet dataset (v2.0.0) [3] is provided as the official evaluation data, and the Tree-Edit-Distance-based Similarity (TEDS) metric is used for evaluation. The PubTabNet dataset consists of 500,777 training samples, 9,115 validation samples, 9,138 samples for the development stage, and 9,064 samples for the final evaluation stage. For the training and validation data, the ground truth HTML codes and the position of non-empty table cells are provided to the participants. Participants of this competition need to develop a model that can convert images of tabular data into the corresponding HTML code, which should correctly represent the structure of the table and the content of each cell. The labels of the samples for the development and final evaluation stages are withheld by the organizers. We divide this task into four sub-tasks: table structure recognition, text line detection, text line recognition, and box assignment. We also try several tricks to improve the model. The details of each sub-task will be discussed in the following section.
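For readers unfamiliar with TEDS, the metric is usually stated as follows. The formula is recalled from the metric's commonly cited definition rather than taken from this abstract, so treat the notation as an assumption: T_a and T_b are the predicted and ground-truth HTML tables viewed as trees, EditDist is the tree edit distance between them, and |T| is the number of nodes in T.

```latex
\mathrm{TEDS}(T_a, T_b) = 1 - \frac{\mathrm{EditDist}(T_a, T_b)}{\max\left(|T_a|, |T_b|\right)}
```

Under this definition, identical trees score 1, and the score falls toward 0 as more node edits are needed to turn one table into the other.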


NLP - Natural Language Processing with Python

#artificialintelligence

Welcome to the best Natural Language Processing course on the internet! This course is designed to be your complete online resource for learning how to use Natural Language Processing with the Python programming language. In the course we will cover everything you need to learn in order to become a world-class practitioner of NLP with Python. We'll start off with the basics, learning how to open and work with text and PDF files with Python, as well as learning how to use regular expressions to search for custom patterns inside text files. Afterwards we will begin with the basics of Natural Language Processing, utilizing the Natural Language Toolkit library for Python, as well as the state-of-the-art spaCy library for ultra-fast tokenization, parsing, entity recognition, and lemmatization of text.


Learning Syntax from Naturally-Occurring Bracketings

arXiv.org Artificial Intelligence

Naturally-occurring bracketings, such as answer fragments to natural language questions and hyperlinks on webpages, can reflect human syntactic intuition regarding phrasal boundaries. Their availability and approximate correspondence to syntax make them appealing as distant information sources to incorporate into unsupervised constituency parsing. But they are noisy and incomplete; to address this challenge, we develop a partial-brackets-aware structured ramp loss in learning. Experiments demonstrate that our distantly-supervised models trained on naturally-occurring bracketing data are more accurate in inducing syntactic structures than competing unsupervised systems. On the English WSJ corpus, our models achieve an unlabeled F1 score of 68.9 for constituency parsing.
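The unlabeled F1 score reported above compares predicted and gold constituent spans without regard to their labels. The sketch below shows the basic calculation for one sentence. Evaluation protocols differ in details such as whether trivial spans are discarded and whether scores are averaged per sentence or over the corpus; none of those choices are taken from the paper, and the function name `unlabeled_f1` and the example spans are hypothetical.

```python
from typing import Set, Tuple

Span = Tuple[int, int]  # (start, end) token offsets of a bracket


def unlabeled_f1(predicted: Set[Span], gold: Set[Span]) -> float:
    """F1 over bracket spans, ignoring constituent labels."""
    if not predicted or not gold:
        return 0.0
    matched = len(predicted & gold)
    if matched == 0:
        return 0.0
    precision = matched / len(predicted)
    recall = matched / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = {(0, 5), (0, 2), (2, 5), (3, 5)}
predicted = {(0, 5), (0, 2), (2, 5), (2, 4)}
score = unlabeled_f1(predicted, gold)  # 3 of 4 brackets match on each side: 0.75
```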


Diversity-Aware Batch Active Learning for Dependency Parsing

arXiv.org Artificial Intelligence

While the predictive performance of modern statistical dependency parsers relies heavily on the availability of expensive expert-annotated treebank data, not all annotations contribute equally to the training of the parsers. In this paper, we attempt to reduce the number of labeled examples needed to train a strong dependency parser using batch active learning (AL). In particular, we investigate whether enforcing diversity in the sampled batches, using determinantal point processes (DPPs), can improve over their diversity-agnostic counterparts. Simulation experiments on an English newswire corpus show that selecting diverse batches with DPPs is superior to strong selection strategies that do not enforce batch diversity, especially during the initial stages of the learning process. Additionally, our diversity-aware strategy is robust under a corpus duplication setting, where diversity-agnostic sampling strategies exhibit significant degradation.
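The intuition behind DPP-based batch selection can be illustrated with a small greedy sketch. It is not the paper's sampling procedure: it assumes a quality-times-similarity kernel L_ij = q_i * S_ij * q_j and a greedy approximation that adds, at each step, the item maximizing the determinant of the selected submatrix. The names `greedy_dpp_batch`, `quality`, and `similarity`, and the toy numbers, are hypothetical.

```python
from typing import List


def determinant(matrix: List[List[float]]) -> float:
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
    return det


def greedy_dpp_batch(
    quality: List[float], similarity: List[List[float]], batch_size: int
) -> List[int]:
    """Greedily build a batch that maximizes det(L_S), L_ij = q_i * S_ij * q_j."""
    n = len(quality)
    kernel = [
        [quality[i] * similarity[i][j] * quality[j] for j in range(n)]
        for i in range(n)
    ]
    selected: List[int] = []
    for _ in range(min(batch_size, n)):
        best_item, best_det = None, 0.0
        for cand in range(n):
            if cand in selected:
                continue
            idx = selected + [cand]
            sub = [[kernel[i][j] for j in idx] for i in idx]
            d = determinant(sub)
            if d > best_det:
                best_item, best_det = cand, d
        if best_item is None:  # no candidate adds positive volume
            break
        selected.append(best_item)
    return selected


# Items 0 and 1 are near-duplicates; item 2 is distinct but scored lower.
quality = [1.0, 0.95, 0.8]
similarity = [
    [1.0, 0.99, 0.1],
    [0.99, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
batch = greedy_dpp_batch(quality, similarity, batch_size=2)  # [0, 2]
```

Ranking by quality alone would pick items 0 and 1. The determinant penalizes the near-duplicate pair, so the greedy procedure picks items 0 and 2 instead, which mirrors the corpus duplication setting mentioned in the abstract.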


Sattiy at SemEval-2021 Task 9: An Ensemble Solution for Statement Verification and Evidence Finding with Tables

arXiv.org Artificial Intelligence

Question answering from semi-structured tables can be seen as a semantic parsing task and is significant and practical for pushing the boundary of natural language understanding. Existing research mainly focuses on understanding contents from unstructured evidence, e.g., news, natural language sentences, and documents. The task of verification from structured evidence, such as tables, charts, and databases, is still less explored. This paper describes the sattiy team's system for SemEval-2021 Task 9: Statement Verification and Evidence Finding with Tables (SEM-TAB-FACT). This competition aims to verify statements and to find evidence from tables for scientific articles and to promote the proper interpretation of the surrounding article. In this paper, we exploit ensembles of pre-trained language models over tables, TaPas and TaBERT, for Task A and adjust the result based on some rules extracted for Task B. Finally, on the leaderboard, we attain F1 scores of 0.8496 and 0.7732 in Task A for the 2-way and 3-way evaluation, respectively, and an F1 score of 0.4856 in Task B.


Evaluating the Impact of a Hierarchical Discourse Representation on Entity Coreference Resolution Performance

arXiv.org Artificial Intelligence

The contribution of this paper is an empirical investigation of the impact of including a representation of the hierarchical structure of discourse within a neural entity coreference approach. To this end, we leverage a state-of-the-art RST discourse-parser to convert a flat document into a tree-like structure from which we can derive features that model the structural constraints. We embed this representation within an architecture that is enabled to learn to use this information differentially depending upon the type of mention. The results demonstrate that this level of nuance enables a small but significant improvement in coreference accuracy, even with automatically constructed RST trees. Historically, theories of discourse coherence (Chafe, 1976; Hobbs, 1979; Grosz and Sidner, 1986; Clark and Brennan, 1991) have offered elaborate expositions on how the patterns of anaphoric references in discourse are constrained by limitations in human capacity to manage attention and resolve ambiguity. Hobbs (1979) acknowledges that these human limitations have meant that coreference resolution in natural text can be achieved with relatively high accuracy using a combination of recency and simple semantic constraints. State-of-the-art neural approaches for ...


Natural Language Generation Using Link Grammar for General Conversational Intelligence

arXiv.org Artificial Intelligence

Many current artificial general intelligence (AGI) and natural language processing (NLP) architectures do not possess general conversational intelligence--that is, they either do not deal with language or are unable to convey knowledge in a form similar to the human language without manual, labor-intensive methods such as template-based customization. In this paper, we propose a new technique to automatically generate grammatically valid sentences using the Link Grammar database. This natural language generation method far outperforms current state-of-the-art baselines and may serve as the final component in a proto-AGI question answering pipeline that understandably handles natural language material.