Grammars & Parsing
Tracking Progress in Natural Language Processing
This document aims to track the progress in Natural Language Processing (NLP) and give an overview of the state-of-the-art (SOTA) across the most common NLP tasks and their corresponding datasets. It aims to cover both traditional and core NLP tasks such as dependency parsing and part-of-speech tagging as well as more recent ones such as reading comprehension and natural language inference. The main objective is to provide the reader with a quick overview of benchmark datasets and the state-of-the-art for their task of interest, which serves as a stepping stone for further research. To this end, if there is a place where results for a task are already published and regularly maintained, such as a public leaderboard, the reader will be pointed there. If you want to find this document again in the future, just go to nlpprogress.com
Establishing Strong Baselines for the New Decade: Sequence Tagging, Syntactic and Semantic Parsing with BERT
He, Han (Emory University ) | Choi, Jinho (Emory University)
This paper presents new state-of-the-art models for three tasks, part-of-speech tagging, syntactic parsing, and semantic parsing, using the cutting-edge contextualized embedding framework known as BERT. For each task, we first replicate and simplify the current state-of-the-art approach to enhance its model efficiency. We then evaluate our simplified approaches on those three tasks using token embeddings generated by BERT. 12 datasets in both English and Chinese are used for our experiments. The BERT models outperform the previously best-performing models by 2.5% on average (7.5% for the most significant case). All models and source codes are available in public so that researchers can improve upon and utilize them to establish strong baselines for the next decade.
Enhancing Answer Boundary Detection for Multilingual Machine Reading Comprehension
Yuan, Fei, Shou, Linjun, Bai, Xuanyu, Gong, Ming, Liang, Yaobo, Duan, Nan, Fu, Yan, Jiang, Daxin
Multilingual pre-trained models could leverage the training data from a rich source language (such as English) to improve performance on low resource languages. However, the transfer quality for multilingual Machine Reading Comprehension (MRC) is significantly worse than sentence classification tasks mainly due to the requirement of MRC to detect the word level answer boundary. In this paper, we propose two auxiliary tasks in the fine-tuning stage to create additional phrase boundary supervision: (1) A mixed MRC task, which translates the question or passage to other languages and builds cross-lingual question-passage pairs; (2) A language-agnostic knowledge masking task by leveraging knowledge phrases mined from web. Besides, extensive experiments on two cross-lingual MRC datasets show the effectiveness of our proposed approach.
Knowledge Graphs @ ICLR 2020
It's great to see more research and more datasets on complex QA and reasoning tasks. Whereas last year we saw a surge of multi-hop reading comprehension datasets (e.g., HotpotQA), this year at ICLR there is a strong line-up of papers dedicated to studying compositionality and logical complexity: and here KGs are of big help! Keysers et al study how to measure compositional generalization of QA models, i.e., when train and test splits operate on the same set of entities (broadly, logical atoms), but the composition of such atoms is different. The authors design a new large KGQA dataset CFQ (Compositional Freebase Questions) comprised of about 240K questions of 35K SPARQL query patterns. Several fascinating points 1) the questions are annotated with EL Description Logic (yes, those were the times around 2005 when DL meant mostly Description Logic, not Deep Learning); 2) as the dataset is positioned towards semantic parsing, all questions already have linked Freebase IDs (URIs), so you don't need to plug in your favourite Entity Linking system (like ElasticSearch).
An Imitation Game for Learning Semantic Parsers from User Interaction
Yao, Ziyu, Tang, Yiqi, Yih, Wen-tau, Sun, Huan, Su, Yu
Despite the widely successful applications, bootstrapping and fine-tuning semantic parsers are still a tedious process with challenges such as costly data annotation and privacy risks. In this paper, we suggest an alternative, human-in-the-loop methodology for learning semantic parsers directly from users. A semantic parser should be introspective of its uncertainties and prompt for user demonstration when uncertain. In doing so it also gets to imitate the user behavior and continue improving itself autonomously with the hope that eventually it may become as good as the user in interpreting their questions. To combat the sparsity of demonstration, we propose a novel annotation-efficient imitation learning algorithm, which iteratively collects new datasets by mixing demonstrated states and confident predictions and re-trains the semantic parser in a Dataset Aggregation fashion (Ross et al., 2011). We provide a theoretical analysis of its cost bound and also empirically demonstrate its promising performance on the text-to-SQL problem.
Synthetic vs. Real Reference Strings for Citation Parsing, and the Importance of Re-training and Out-Of-Sample Data for Meaningful Evaluations: Experiments with GROBID, GIANT and Cora
Citation parsing, particularly with deep neural networks, suffers from a lack of training data as available datasets typically contain only a few thousand training instances. Manually labelling citation strings is very time-consuming, hence synthetically created training data could be a solution. However, as of now, it is unknown if synthetically created reference-strings are suitable to train machine learning algorithms for citation parsing. To find out, we train Grobid, which uses Conditional Random Fields, with a) human-labelled reference strings from 'real' bibliographies and b) synthetically created reference strings from the GIANT dataset. We find that both synthetic and organic reference strings are equally suited for training Grobid (F1 = 0.74). We additionally find that retraining Grobid has a notable impact on its performance, for both synthetic and real data (+30% in F1). Having as many types of labelled fields as possible during training also improves effectiveness, even if these fields are not available in the evaluation data (+13.5% F1). We conclude that synthetic data is suitable for training (deep) citation parsing models. We further suggest that in future evaluations of reference parsers both evaluation data similar and dissimilar to the training data should be used for more meaningful evaluations.
AI4Narratives
Narratives are an important human tool for communication, representation and understanding. Natural Language Processing already offers many instruments that enable the automatic extraction of narrative elements from texts, including Named Entity Recognition, Semantic Role Labeling, Sentiment Analysis, Anaphora Resolution, Temporal Reasoning, etc. The storyfication of data is being used to generate textual reports on finance and sports, among others. Timelines and infographics can be employed to represent in a more compact way automatically identified narrative chains in a large set of news articles, assisting human readers in grasping complex stories with different moments and a network of characters. While the Automatic Generation of Text shows impressive results towards computational creativity, it still needs to develop means for controlling the narrative intent of the output.
A Methodology for Creating Question Answering Corpora Using Inverse Data Annotation
Deriu, Jan, Mlynchyk, Katsiaryna, Schläpfer, Philippe, Rodrigo, Alvaro, von Grünigen, Dirk, Kaiser, Nicolas, Stockinger, Kurt, Agirre, Eneko, Cieliebak, Mark
In this paper, we introduce a novel methodology to efficiently construct a corpus for question answering over structured data. For this, we introduce an intermediate representation that is based on the logical query plan in a database called Operation Trees (OT). This representation allows us to invert the annotation process without losing flexibility in the types of queries that we generate. Furthermore, it allows for fine-grained alignment of query tokens to OT operations. In our method, we randomly generate OTs from a context-free grammar. Afterwards, annotators have to write the appropriate natural language question that is represented by the OT. Finally, the annotators assign the tokens to the OT operations. We apply the method to create a new corpus OTTA (Operation Trees and Token Assignment), a large semantic parsing corpus for evaluating natural language interfaces to databases. We compare OTTA to Spider and LC-QuaD 2.0 and show that our methodology more than triples the annotation speed while maintaining the complexity of the queries. Finally, we train a state-of-the-art semantic parsing model on our data and show that our corpus is a challenging dataset and that the token alignment can be leveraged to increase the performance significantly.
Compositionality Decomposed: How do Neural Networks Generalise?
Hupkes, Dieuwke (University of Amsterdam) | Dankers, Verna (University of Amsterdam) | Mul, Mathijs | Bruni, Elia (UPF)
Despite a multitude of empirical studies, little consensus exists on whether neural networks are able to generalise compositionally, a controversy that, in part, stems from a lack of agreement about what it means for a neural model to be compositional. As a response to this controversy, we present a set of tests that provide a bridge between, on the one hand, the vast amount of linguistic and philosophical theory about compositionality of language and, on the other, the successful neural models of language. We collect different interpretations of compositionality and translate them into five theoretically grounded tests for models that are formulated on a task-independent level. In particular, we provide tests to investigate (i) if models systematically recombine known parts and rules (ii) if models can extend their predictions beyond the length they have seen in the training data (iii) if models' composition operations are local or global (iv) if models' predictions are robust to synonym substitutions and (v) if models favour rules or exceptions during training. To demonstrate the usefulness of this evaluation paradigm, we instantiate these five tests on a highly compositional data set which we dub PCFG SET and apply the resulting tests to three popular sequence-to-sequence models: a recurrent, a convolution-based and a transformer model. We provide an in-depth analysis of the results, which uncover the strengths and weaknesses of these three architectures and point to potential areas of improvement.