Grammars & Parsing
Build a traceable, custom, multi-format document parsing pipeline with Amazon Textract
Organizational forms serve as a primary business tool across industries, from financial services to healthcare and beyond. Consider, for example, tax filing forms in the tax management industry, where new forms come out each year with largely the same information. AWS customers across sectors need to process and store information in forms as part of their daily business practice. These forms often serve as a primary means for information to flow into an organization where technological means of data capture are impractical. Over the years of offering Amazon Textract, we have also observed that AWS customers frequently version their organizational forms to reflect structural changes, added or changed fields, or other considerations such as a change of year or version of the form.
Towards Lithuanian grammatical error correction
Stankevičius, Lukas, Lukoševičius, Mantas
Everyone wants to write beautiful and correct text, yet a lack of language skills or experience, or hasty typing, can result in errors. By employing the recent advances in transformer architectures, we construct a grammatical error correction model for Lithuanian, a language rich in archaic features. We compare subword and byte-level approaches and share our best trained model, achieving F$_{0.5}$=0.92, and accompanying code, in an online open-source repository.
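For readers unfamiliar with the F$_{0.5}$ metric quoted above, the following is a minimal sketch of the standard F-beta formula, which weights precision more heavily than recall when beta is below 1. The function name `f_beta` and the example precision and recall values are illustrative assumptions, not figures from the paper.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Weighted harmonic mean of precision and recall (standard F-beta)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid dividing by zero when nothing was predicted correctly
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical values: a corrector that is precise but misses some errors.
score_f05 = f_beta(0.9, 0.6, beta=0.5)
score_f1 = f_beta(0.9, 0.6, beta=1.0)
```

With beta = 0.5, a system whose precision exceeds its recall scores higher than it would under F1, which suits error correction, where a wrong "fix" is usually worse than a missed one.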
KinyaBERT: a Morphology-aware Kinyarwanda Language Model
Nzeyimana, Antoine, Rubungo, Andre Niyongabo
Pre-trained language models such as BERT have been successful at tackling many natural language processing tasks. However, the unsupervised sub-word tokenization methods commonly used in these models (e.g., byte-pair encoding, BPE) are sub-optimal at handling morphologically rich languages. Even given a morphological analyzer, naive sequencing of morphemes into a standard BERT architecture is inefficient at capturing morphological compositionality and expressing word-relative syntactic regularities. We address these challenges by proposing a simple yet effective two-tier BERT architecture that leverages a morphological analyzer and explicitly represents morphological compositionality. Despite the success of BERT, most of its evaluations have been conducted on high-resource languages, obscuring its applicability to low-resource languages. We evaluate our proposed method on the low-resource, morphologically rich Kinyarwanda language, naming the proposed model architecture KinyaBERT. A robust set of experimental results reveals that KinyaBERT outperforms solid baselines by 2% in F1 score on a named entity recognition task and by 4.3% in average score on a machine-translated GLUE benchmark. KinyaBERT fine-tuning has better convergence and achieves more robust results on multiple tasks even in the presence of translation noise.
Thomason
Intelligent robots frequently need to understand requests from naive users through natural language. Previous approaches either cannot account for language variation, e.g., keyword search, or require gathering large annotated corpora, which can be expensive and cannot adapt to new variation. We introduce a dialog agent for mobile robots that understands human instructions through semantic parsing, actively resolves ambiguities using a dialog manager, and incrementally learns from human-robot conversations by inducing training data from user paraphrases. Our dialog agent is implemented and tested both on a web interface with hundreds of users via Mechanical Turk and on a mobile robot over several days, tasked with understanding navigation and delivery requests through natural language in an office environment. In both contexts, we observe significant improvements in user satisfaction after learning from conversations.
Li
Text normalization and part-of-speech (POS) tagging for social media data have been investigated recently; however, prior work has treated them separately. In this paper, we propose a joint Viterbi decoding process to determine each token's POS tag and each non-standard token's correct form at the same time. In order to evaluate our approach, we create two new data sets with POS tag labels and non-standard tokens' correct forms. These are the first data sets with such annotation.
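The idea of decoding tags and normalized forms together can be sketched as ordinary Viterbi search over states that are (candidate form, POS tag) pairs. This is a minimal illustration under assumed inputs, not the authors' model: the function `joint_viterbi`, the scoring callbacks, and the toy scores are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

State = Tuple[str, str]  # (candidate normalized form, POS tag)


def joint_viterbi(
    candidates: Sequence[Sequence[State]],
    local_score: Callable[[int, State], float],
    transition_score: Callable[[State, State], float],
) -> List[State]:
    """Return the highest-scoring sequence of (form, tag) states.

    Assumes every position has at least one candidate state and that
    scores are additive (e.g. log-probabilities).
    """
    if not candidates:
        return []
    # best[i][s] = (score of the best path ending in s at position i, backpointer)
    best: List[Dict[State, Tuple[float, Optional[State]]]] = [
        {s: (local_score(0, s), None) for s in candidates[0]}
    ]
    for i in range(1, len(candidates)):
        column: Dict[State, Tuple[float, Optional[State]]] = {}
        for s in candidates[i]:
            prev, score = max(
                ((p, best[i - 1][p][0] + transition_score(p, s)) for p in candidates[i - 1]),
                key=lambda pair: pair[1],
            )
            column[s] = (score + local_score(i, s), prev)
        best.append(column)
    # Follow backpointers from the best final state.
    state = max(best[-1], key=lambda s: best[-1][s][0])
    path = [state]
    for i in range(len(candidates) - 1, 0, -1):
        state = best[i][state][1]
        path.append(state)
    path.reverse()
    return path


# Toy input "u r": each token is either left as-is (tagged NN) or normalized.
toy_candidates = [[("you", "PRP"), ("u", "NN")], [("are", "VBP"), ("r", "NN")]]
toy_local = {("you", "PRP"): -1.0, ("u", "NN"): -0.5, ("are", "VBP"): -1.0, ("r", "NN"): -0.5}
result = joint_viterbi(
    toy_candidates,
    lambda i, s: toy_local[s],
    lambda p, s: 0.0 if (p[1], s[1]) == ("PRP", "VBP") else -2.0,
)
```

In this toy setup the unnormalized readings have the better local scores, but the tag transition favors PRP followed by VBP, so searching over both decisions at once can recover a reading that a token-by-token choice would miss.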
Zheng
We describe a novel convolutional neural network architecture with a k-max pooling layer that is able to successfully recover the structure of Chinese sentences. This network can capture active features for unseen segments of a sentence to measure how likely the segments are to be merged into constituents. Given an input sentence, after the scores of all possible segments are computed, an efficient dynamic programming parsing algorithm is used to find the globally optimal parse tree. A similar network is then applied to predict syntactic categories for every node in the parse tree.
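The abstract does not spell out the dynamic program, so the following is only a CKY-style sketch of one common formulation: given a score for every segment, find the binary bracketing whose segments have the highest total score. The function `best_binary_tree` and the `span_score` callback (standing in for the network's segment scores) are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

Span = Tuple[int, int]  # half-open interval [i, j) over word positions


def best_binary_tree(n: int, span_score: Callable[[int, int], float]) -> Tuple[float, List[Span]]:
    """Return (total score, spans) of the best binary bracketing of n words."""
    if n <= 0:
        raise ValueError("sentence must contain at least one word")
    # chart[(i, j)] = (best total score for span [i, j), best split point or None)
    chart: Dict[Span, Tuple[float, Optional[int]]] = {}
    for i in range(n):
        chart[(i, i + 1)] = (span_score(i, i + 1), None)
    for length in range(2, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            # Choose the split whose two sub-spans have the best combined score.
            k, inner = max(
                ((k, chart[(i, k)][0] + chart[(k, j)][0]) for k in range(i + 1, j)),
                key=lambda pair: pair[1],
            )
            chart[(i, j)] = (inner + span_score(i, j), k)

    def collect(i: int, j: int) -> List[Span]:
        k = chart[(i, j)][1]
        if k is None:
            return [(i, j)]
        return [(i, j)] + collect(i, k) + collect(k, j)

    return chart[(0, n)][0], collect(0, n)


# Hypothetical scores for a three-word sentence: words 0-1 form a likely constituent.
toy_scores = {(0, 2): 2.0, (1, 3): 0.5}
total, spans = best_binary_tree(3, lambda i, j: toy_scores.get((i, j), 0.0))
```

The chart is filled from short spans to long ones, so each span's best score is computed once and reused, giving cubic time in the sentence length.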
Jiang
Cross-lingual induction aims to acquire linguistic structures for one language by resorting to annotations from another language. It works well for simple structured prediction problems such as part-of-speech tagging and dependency parsing, but lacks significant progress for more complicated problems such as constituency parsing and deep semantic parsing, mainly due to the structural non-isomorphism between languages. We propose a decomposed projection strategy for cross-lingual induction, where cross-lingual projection is performed in units of the fundamental decisions of the structured prediction. Compared with structured projection, which projects complete structures, decomposed projection achieves better adaptation to non-isomorphism between languages and efficiently acquires structured information across languages, thus leading to better performance. For joint cross-lingual induction of constituency and dependency grammars, decomposed cross-lingual induction achieves very significant improvement in both constituency and dependency grammar induction.
He
This paper presents new state-of-the-art models for three tasks: part-of-speech tagging, syntactic parsing, and semantic parsing, using the cutting-edge contextualized embedding framework known as BERT. For each task, we first replicate and simplify the current state-of-the-art approach to enhance its model efficiency. We then evaluate our simplified approaches on those three tasks using token embeddings generated by BERT. The BERT models outperform the previously best-performing models by 2.5% on average (7.5% for the most significant case). All models and source code are publicly available so that researchers can improve upon and utilize them to establish strong baselines for the next decade.
Ryan
Natural language generation (NLG) has been featured in at most a handful of shipped games and interactive stories. This is certainly due to it being a very specialized practice, but another contributing factor is that the state of the art today, in terms of content quality, is simply inadequate. The major benefits of NLG are its alleviation of authorial burden and the capability it gives to a system of generating state-bespoke content, but we believe we can have these benefits without actually employing a full NLG pipeline. In this paper, we present the preliminary design of Expressionist, an in-development mixed-initiative authoring tool that instantiates an authoring scheme residing somewhere between conventional NLG and conventional human content authoring. In this scheme, a human author plays the part of an NLG module in that she starts from a set of deep representations constructed for the game or story domain and proceeds to specify dialogic content that may express those representations. Rather than authoring static dialogue, the author defines a probabilistic context-free grammar that yields templated dialogue. This allows a human author to still harness a computer's generativity, but in a capacity in which it can be trusted: operating over probabilities and treelike control structures. Additional features of Expressionist's design include arbitrary markup and realtime feedback showing currently valid derivations.
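To illustrate how a probabilistic context-free grammar can yield templated dialogue, here is a minimal sampling sketch. It is not Expressionist's implementation: the grammar, the nonterminal names, and the `[npc_name]` template slot are hypothetical, and any symbol without a rule is simply treated as a terminal.

```python
import random
from typing import Dict, List, Tuple

# Each nonterminal maps to weighted productions: (probability, right-hand side).
Grammar = Dict[str, List[Tuple[float, List[str]]]]

GRAMMAR: Grammar = {
    "GREETING": [
        (0.7, ["SALUTATION", "[npc_name]!"]),  # "[npc_name]" is a slot filled at runtime
        (0.3, ["SALUTATION", "stranger."]),
    ],
    "SALUTATION": [
        (0.5, ["Hello,"]),
        (0.5, ["Good evening,"]),
    ],
}


def expand(symbol: str, grammar: Grammar, rng: random.Random) -> str:
    """Recursively expand a symbol, choosing productions by their probabilities."""
    if symbol not in grammar:
        return symbol  # terminal text or an unfilled template slot
    rules = grammar[symbol]
    weights = [prob for prob, _ in rules]
    _, rhs = rng.choices(rules, weights=weights, k=1)[0]
    return " ".join(expand(s, grammar, rng) for s in rhs)


line = expand("GREETING", GRAMMAR, random.Random(0))
```

Because the output keeps template slots such as `[npc_name]` intact, the game can substitute state-specific values later, while the author controls variety only through production weights and the tree structure of the rules.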