Grammars & Parsing
Measuring and Reducing Model Update Regression in Structured Prediction for NLP
Cai, Deng, Mansimov, Elman, Lai, Yi-An, Su, Yixuan, Shu, Lei, Zhang, Yi
Recent advances in deep learning have led to rapid adoption of machine-learning-based NLP models in a wide range of applications. Despite the continuous gains in accuracy, backward compatibility is also an important aspect for industrial applications, yet it has received little research attention. Backward compatibility requires that the new model does not regress on cases that were correctly handled by its predecessor. This work studies model update regression in structured prediction tasks. We choose syntactic dependency parsing and conversational semantic parsing as representative examples of structured prediction tasks in NLP. First, we measure and analyze model update regression in different model update settings. Next, we explore and benchmark existing techniques for reducing model update regression, including model ensemble and knowledge distillation. We further propose a simple and effective method, Backward-Congruent Re-ranking (BCR), which takes into account the characteristics of structured output. Experiments show that BCR mitigates model update regression better than model ensemble and knowledge distillation approaches.
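The notion of regression described above (cases the old model handled correctly that the new model gets wrong) can be illustrated with a small sketch. This is not the paper's evaluation code: the function name `regression_rate`, the toy data, and the choice of whole-structure exact match as the correctness criterion are all assumptions made for illustration.

```python
from typing import Hashable, Sequence


def regression_rate(gold: Sequence[Hashable],
                    old_pred: Sequence[Hashable],
                    new_pred: Sequence[Hashable]) -> float:
    """Fraction of all examples that the old model got right and the new model gets wrong.

    Assumption: a prediction counts as correct only if the whole structure
    matches the gold structure exactly.
    """
    if not (len(gold) == len(old_pred) == len(new_pred)):
        raise ValueError("all sequences must have the same length")
    if not gold:
        return 0.0
    flips = sum(1 for g, o, n in zip(gold, old_pred, new_pred)
                if o == g and n != g)
    return flips / len(gold)


# Toy data (hypothetical): each structure is a tuple of head indices
# for a three-token sentence.
gold = [(2, 0, 2), (0, 1, 1), (2, 0, 2), (0, 1, 2)]
old = [(2, 0, 2), (0, 1, 1), (1, 0, 2), (0, 1, 2)]
new = [(2, 0, 2), (0, 1, 2), (2, 0, 2), (0, 1, 2)]

rate = regression_rate(gold, old, new)
print(f"regression rate: {rate:.2f}")
```

In this toy data both models get three of four sentences right, yet the new model breaks one sentence the old model parsed correctly, which is the kind of case an accuracy comparison alone would hide.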
Neural Character-Level Syntactic Parsing for Chinese
Li, Zuchao, Zhou, Junru, Zhao, Hai, Zhang, Zhisong, Li, Haonan, Ju, Yuqi
In this work, we explore character-level neural syntactic parsing for Chinese with two typical syntactic formalisms: the constituent formalism and a dependency formalism based on a newly released character-level dependency treebank. Prior works in Chinese parsing have struggled with whether to define words when modeling character interactions. We choose to integrate full character-level syntactic dependency relationships using neural representations from character embeddings and richer linguistic syntactic information from human-annotated character-level Parts-Of-Speech and dependency labels. This has the potential to better understand the deeper structure of Chinese sentences and provides a better structural formalism for avoiding unnecessary structural ambiguities. Specifically, we first compare two different character-level syntax annotation styles: constituency and dependency. Then, we discuss two key problems for character-level parsing: (1) how to combine constituent and dependency syntactic structure in full character-level trees and (2) how to convert from character-level to word-level for both constituent and dependency trees. In addition, we also explore several other key parsing aspects, including different character-level dependency annotations and joint learning of Parts-Of-Speech and syntactic parsing. Finally, we evaluate our models on the Chinese Penn Treebank (CTB) and our published Shanghai Jiao Tong University Chinese Character Dependency Treebank (SCDT). The results show the effectiveness of our model on both constituent and dependency parsing. We further provide empirical analysis and suggest several directions for future study.
Abstractions, Their Algorithms, and Their Compilers
Notice it is the second part that distinguishes abstractions in computer science from abstractions in other fields. Each abstraction thus allows us to design algorithms to manipulate data in certain specific ways. We want to design "good" abstractions, where the goodness of an abstraction is multidimensional. The ease with which an abstraction can be used to design solutions is one important metric. For example, we shall discuss in Section 3.1 how the relational model led to the proliferation in the use of databases. There are other performance metrics, such as the running time, on serial or parallel machines, of the resulting algorithms. Likewise, we favor abstractions that are easily implemented and that make it easy to create solutions to important problems. Finally, some abstractions offer a simple way to measure the efficiency of an algorithm (as we can find "big-oh" estimates of the running time of programs in a conventional programming language), while other abstractions require that we specify an implementation at a lower level before we can discuss algorithm efficiency, even approximately.
Static Analysis at GitHub
GitHub, a code-hosting website built atop the Git version-control system, hosts hundreds of millions of repositories of code uploaded by more than 65 million developers. The Semantic Code team at GitHub builds and operates a suite of technologies that power symbolic code navigation on github.com. Symbolic code navigation lets developers click on a named identifier in source code to navigate to the definition of that entity, as well as the reverse: given an identifier, they can list all the uses of that identifier within the project. This system is backed by a cloud object-storage service, having migrated from a multi-terabyte sharded relational database, and serves more than 40,000 requests per minute, across both read and write operations. The static analysis stage itself is built on an open source parsing toolkit called Tree-sitter, implements some well-known computer science research, and integrates with github.com. The system supports nine popular programming languages across six million repositories. Scaling even the most trivial of program analyses to this level entailed significant engineering effort, which is recounted here in the hope that it will serve as a useful guide for those scaling static analysis to large and rapidly changing codebases.
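The two navigation directions described above (jump to a definition, list the uses of an identifier) can be sketched for a single file. This is only an illustration and not GitHub's implementation: it uses Python's standard `ast` module as a stand-in for Tree-sitter, the function `index_symbols` and the `SAMPLE` program are hypothetical, and matching is purely by name with no scope resolution.

```python
import ast
from collections import defaultdict


def index_symbols(source: str):
    """Return (definitions, references), each mapping a name to sorted line numbers.

    Definitions are function and class declarations; references are names
    that are read. Matching is by name only (an assumed simplification).
    """
    definitions = defaultdict(list)
    references = defaultdict(list)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            definitions[node.name].append(node.lineno)
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
            references[node.id].append(node.lineno)
    return ({name: sorted(lines) for name, lines in definitions.items()},
            {name: sorted(lines) for name, lines in references.items()})


SAMPLE = """def area(w, h):
    return w * h

def main():
    print(area(2, 3))
    return area(4, 5)
"""

defs, refs = index_symbols(SAMPLE)
print("area defined on line(s):", defs["area"])
print("area used on line(s):", refs["area"])
```

A production system would also need to resolve scopes, imports, and cross-file references, which is where most of the engineering effort the article describes would lie.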
Log Parsing using Regular Expressions and Scala in Spark - Analytics Vidhya
This article was published as a part of the Data Science Blogathon. In this article, I am going to explain how we can use log parsing with Spark and Scala to get meaningful data from unstructured data. In my experience, after parsing a lot of logs from different sources, I have found that no data is truly unstructured. There is always some meaningful way to look at it and understand it. This is the way that we understand unstructured data.
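The idea of turning a raw log line into named fields with a regular expression can be sketched as follows. The article itself uses Scala and Spark; this sketch uses plain Python instead, and the log layout (Apache Common Log Format), the pattern `LOG_PATTERN`, and the helper `parse_line` are assumptions chosen for illustration rather than the article's own code.

```python
import re
from typing import Optional

# Assumed layout: Apache Common Log Format, e.g.
# host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line: str) -> Optional[dict]:
    """Parse one log line into a dict of fields, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A size of "-" means no body was sent; represent it as None.
    record["size"] = None if record["size"] == "-" else int(record["size"])
    return record


line = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /apache_pb.gif HTTP/1.0" 200 2326')
parsed = parse_line(line)
print(parsed)
```

Returning `None` for lines that do not match keeps malformed input visible to the caller, so it can be counted or logged rather than silently dropped.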
Collocations in Parsing and Translation
Proper identification of collocations (and more generally of multiword expressions, MWEs) is an important qualitative step for several NLP applications, and particularly so for translation. Since many MWEs cannot be translated literally, failure to identify them yields at best inaccurate translation. This paper is mostly concerned with collocations. We will show how they differ from other types of MWEs and how they can be successfully parsed and translated by means of a grammar-based parser and translator.
Part of Speech Tagging
Part of Speech (POS) is a way to describe the grammatical function of a word. In Natural Language Processing (NLP), POS is an essential building block for language models and for interpreting text. While POS tags are used in higher-level functions of NLP, it's important to understand them on their own, and it's possible to leverage them for useful purposes in your text analysis. There are eight (sometimes nine) different parts of speech in English that are commonly defined. Noun: A noun is the name of a person, place, thing, or idea.
Learning Norms via Natural Language Teachings
To interact with humans, artificial intelligence (AI) systems must understand our social world. Within this world, norms play an important role in motivating and guiding agents. However, very few computational theories for learning social norms have been proposed. There also exists a long history of debate on the distinction between what is normal (is) and what is normative (ought). Many have argued that being capable of learning both concepts and recognizing the difference is necessary for all social agents. This paper introduces and demonstrates a computational approach to learning norms from natural language text that accounts for both what is normal and what is normative. It provides a foundation for everyday people to train AI systems about social norms.
AstBERT: Enabling Language Model for Code Understanding with Abstract Syntax Tree
Liang, Rong, Lu, Yujie, Huang, Zhen, Zhang, Tiehua, Liu, Yuze
Using a pre-trained language model (e.g., BERT) to understand source code has attracted increasing attention in the natural language processing community. However, there are several challenges when it comes to applying these language models directly to programming language (PL) related problems, the most significant of which is the lack of domain knowledge, which substantially degrades the model's performance. To this end, we propose the AstBERT model, a pre-trained language model aiming to better understand the PL using the abstract syntax tree (AST). Specifically, we collect a colossal amount of source code (both Java and Python) from GitHub and incorporate the contextual code knowledge into our model with the help of code parsers, through which AST information of the source code can be interpreted and integrated. We verify the performance of the proposed model on code information extraction and code search tasks, respectively. Experimental results show that our AstBERT model achieves state-of-the-art performance on both downstream tasks (with 96.4% for the code information extraction task, and 57.12% for the code search task).
CPTAM: Constituency Parse Tree Aggregation Method
Kulkarni, Adithya, Sabetpour, Nasim, Markin, Alexey, Eulenstein, Oliver, Li, Qi
Diverse Natural Language Processing tasks employ constituency parsing to understand the syntactic structure of a sentence according to a phrase structure grammar. Many state-of-the-art constituency parsers have been proposed, but they may provide different results for the same sentences, especially for corpora outside their training domains. This paper adopts the truth discovery idea to aggregate constituency parse trees from different parsers by estimating their reliability in the absence of ground truth. Our goal is to consistently obtain high-quality aggregated constituency parse trees. We formulate the constituency parse tree aggregation problem in two steps: structure aggregation and constituent label aggregation. Specifically, we propose the first truth discovery solution for tree structures by minimizing the weighted sum of Robinson-Foulds (RF) distances, a classic symmetric distance metric between two trees. Extensive experiments are conducted on benchmark datasets in different languages and domains. The experimental results show that our method, CPTAM, outperforms the state-of-the-art aggregation baselines. We also demonstrate that the weights estimated by CPTAM can adequately evaluate constituency parsers in the absence of ground truth.
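The quantity being minimized, a weighted sum of Robinson-Foulds distances, can be made concrete with a small sketch. This is not the CPTAM algorithm: it only evaluates the objective for a given candidate tree. The nested-tuple tree encoding, the use of token spans as the clusters compared by RF, and the names `spans`, `rf_distance`, and `weighted_rf` are assumptions for illustration, and the parser weights in the example are made up.

```python
def spans(tree, start=0, out=None):
    """Collect the (start, end) token span of every internal node.

    A tree is a tuple (label, child, ...); a leaf is a plain string token.
    Returns (end_position, set_of_spans). Labels are ignored on purpose,
    so only the bracketing structure is compared.
    """
    if out is None:
        out = set()
    if isinstance(tree, str):      # a leaf covers exactly one token
        return start + 1, out
    end = start
    for child in tree[1:]:         # tree[0] is the constituent label
        end, _ = spans(child, end, out)
    out.add((start, end))
    return end, out


def rf_distance(tree_a, tree_b):
    """Number of spans that appear in exactly one of the two trees."""
    _, a = spans(tree_a)
    _, b = spans(tree_b)
    return len(a ^ b)


def weighted_rf(candidate, trees, weights):
    """Weighted sum of RF distances from a candidate tree to each parser's tree."""
    return sum(w * rf_distance(candidate, t) for t, w in zip(trees, weights))


# Two hypothetical parses of the same four-token sentence.
tree_a = ("S", ("NP", "I"), ("VP", "saw", ("NP", "the", "man")))
tree_b = ("S", ("NP", "I"), ("VP", ("X", "saw", "the"), "man"))

print(rf_distance(tree_a, tree_b))
print(weighted_rf(tree_a, [tree_a, tree_b], [0.7, 0.3]))
```

The two trees differ only in how they bracket "saw the man", so each contributes one span the other lacks. With a higher weight on the first parser, choosing `tree_a` as the candidate costs less than choosing `tree_b` would.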