Grammars & Parsing
Photon: A Robust Cross-Domain Text-to-SQL System
Zeng, Jichuan, Lin, Xi Victoria, Xiong, Caiming, Socher, Richard, Lyu, Michael R., King, Irwin, Hoi, Steven C. H.
Natural language interfaces to databases (NLIDB) democratize end user access to relational data. Due to fundamental differences between natural language communication and programming, it is common for end users to issue questions that are ambiguous to the system or fall outside the semantic scope of its underlying query language. We present Photon, a robust, modular, cross-domain NLIDB that can flag natural language input to which a SQL mapping cannot be immediately determined. Photon consists of a strong neural semantic parser (63.2\% structure accuracy on the Spider dev benchmark), a human-in-the-loop question corrector, a SQL executor and a response generator. The question corrector is a discriminative neural sequence editor which detects confusion span(s) in the input question and suggests rephrasing until a translatable input is given by the user or a maximum number of iterations are conducted. Experiments on simulated data show that the proposed method effectively improves the robustness of text-to-SQL system against untranslatable user input. The live demo of our system is available at http://naturalsql.com.
Text Complexity Classification Based on Linguistic Information: Application to Intelligent Tutoring of ESL
The goal of this work is to build a classifier that can identify text complexity within the context of teaching reading to English as a Second Language (ESL) learners. To present language learners with texts that are suitable to their level of English, a set of features that can describe the phonological, morphological, lexical, syntactic, discursive, and psychological complexity of a given text were identified. Using a corpus of 6171 texts, which had already been classified into three different levels of difficulty by ESL experts, different experiments were conducted with five machine learning algorithms. The results showed that the adopted linguistic features provide a good overall classification performance (F-Score = 0.97). A scalability evaluation was conducted to test if such a classifier could be used within real applications, where it can be, for example, plugged into a search engine or a web-scraping module. In this evaluation, the texts in the test set are not only different from those from the training set but also of different types (ESL texts vs. children reading texts). Although the overall performance of the classifier decreased significantly (F-Score = 0.65), the confusion matrix shows that most of the classification errors are between the classes two and three (the middle-level classes) and that the system has a robust performance in categorizing texts of class one and four. This behavior can be explained by the difference in classification criteria between the two corpora. Hence, the observed results confirm the usability of such a classifier within a real-world application.
Closed Loop Neural-Symbolic Learning via Integrating Neural Perception, Grammar Parsing, and Symbolic Reasoning
Li, Qing, Huang, Siyuan, Hong, Yining, Chen, Yixin, Wu, Ying Nian, Zhu, Song-Chun
The goal of neural-symbolic computation is to integrate the connectionist and symbolist paradigms. Prior methods learn the neural-symbolic models using reinforcement learning (RL) approaches, which ignore the error propagation in the symbolic reasoning module and thus converge slowly with sparse rewards. In this paper, we address these issues and close the loop of neural-symbolic learning by (1) introducing the \textbf{grammar} model as a \textit{symbolic prior} to bridge neural perception and symbolic reasoning, and (2) proposing a novel \textbf{back-search} algorithm which mimics the top-down human-like learning procedure to propagate the error through the symbolic reasoning module efficiently. We further interpret the proposed learning framework as maximum likelihood estimation using Markov chain Monte Carlo sampling and the back-search algorithm as a Metropolis-Hastings sampler. The experiments are conducted on two weakly-supervised neural-symbolic tasks: (1) handwritten formula recognition on the newly introduced HWF dataset; (2) visual question answering on the CLEVR dataset. The results show that our approach significantly outperforms the RL methods in terms of performance, converging speed, and data efficiency. Our code and data are released at \url{https://liqing-ustc.github.io/NGS}.
Parts of Speech Tagging in NLP: Runtime Optimization with Quantum Formulation and ZX Calculus
Bishwas, Arit Kumar, Mani, Ashish, Palade, Vasile
Many organizations are claiming their stacks in this space [1][2][3][4]. In today's world, the available quantum computers are at very early stages and not capable of handling complex quantum artificial intelligence/machine learning (qAI/qML) tasks [5]. But we still can harness their properties to run some of our quantum AI/ML algorithms more efficiently. In this sense, we can use the "Noisy Intermediate Scale Quantum Systems" (NISQ) [6] to serve the purpose. We can run the less complex quantum subroutines of a big qAI/qML in these kinds of quantum computers and use the results in the main qAI/qML problem-solving pipeline. This way we create a classical-quantum hybrid problem-solving ecosystem in AI/ML space.
TaBERT: A new model for understanding queries over tabular data
TaBERT is the first model that has been pretrained to learn representations for both natural language sentences and tabular data. These sorts of representations are useful for natural language understanding tasks that involve joint reasoning over natural language sentences and tables. A representative example is semantic parsing over databases, where a natural language question (e.g., "Which country has the highest GDP?") is mapped to a program executable over database (DB) tables. This is the first pretraining approach across structured and unstructured domains, and it opens new possibilities regarding semantic parsing, where one of the key challenges has been understanding the structure of a DB table and how it aligns with a query. TaBERT has been trained using a corpus of 26 million tables and their associated English sentences.
Intelligent requirements engineering from natural language and their chaining toward CAD models
Fougères, Alain-Jérôme, Ostrosi, Egon
This paper assumes that design language plays an important role in how designers design and on the creativity of designers. Designers use and develop models as an aid to thinking, a focus for discussion and decision-making and a means of evaluating the reliability of the proposals. This paper proposes an intelligent method for requirements engineering from natural language and their chaining toward CAD models. The transition from linguistic analysis to the representation of engineering requirements consists of the translation of the syntactic structure into semantic form represented by conceptual graphs. Based on the isomorphism between conceptual graphs and predicate logic, a formal language of the specification is proposed. The outcome of this language is chained and translated in Computer Aided Three-Dimensional Interactive Application (CATIA) models. The tool (EGEON: Engineering desiGn sEmantics elabOration and applicatioN) is developed to represent the semantic network of engineering requirements. A case study on the design of a car door hinge is presented to illustrates the proposed method.
Syn-QG: Syntactic and Shallow Semantic Rules for Question Generation
Dhole, Kaustubh D., Manning, Christopher D.
Question Generation (QG) is fundamentally a simple syntactic transformation; however, many aspects of semantics influence what questions are good to form. We implement this observation by developing Syn-QG, a set of transparent syntactic rules leveraging universal dependencies, shallow semantic parsing, lexical resources, and custom rules which transform declarative sentences into question-answer pairs. We utilize PropBank argument descriptions and VerbNet state predicates to incorporate shallow semantic content, which helps generate questions of a descriptive nature and produce inferential and semantically richer questions than existing systems. In order to improve syntactic fluency and eliminate grammatically incorrect questions, we employ back-translation over the output of these syntactic rules. A set of crowd-sourced evaluations shows that our system can generate a larger number of highly grammatical and relevant questions than previous QG systems and that back-translation drastically improves grammaticality at a slight cost of generating irrelevant questions.
Roadmap to Natural Language Processing (NLP)
Natural Language Processing (NLP) is the area of research in Artificial Intelligence focused on processing and using Text and Speech data to create smart machines and create insights. One of nowadays most interesting NLP application is creating machines able to discuss with humans about complex topics. IBM Project Debater represents so far one of the most successful approaches in this area. All of these preprocessing techniques can be easily applied to different types of texts using standard Python NLP libraries such as NLTK and Spacy. Additionally, in order to extrapolate the language syntax and structure of our text, we can make use of techniques such as Parts of Speech (POS) Tagging and Shallow Parsing (Figure 1).
The First Shared Task on Discourse Representation Structure Parsing
Abzianidze, Lasha, van Noord, Rik, Haagsma, Hessel, Bos, Johan
The paper presents the IWCS 2019 shared task on semantic parsing where the goal is to produce Discourse Representation Structures (DRSs) for English sentences. DRSs originate from Discourse Representation Theory and represent scoped meaning representations that capture the semantics of negation, modals, quantification, and presupposition triggers. Additionally, concepts and event-participants in DRSs are described with WordNet synsets and the thematic roles from VerbNet. To measure similarity between two DRSs, they are represented in a clausal form, i.e. as a set of tuples. Participant systems were expected to produce DRSs in this clausal form. Taking into account the rich lexical information, explicit scope marking, a high number of shared variables among clauses, and highly-constrained format of valid DRSs, all these makes the DRS parsing a challenging NLP task. The results of the shared task displayed improvements over the existing state-of-the-art parser.
A Complex KBQA System using Multiple Reasoning Paths
Qin, Kechen, Wang, Yu, Li, Cheng, Gunaratna, Kalpa, Jin, Hongxia, Pavlu, Virgil, Aslam, Javed A.
Multi-hop knowledge based question answering (KBQA) is a complex task for natural language understanding. Many KBQA approaches have been proposed in recent years, and most of them are trained based on labeled reasoning path. This hinders the system's performance as many correct reasoning paths are not labeled as ground truth, and thus they cannot be learned. In this paper, we introduce an end-to-end KBQA system which can leverage multiple reasoning paths' information and only requires labeled answer as supervision. We conduct experiments on several benchmark datasets containing both single-hop simple questions as well as muti-hop complex questions, including WebQuestionSP (WQSP), ComplexWebQuestion-1.1 (CWQ), and PathQuestion-Large (PQL), and demonstrate strong performance.