nl sentence
TAPEX: Table Pre-training via Learning a Neural SQL Executor
Liu, Qian, Chen, Bei, Guo, Jiaqi, Lin, Zeqi, Lou, Jian-guang
In recent years, pre-trained language models have been successful at modeling natural language sentences and (semi-)structured tables. However, existing table pre-training techniques suffer from low data quality and low pre-training efficiency. In this paper, we show that table pre-training can be realized by learning a neural SQL executor over a synthetic corpus, which is obtained by automatically synthesizing executable SQL queries. By pre-training on the synthetic corpus, our approach TAPEX dramatically improves performance on downstream tasks, boosting existing language models by up to 19.5%. Meanwhile, TAPEX has remarkably high pre-training efficiency and yields strong results even with a small pre-training corpus. Experimental results demonstrate that TAPEX outperforms previous table pre-training approaches by a large margin, and our model achieves new state-of-the-art results on four well-known datasets: it improves the WikiSQL denotation accuracy to 89.6% (+4.9%), the WikiTableQuestions denotation accuracy to 57.5% (+4.8%), the SQA denotation accuracy to 74.5% (+3.5%), and the TabFact accuracy to 84.6% (+3.6%). Our work opens the way to reasoning over structured data by pre-training on synthetic executable programs.
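The corpus synthesis described in this abstract can be illustrated with a minimal sketch: fill SQL templates from a table, execute each query with a real SQL engine, and keep the (query plus flattened table, execution result) pair as a training example for a sequence-to-sequence model. Everything below is an assumption made for illustration and is not taken from the paper: the toy table, the two templates, the `col : ... row 1 : ...` linearization, and the helper names `flatten_table`, `sql_literal`, and `synthesize_examples`.

```python
import random
import sqlite3

# Hypothetical toy table; column names and values are illustrative only.
COLUMNS = ["city", "country", "population"]
ROWS = [
    ("Oslo", "Norway", 700000),
    ("Bergen", "Norway", 285000),
    ("Lyon", "France", 520000),
]

# Assumed query templates; the templates used by the authors are not given here.
TEMPLATES = [
    "SELECT {col} FROM t WHERE {cond_col} = {value}",
    "SELECT COUNT(*) FROM t WHERE {cond_col} = {value}",
]


def flatten_table(columns, rows):
    """Linearize a table into a single string (an assumed format)."""
    header = "col : " + " | ".join(columns)
    body = " ".join(
        f"row {i} : " + " | ".join(str(v) for v in row)
        for i, row in enumerate(rows, start=1)
    )
    return f"{header} {body}"


def sql_literal(value):
    """Render a Python value as a SQL literal, quoting strings."""
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    return str(value)


def synthesize_examples(columns, rows, n, seed=0):
    """Build n (source, target) pairs by executing sampled queries in SQLite."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE t ({', '.join(columns)})")
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(f"INSERT INTO t VALUES ({placeholders})", rows)

    flat = flatten_table(columns, rows)
    examples = []
    for _ in range(n):
        template = rng.choice(TEMPLATES)
        cond_idx = rng.randrange(len(columns))
        # Take the condition value from an existing row so the result is non-empty.
        value = rng.choice(rows)[cond_idx]
        query = template.format(
            col=rng.choice(columns),
            cond_col=columns[cond_idx],
            value=sql_literal(value),
        )
        result = conn.execute(query).fetchall()
        answer = ", ".join(str(v) for (v,) in result)
        # The model input is the query followed by the table; the target is the result.
        examples.append({"source": f"{query} {flat}", "target": answer})
    conn.close()
    return examples


if __name__ == "__main__":
    for example in synthesize_examples(COLUMNS, ROWS, n=3):
        print(example["source"])
        print("->", example["target"])
```

In this sketch the SQL engine supplies the supervision, so no human annotation is needed; a model trained on such pairs would have to imitate the executor to predict the target.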
An Extended GHKM Algorithm for Inducing Lambda-SCFG
Li, Peng (Tsinghua University, China) | Liu, Yang | Sun, Maosong
Semantic parsing, which aims at mapping a natural language (NL) sentence into its formal meaning representation (e.g., logical form), has received increasing attention in recent years. While synchronous context-free grammar (SCFG) augmented with lambda calculus (lambda-SCFG) provides an effective mechanism for semantic parsing, learning such lambda-SCFG rules remains a challenge because of the difficulty in determining the correspondence between NL sentences and logical forms. To alleviate this structural divergence problem, we extend the GHKM algorithm, a state-of-the-art algorithm for learning synchronous grammars in statistical machine translation, to induce lambda-SCFG from pairs of NL sentences and logical forms. By treating logical forms as trees, we reformulate the theory behind GHKM that gives formal semantics to the alignment between NL words and logical form tokens. Experiments on the GEOQUERY dataset show that our semantic parser achieves an F-measure of 90.2%, the best result published to date.
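The step of treating logical forms as trees can be pictured with a small sketch. It is not the extended GHKM algorithm itself and covers only the preprocessing idea: a logical form written in a simple prefix notation is parsed into a nested (label, children) tree whose leaves and internal nodes could then be aligned with NL words. The prefix notation, the example form, and the function names `parse_logical_form` and `leaves` are assumptions for illustration; the paper's own representation uses lambda calculus.

```python
import re

# Tokens are parentheses, commas, or maximal runs of other non-space characters.
TOKEN_RE = re.compile(r"[(),]|[^\s(),]+")


def parse_logical_form(text):
    """Parse a prefix-notation logical form into a (label, children) tree."""
    tokens = TOKEN_RE.findall(text)
    pos = 0

    def parse_node():
        nonlocal pos
        label = tokens[pos]
        pos += 1
        children = []
        if pos < len(tokens) and tokens[pos] == "(":
            pos += 1  # consume "("
            while tokens[pos] != ")":
                if tokens[pos] == ",":
                    pos += 1  # skip argument separator
                    continue
                children.append(parse_node())
            pos += 1  # consume ")"
        return (label, children)

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens in logical form")
    return tree


def leaves(tree):
    """Return the leaf labels of a tree from left to right."""
    label, children = tree
    if not children:
        return [label]
    return [leaf for child in children for leaf in leaves(child)]


if __name__ == "__main__":
    # Hypothetical logical form for "what is the capital of texas".
    tree = parse_logical_form("answer(capital(stateid(texas)))")
    print(tree)
    print(leaves(tree))
```

Once a logical form is a tree, tree-to-string rule extraction methods from machine translation have a structure to operate on, which is the role the tree view plays in the abstract above. The sketch assumes balanced parentheses and does no error recovery.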
Training a Multilingual Sportscaster: Using Perceptual Context to Learn Language
Chen, D. L., Kim, J., Mooney, R. J.
We present a novel framework for learning to interpret and generate language using only perceptual context as supervision. We demonstrate its capabilities by developing a system that learns to sportscast simulated robot soccer games in both English and Korean without any language-specific prior knowledge. Training employs only ambiguous supervision consisting of a stream of descriptive textual comments and a sequence of events extracted from the simulation trace. The system simultaneously establishes correspondences between individual comments and the events that they describe while building a translation model that supports both parsing and generation. We also present a novel algorithm for learning which events are worth describing. Human evaluations of the generated commentaries indicate they are of reasonable quality and in some cases even on par with those produced by humans for our limited domain.