Grammars & Parsing
SeSQL: Yet Another Large-scale Session-level Chinese Text-to-SQL Dataset
Huang, Saihao, Wang, Lijie, Li, Zhenghua, Liu, Zeyang, Dou, Chenhui, Yan, Fukang, Xiao, Xinyan, Wu, Hua, Zhang, Min
As the first session-level Chinese dataset, CHASE contains two separate parts, i.e., 2,003 sessions manually constructed from scratch (CHASE-C), and 3,456 sessions translated from English SParC (CHASE-T). We find the two parts are highly discrepant and incompatible as training and evaluation data. In this work, we present SeSQL, yet another large-scale session-level text-to-SQL dataset in Chinese, consisting of 5,028 sessions all manually constructed from scratch. To guarantee data quality, we adopt an iterative annotation workflow to facilitate intense and timely review of previous-round natural language (NL) questions and SQL queries. Moreover, by completing all context-dependent NL questions, we obtain 27,012 context-independent question/SQL pairs, allowing SeSQL to be used as the largest dataset for single-round multi-DB text-to-SQL parsing. We conduct benchmark session-level text-to-SQL parsing experiments on SeSQL by employing three competitive session-level parsers, and present detailed analysis.
Learning and Compositionality: a Unification Attempt via Connectionist Probabilistic Programming
We consider learning and compositionality as the key mechanisms towards simulating human-like intelligence. While each mechanism is successfully achieved by neural networks and symbolic AIs, respectively, it is the combination of the two mechanisms that makes human-like intelligence possible. Despite the numerous attempts at building hybrid neural-symbolic systems, we argue that our true goal should be unifying learning and compositionality, the core mechanisms, instead of neural and symbolic methods, the surface approaches to achieve them. In this work, we review and analyze the strengths and weaknesses of neural and symbolic methods by separating their forms and meanings (structures and semantics), and propose Connectionist Probabilistic Programs (CPPs), a framework that connects connectionist structures (for learning) and probabilistic program semantics (for compositionality). Under the framework, we design a CPP extension for small-scale sequence modeling and provide a learning algorithm based on Bayesian inference. Although challenges exist in learning complex patterns without supervision, our early results demonstrate CPP's successful extraction of concepts and relations from raw sequential data, an initial step towards compositional learning.
g2pW: A Conditional Weighted Softmax BERT for Polyphone Disambiguation in Mandarin
Chen, Yi-Chang, Chang, Yu-Chuan, Chang, Yen-Cheng, Yeh, Yi-Ren
Polyphone disambiguation is the most crucial task in Mandarin grapheme-to-phoneme (g2p) conversion. Previous studies have approached this problem using pre-trained language models, restricted output, and extra information from Part-Of-Speech (POS) tagging. Inspired by these strategies, we propose a novel approach, called g2pW, which adapts learnable softmax weights to condition the outputs of BERT on the polyphonic character of interest and its POS tag. Rather than using the hard mask as in previous works, our experiments show that learning a soft-weighting function for the candidate phonemes benefits performance. In addition, our proposed g2pW does not require extra pre-trained POS tagging models while using POS tags as auxiliary features, since we train the POS tagging model simultaneously with the unified encoder. Experimental results show that our g2pW outperforms existing methods on the public CPP dataset. All code, model weights, and a user-friendly package are publicly available.
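The contrast between a hard mask and a soft weighting over candidate phonemes can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so the form used here, p_i proportional to w_i * exp(z_i), is an assumption, and the function names `weighted_softmax` and `hard_mask_softmax` are hypothetical. In g2pW the weights would be learned and conditioned on the character and its POS tag; here they are fixed numbers chosen only for illustration.

```python
import math


def weighted_softmax(logits, weights):
    """Softmax in which each class score is scaled by a non-negative weight.

    Assumed form: p_i = w_i * exp(z_i) / sum_j w_j * exp(z_j).
    At least one weight must be positive, otherwise the sum is zero.
    """
    peak = max(logits)  # subtract the maximum for numerical stability
    scores = [w * math.exp(z - peak) for z, w in zip(logits, weights)]
    total = sum(scores)
    return [s / total for s in scores]


def hard_mask_softmax(logits, candidate_mask):
    """Hard masking as the special case where every weight is 0 or 1."""
    weights = [1.0 if keep else 0.0 for keep in candidate_mask]
    return weighted_softmax(logits, weights)


# Illustrative logits over three phoneme classes (made-up numbers).
logits = [2.0, 1.5, 0.5]
hard = hard_mask_softmax(logits, [True, False, True])
soft = weighted_softmax(logits, [0.9, 0.1, 0.8])
```

With the hard mask, the excluded phoneme receives exactly zero probability. With soft weights it keeps a small, non-zero share.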
Parsing the Results of a Chaotic New York Primary
Four years ago, Alexandria Ocasio-Cortez's shock victory in a low-turnout midterm primary election in New York changed the shape of American politics. On Tuesday, the state held low-turnout midterm primaries with no such results. Instead, what the most-watched races offered was the latest glimpse of the ongoing fight between progressive insurgents and Democratic Party loyalists in New York. Loyalists claimed the day's biggest victories, thanks in large part to the state's new political maps--a consequence of a 2020 redistricting process that some of those same Party loyalists, led by then Governor Andrew Cuomo, botched so badly that a state judge ultimately outsourced the job to a postdoctoral fellow at Carnegie Mellon. The race that got the most attention, and which had the closest outcome, was for an open seat in the newly redrawn Tenth Congressional District, where the attorney Dan Goldman--who served as the House Democrats' lawyer during Donald Trump's first impeachment--squeaked out a victory in a crowded field.
Computational valency lexica and Homeric formularity
McGillivray, Barbara, Rodda, Martina Astrid
Distributional semantics, the quantitative study of meaning variation and change through corpus collocations, is currently one of the most productive research areas in computational linguistics. The wider availability of big data and of reproducible algorithms for analysis has boosted its application to living languages in recent years. But can we use distributional semantics to study a language with such a limited corpus as ancient Greek? And can this approach tell us something about such vexed questions in classical studies as the language and composition of the Homeric poems? Our paper will compare the semantic flexibility of formulae involving transitive verbs in archaic Greek epic to similar verb phrases in a non-formulaic corpus, in order to detect unique patterns of variation in formulae. To address this, we present AGVaLex, a computational valency lexicon for ancient Greek automatically extracted from the Ancient Greek Dependency Treebank. The lexicon contains quantitative corpus-driven morphological, syntactic and lexical information about verbs and their arguments, such as objects, subjects, and prepositional phrases, and has a wide range of applications for the study of the language of ancient Greek authors.
Recent Advances in Text-to-SQL: A Survey of What We Have and What We Expect
Deng, Naihao, Chen, Yulong, Zhang, Yue
Text-to-SQL has attracted attention from both the natural language processing and database communities because of its ability to convert the semantics in natural language into SQL queries and its practical application in building natural language interfaces to database systems. The major challenges in text-to-SQL lie in encoding the meaning of natural utterances, decoding to SQL queries, and translating the semantics between these two forms. These challenges have been addressed to different extents by the recent advances. However, there is still a lack of comprehensive surveys for this task. To this end, we review recent progress on text-to-SQL for datasets, methods, and evaluation and provide this systematic survey, addressing the aforementioned challenges and discussing potential future directions. We hope that this survey can serve as quick access to existing work and motivate future research.
Universal Caching
In learning theory, the performance of an online policy is commonly measured in terms of the static regret metric, which compares the cumulative loss of an online policy to that of an optimal benchmark in hindsight. In the definition of static regret, the action of the benchmark policy remains fixed throughout the time horizon. Naturally, the resulting regret bounds become loose in non-stationary settings where fixed actions often suffer from poor performance. In this paper, we investigate a stronger notion of regret minimization in the context of online caching. In particular, we allow the action of the benchmark at any round to be decided by a finite state machine containing any number of states. Popular caching policies, such as LRU and FIFO, belong to this class. Using ideas from the universal prediction literature in information theory, we propose an efficient online caching policy with a sub-linear regret bound. To the best of our knowledge, this is the first data-dependent regret bound known for the caching problem in the universal setting. We establish this result by combining a recently-proposed online caching policy with an incremental parsing algorithm, namely Lempel-Ziv '78. Our methods also yield a simpler learning-theoretic proof of the improved regret bound as opposed to the involved problem-specific combinatorial arguments used in the earlier works.
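As an illustration of the incremental parsing step only, the following minimal sketch splits a request sequence into Lempel-Ziv '78 phrases: each phrase is the shortest prefix of the remaining input that has not yet appeared as a phrase. The function name `lz78_parse` and the example request sequence are hypothetical, and the sketch does not reproduce how the paper combines the parsing with a caching policy.

```python
def lz78_parse(sequence):
    """Split a sequence into LZ78 phrases.

    Each phrase extends the current buffer one symbol at a time until the
    buffer is a phrase not seen before, at which point it is emitted.
    """
    seen = set()
    phrases = []
    current = ()
    for symbol in sequence:
        current = current + (symbol,)
        if current not in seen:
            seen.add(current)
            phrases.append(current)
            current = ()
    if current:
        # The input ended inside a phrase that was already seen.
        phrases.append(current)
    return phrases


# Hypothetical stream of file requests, one character per requested file.
phrases = lz78_parse("abababa")
```

For the request stream above, the parse yields the phrases a, b, ab, aba. The number of distinct phrases grows sub-linearly with the sequence length, which is the property that universal prediction arguments typically build on.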
Judge a Sentence by Its Content to Generate Grammatical Errors
Data sparsity is a well-known problem for grammatical error correction (GEC). Generating synthetic training data is one widely proposed solution to this problem, and has allowed models to achieve state-of-the-art (SOTA) performance in recent years. However, these methods often generate unrealistic errors, or aim to generate sentences with only one error. We propose a learning-based two-stage method for synthetic data generation for GEC that relaxes this constraint on sentences containing only one error. Errors are generated in accordance with sentence merit. We show that a GEC model trained on our synthetically generated corpus outperforms models trained on synthetic data from prior work.
Effective Transfer Learning for Low-Resource Natural Language Understanding
Natural language understanding (NLU) is the task of semantic decoding of human languages by machines. NLU models rely heavily on large training data to ensure good performance. However, many languages and domains have very few data resources and domain experts. It is necessary to overcome the data scarcity challenge when very few or even zero training samples are available. In this thesis, we focus on developing cross-lingual and cross-domain methods to tackle the low-resource issues. First, we propose to improve the model's cross-lingual ability by focusing on the task-related keywords, enhancing the model's robustness and regularizing the representations. We find that the representations for low-resource languages can be easily and greatly improved by focusing on just the keywords. Second, we present Order-Reduced Modeling methods for cross-lingual adaptation, and find that modeling partial word orders instead of the whole sequence can improve the robustness of the model against word order differences between languages and improve task knowledge transfer to low-resource languages. Third, we propose to leverage different levels of domain-related corpora and additional masking of data in pre-training for cross-domain adaptation, and discover that more challenging pre-training can better address the domain discrepancy issue in task knowledge transfer. Finally, we introduce a coarse-to-fine framework, Coach, and a cross-lingual and cross-domain parsing framework, X2Parser. Coach decomposes the representation learning process into coarse-grained and fine-grained feature learning, and X2Parser simplifies the hierarchical task structures into flattened ones. We observe that simplifying task structures makes the representation learning more effective for low-resource languages and domains.
Gender Bias and Universal Substitution Adversarial Attacks on Grammatical Error Correction Systems for Automated Assessment
Grammatical Error Correction (GEC) systems perform a sequence-to-sequence task, where an input word sequence containing grammatical errors is corrected by the GEC system to output a grammatically correct word sequence. With the advent of deep learning methods, automated GEC systems have become increasingly popular. For example, GEC systems are often used on speech transcriptions of English learners as a form of assessment and feedback: these powerful GEC systems can be used to automatically measure an aspect of a candidate's fluency. The count of edits from a candidate's input sentence (or essay) to a GEC system's grammatically corrected output sentence is indicative of a candidate's language ability, where fewer edits suggest better fluency. The count of edits can thus be viewed as a fluency score, with zero implying perfect fluency. However, although deep learning based GEC systems are extremely powerful and accurate, they are susceptible to adversarial attacks: an adversary can introduce a small, specific change at the input of a system that causes a large, undesired change at the output. When considering the application of GEC systems to automated language assessment, the aim of an adversary could be to cheat by making a small change to a grammatically incorrect input sentence that conceals the errors from a GEC system, such that no edits are found and the candidate is unjustly awarded a perfect fluency score. This work examines a simple universal substitution adversarial attack that non-native speakers of English could realistically employ to deceive GEC systems used for assessment.
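The edit-count fluency score described above can be sketched as follows. The abstract does not specify how edits are counted, so this example assumes a token-level Levenshtein distance between the candidate's sentence and the GEC output. The function name `edit_count` and the example sentences are hypothetical, and the corrected sentence stands in for the output of a real GEC system.

```python
def edit_count(source_tokens, corrected_tokens):
    """Token-level Levenshtein distance (insertions, deletions, substitutions)."""
    # prev[j] holds the distance between the source prefix processed so far
    # and the first j tokens of the corrected sentence.
    prev = list(range(len(corrected_tokens) + 1))
    for i, src in enumerate(source_tokens, start=1):
        curr = [i]
        for j, tgt in enumerate(corrected_tokens, start=1):
            substitution = prev[j - 1] + (0 if src == tgt else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


# Made-up learner sentence and a stand-in for the GEC system's correction.
learner = "He go to school yesterday".split()
corrected = "He went to school yesterday".split()
score = edit_count(learner, corrected)
```

Under this assumption the learner sentence scores 1, and a sentence the GEC system leaves unchanged scores 0. The attack setting in the abstract corresponds to an erroneous input for which the system nevertheless returns the input unchanged, so the score is 0 despite the errors.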