Goto

Collaborating Authors

 Grammars & Parsing


Controlled Text Generation with Natural Language Instructions

arXiv.org Artificial Intelligence

Large language models generate fluent texts and can follow natural language instructions to solve a wide range of tasks without task-specific training. Nevertheless, it is notoriously difficult to control their generation to satisfy the various constraints required by different applications. In this work, we present InstructCTG, a controlled text generation framework that incorporates different constraints by conditioning on natural language descriptions and demonstrations of the constraints. In particular, we first extract the underlying constraints of natural texts through a combination of off-the-shelf NLP tools and simple heuristics. We then verbalize the constraints into natural language instructions to form weakly supervised training data. By prepending natural language descriptions of the constraints and a few demonstrations, we fine-tune a pre-trained language model to incorporate various types of constraints. Compared to existing search-based or score-based methods, InstructCTG is more flexible to different constraint types and has a much smaller impact on the generation quality and speed because it does not modify the decoding procedure. Additionally, InstructCTG allows the model to adapt to new constraints without re-training through the use of few-shot task generalization and in-context learning abilities of instruction-tuned language models.


Privacy-Preserving Domain Adaptation of Semantic Parsers

arXiv.org Artificial Intelligence

To mitigate that problem, Differentially Private In task-oriented dialogue systems, such as Siri and (DP) training algorithms, such as DP-SGD (Abadi Alexa, a software agent parses a user's intent into et al., 2016; Dwork et al., 2006), can be used to a program, executes it and then communicates the provide worst-case guarantees on the information results back to the user (Andreas et al., 2020; Li leakage of a trained model. This guarantee is et al., 2022; Cheng et al., 2020; Gupta et al., 2018; controlled by the privacy budget ϵ, where lower Young et al., 2013). As a result of their growing epsilon means higher privacy. But while DP-SGD popularity, these systems face an increasing could be used to adapt (fine-tune) a semantic parser demand to improve their linguistic coverage (How on unannotated private data, there is a limit to what do users talk?) as well as functional coverage can be done in this way. Even if some users are (What are users trying to do?). An input utterance asking the system to hop up and down, fine-tuning to such a system could look like this: "Could you is unlikely to make it grow legs.


A Systematic Review of Automated Query Reformulations in Source Code Search

arXiv.org Artificial Intelligence

Fixing software bugs and adding new features are two of the major maintenance tasks. Software bugs and features are reported as change requests. Developers consult these requests and often choose a few keywords from them as an ad hoc query. Then they execute the query with a search engine to find the exact locations within software code that need to be changed. Unfortunately, even experienced developers often fail to choose appropriate queries, which leads to costly trials and errors during a code search. Over the years, many studies attempt to reformulate the ad hoc queries from developers to support them. In this systematic literature review, we carefully select 70 primary studies on query reformulations from 2,970 candidate studies, perform an in-depth qualitative analysis (e.g., Grounded Theory), and then answer seven research questions with major findings. First, to date, eight major methodologies (e.g., term weighting, term co-occurrence analysis, thesaurus lookup) have been adopted to reformulate queries. Second, the existing studies suffer from several major limitations (e.g., lack of generalizability, vocabulary mismatch problem, subjective bias) that might prevent their wide adoption. Finally, we discuss the best practices and future opportunities to advance the state of research in search query reformulations.


XSemPLR: Cross-Lingual Semantic Parsing in Multiple Natural Languages and Meaning Representations

arXiv.org Artificial Intelligence

Cross-Lingual Semantic Parsing (CLSP) aims to translate queries in multiple natural languages (NLs) into meaning representations (MRs) such as SQL, lambda calculus, and logic forms. However, existing CLSP models are separately proposed and evaluated on datasets of limited tasks and applications, impeding a comprehensive and unified evaluation of CLSP on a diverse range of NLs and MRs. To this end, we present XSemPLR, a unified benchmark for cross-lingual semantic parsing featured with 22 natural languages and 8 meaning representations by examining and selecting 9 existing datasets to cover 5 tasks and 164 domains. We use XSemPLR to conduct a comprehensive benchmark study on a wide range of multilingual language models including encoder-based models (mBERT, XLM-R), encoder-decoder models (mBART, mT5), and decoder-based models (Codex, BLOOM). We design 6 experiment settings covering various lingual combinations (monolingual, multilingual, cross-lingual) and numbers of learning samples (full dataset, few-shot, and zero-shot). Our experiments show that encoder-decoder models (mT5) achieve the highest performance compared with other popular models, and multilingual training can further improve the average performance. Notably, multilingual large language models (e.g., BLOOM) are still inadequate to perform CLSP tasks. We also find that the performance gap between monolingual training and cross-lingual transfer learning is still significant for multilingual models, though it can be mitigated by cross-lingual few-shot training. Our dataset and code are available at https://github.com/psunlpgroup/XSemPLR.


A Novel Approach To User Agent String Parsing For Vulnerability Analysis Using Mutli-Headed Attention

arXiv.org Artificial Intelligence

The increasing reliance on the internet has led to the proliferation of a diverse set of web-browsers and operating systems (OSs) capable of browsing the web. User agent strings (UASs) are a component of web browsing that are transmitted with every Hypertext Transfer Protocol (HTTP) request. They contain information about the client device and software, which is used by web servers for various purposes such as content negotiation and security. However, due to the proliferation of various browsers and devices, parsing UASs is a non-trivial task due to a lack of standardization of UAS formats. Current rules-based approaches are often brittle and can fail when encountering such non-standard formats. In this work, a novel methodology for parsing UASs using Multi-Headed Attention Based transformers is proposed. The proposed methodology exhibits strong performance in parsing a variety of UASs with differing formats. Furthermore, a framework to utilize parsed UASs to estimate the vulnerability scores for large sections of publicly visible IT networks or regions is also discussed. The methodology present here can also be easily extended or deployed for real-time parsing of logs in enterprise settings.


Convergence and Diversity in the Control Hierarchy

arXiv.org Artificial Intelligence

Weir has defined a hierarchy of language classes whose second member ($\mathcal{L}_2$) is generated by tree-adjoining grammars (TAG), linear indexed grammars (LIG), combinatory categorial grammars, and head grammars. The hierarchy is obtained using the mechanism of control, and $\mathcal{L}_2$ is obtained using a context-free grammar (CFG) whose derivations are controlled by another CFG. We adapt Weir's definition of a controllable CFG to give a definition of controllable pushdown automata (PDAs). This yields three new characterizations of $\mathcal{L}_2$ as the class of languages generated by PDAs controlling PDAs, PDAs controlling CFGs, and CFGs controlling PDAs. We show that these four formalisms are not only weakly equivalent but equivalent in a stricter sense that we call d-weak equivalence. Furthermore, using an even stricter notion of equivalence called d-strong equivalence, we make precise the intuition that a CFG controlling a CFG is a TAG, a PDA controlling a PDA is an embedded PDA, and a PDA controlling a CFG is a LIG. The fourth member of this family, a CFG controlling a PDA, does not correspond to any formalism we know of, so we invent one and call it a Pushdown Adjoining Automaton.


A logical word embedding for learning grammar

arXiv.org Artificial Intelligence

We introduce the logical grammar emdebbing (LGE), a model inspired by pregroup grammars and categorial grammars to enable unsupervised inference of lexical categories and syntactic rules from a corpus of text. LGE produces comprehensible output summarizing its inferences, has a completely transparent process for producing novel sentences, and can learn from as few as a hundred sentences.


Learning to Substitute Spans towards Improving Compositional Generalization

arXiv.org Artificial Intelligence

Despite the rising prevalence of neural sequence models, recent empirical evidences suggest their deficiency in compositional generalization. One of the current de-facto solutions to this problem is compositional data augmentation, aiming to incur additional compositional inductive bias. Nonetheless, the improvement offered by existing handcrafted augmentation strategies is limited when successful systematic generalization of neural sequence models requires multi-grained compositional bias (i.e., not limited to either lexical or structural biases only) or differentiation of training sequences in an imbalanced difficulty distribution. To address the two challenges, we first propose a novel compositional augmentation strategy dubbed \textbf{Span} \textbf{Sub}stitution (SpanSub) that enables multi-grained composition of substantial substructures in the whole training set. Over and above that, we introduce the \textbf{L}earning \textbf{to} \textbf{S}ubstitute \textbf{S}pan (L2S2) framework which empowers the learning of span substitution probabilities in SpanSub in an end-to-end manner by maximizing the loss of neural sequence models, so as to outweigh those challenging compositions with elusive concepts and novel surroundings. Our empirical results on three standard compositional generalization benchmarks, including SCAN, COGS and GeoQuery (with an improvement of at most 66.5\%, 10.3\%, 1.2\%, respectively), demonstrate the superiority of SpanSub, %the learning framework L2S2 and their combination.


Enhancing Language Representation with Constructional Information for Natural Language Understanding

arXiv.org Artificial Intelligence

Natural language understanding (NLU) is an essential branch of natural language processing, which relies on representations generated by pre-trained language models (PLMs). However, PLMs primarily focus on acquiring lexico-semantic information, while they may be unable to adequately handle the meaning of constructions. To address this issue, we introduce construction grammar (CxG), which highlights the pairings of form and meaning, to enrich language representation. We adopt usage-based construction grammar as the basis of our work, which is highly compatible with statistical models such as PLMs. Then a HyCxG framework is proposed to enhance language representation through a three-stage solution. First, all constructions are extracted from sentences via a slot-constraints approach. As constructions can overlap with each other, bringing redundancy and imbalance, we formulate the conditional max coverage problem for selecting the discriminative constructions. Finally, we propose a relational hypergraph attention network to acquire representation from constructional information by capturing high-order word interactions among constructions. Extensive experiments demonstrate the superiority of the proposed model on a variety of NLU tasks.


MCTS: A Multi-Reference Chinese Text Simplification Dataset

arXiv.org Artificial Intelligence

Text simplification aims to make the text easier to understand by applying rewriting transformations. There has been very little research on Chinese text simplification for a long time. The lack of generic evaluation data is an essential reason for this phenomenon. In this paper, we introduce MCTS, a multi-reference Chinese text simplification dataset. We describe the annotation process of the dataset and provide a detailed analysis of it. Furthermore, we evaluate the performance of some unsupervised methods and advanced large language models. We hope to build a basic understanding of Chinese text simplification through the foundational work and provide references for future research. We release our data at https://github.com/blcuicall/mcts.