Grammars & Parsing
A Preliminary Study for Literary Rhyme Generation based on Neuronal Representation, Semantics and Shallow Parsing
Moreno-Jiménez, Luis-Gil, Torres-Moreno, Juan-Manuel, Wedemann, Roseli S.
For many years, research in Artificial Intelligence (AI) has directed efforts towards automating processes to perform specific academic, industrial or economic tasks for society. However, the investigation and development of procedures for the automation of human artistic and creative processes has not had as much attention due to the complexities involved in these activities. Procedures developed for these purposes involve mathematical-computational methods designed to process and learn from a large quantity of digital data, so as to detect patterns in order to simulate the creative process (CP), as explained by Boden in [3]. In this paper, we introduce a model for the generation of rhymes with literary components. Our proposal is based on findings detailed in [11], where Automatic Text Generation (ATG) techniques are combined with neural network (NN) based models, such as the Word2vec algorithm [9], for the generation of literary texts.
Machine Learning and the Challenge of Predicting Fake News
Many Natural Language Processing (NLP) techniques exist for detecting "fake news". Multi-phase algorithms with Determined Decision Trees, Gradient Enlargement, and others have been used by various researchers and organizations with varying results. One study from researchers at Rensselaer Polytechnic Institute reported 83% accuracy in predicting whether a news article is from a reliable or unreliable source [1], while Facebook's 2019 attempt at developing an algorithm failed miserably, with some users experiencing a "maelstrom" of fake news [2]. A new study, published in the November 2021 issue of the Journal of Emerging Technologies and Innovative Research [3], analyzes the efficacy of a wide range of AI models and finds that they generally perform poorly, with accuracy ranging from 60% to 77%. Separating fake news from real news is a challenge even for the most sophisticated AI. Simple content-based programs and shallow part-of-speech (POS) tagging fail to consider contextual information and cannot accurately classify news stories as fact or fake unless combined with more sophisticated algorithms.
ALP: Data Augmentation using Lexicalized PCFGs for Few-Shot Text Classification
Kim, Hazel, Woo, Daecheol, Oh, Seong Joon, Cha, Jeong-Won, Han, Yo-Sub
Data augmentation has been an important ingredient for boosting performances of learned models. Prior data augmentation methods for few-shot text classification have led to great performance boosts. However, they have not been designed to capture the intricate compositional structure of natural language. As a result, they fail to generate samples with plausible and diverse sentence structures. Motivated by this, we present the data Augmentation using Lexicalized Probabilistic context-free grammars (ALP) that generates augmented samples with diverse syntactic structures with plausible grammar. The lexicalized PCFG parse trees consider both the constituents and dependencies to produce a syntactic frame that maximizes a variety of word choices in a syntactically preservable manner without specific domain experts. Experiments on few-shot text classification tasks demonstrate that ALP enhances many state-of-the-art classification methods. As a second contribution, we delve into the train-val splitting methodologies when a data augmentation method comes into play. We argue empirically that the traditional splitting of training and validation sets is sub-optimal compared to our novel augmentation-based splitting strategies that further expand the training split with the same number of labeled data. Taken together, our contributions on the data augmentation strategies yield a strong training recipe for few-shot text classification tasks.
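As a rough illustration of grammar-based augmentation, the sketch below samples sentences from a toy probabilistic context-free grammar. It is a plain PCFG, not the lexicalized variant the abstract describes, and it is not the authors' implementation; the grammar, its probabilities, and the names `GRAMMAR`, `sample` and `augment` are all invented for this example.

```python
import random

# Toy PCFG: each nonterminal maps to (right-hand side, probability) pairs.
# Rules, words and probabilities are hypothetical, chosen only for illustration.
GRAMMAR = {
    "S": [(("NP", "VP"), 1.0)],
    "NP": [(("Det", "N"), 0.7), (("Det", "Adj", "N"), 0.3)],
    "VP": [(("V", "NP"), 0.6), (("V",), 0.4)],
    "Det": [(("the",), 0.5), (("a",), 0.5)],
    "Adj": [(("short",), 0.5), (("useful",), 0.5)],
    "N": [(("review",), 0.5), (("model",), 0.5)],
    "V": [(("helps",), 0.5), (("fails",), 0.5)],
}


def sample(symbol, rng):
    """Expand a symbol top-down, picking each rule according to its probability."""
    if symbol not in GRAMMAR:
        # Anything without a rule is treated as a terminal word.
        return [symbol]
    rules = GRAMMAR[symbol]
    rhs = rng.choices(
        [rule for rule, _ in rules],
        weights=[prob for _, prob in rules],
        k=1,
    )[0]
    words = []
    for child in rhs:
        words.extend(sample(child, rng))
    return words


def augment(n, seed=0):
    """Generate n sentences; a fixed seed makes the output reproducible."""
    rng = random.Random(seed)
    return [" ".join(sample("S", rng)) for _ in range(n)]


if __name__ == "__main__":
    for sentence in augment(5):
        print(sentence)
```

Because every sample follows a different derivation, the generated sentences vary in syntactic structure (with or without an adjective, with or without an object) while remaining grammatical under the toy grammar.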
Two-view Graph Neural Networks for Knowledge Graph Completion
Tong, Vinh, Nguyen, Dai Quoc, Phung, Dinh, Nguyen, Dat Quoc
A knowledge graph (KG) is a network of entity nodes and relationship edges, which can be represented as a collection of triples in the form of (h, r, t), wherein each triple (h, r, t) represents a relation r between a head entity h and a tail entity t. Here, entities are real-world things or objects such as music tracks, movies, persons, organizations, places and the like, while each relation type determines a certain relationship between entities. KGs are used in a number of commercial applications, e.g. in such search engines as Google, Microsoft's Bing and Facebook's Graph search. They also are useful resources for many natural language processing tasks such as co-reference resolution ([1], [2]), semantic parsing ([3], [4]) and question answering ([5], [6]). However, an issue is that KGs are often incomplete, i.e., missing a lot of valid triples [7]. To this end, we propose a new KG embedding model, named WGE, to leverage GNNs to capture entity-focused graph structure and relation-focused graph structure for KG completion. In particular, WGE transforms a given KG into two views. The first view--a single undirected entity-focused graph--only includes entities as nodes to provide the entity neighborhood information. The second view--a single undirected relation-focused graph--considers both entities and relations as nodes, constructed from constraints (subjective relation, predicate entity, objective relation), to attain the potential dependence between two neighborhood relations. Then WGE introduces a new encoder module of adopting two vanilla GNNs directly on these two graph views to better update entity and relation embeddings, followed by the decoder module using a weighted score function. In summary, our contributions are as follows:
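The triple representation (h, r, t) and the first, entity-focused view can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the WGE implementation: the entity and relation names and the helper `entity_view` are hypothetical, and the relation-focused view is not covered because its exact construction is not spelled out in the text above.

```python
from collections import defaultdict

# A KG as a list of (head, relation, tail) triples.
# All entity and relation names here are made up for illustration.
triples = [
    ("track_1", "performed_by", "artist_1"),
    ("artist_1", "signed_to", "label_1"),
    ("track_2", "performed_by", "artist_1"),
]


def entity_view(kg):
    """Build an undirected graph whose nodes are entities only.

    Two entities are neighbours if some triple links them, regardless of
    the direction or type of the relation.
    """
    adjacency = defaultdict(set)
    for head, _relation, tail in kg:
        adjacency[head].add(tail)
        adjacency[tail].add(head)
    return dict(adjacency)


if __name__ == "__main__":
    view = entity_view(triples)
    for entity in sorted(view):
        print(entity, "->", sorted(view[entity]))

    # KG completion asks whether a triple absent from the KG is nonetheless valid.
    candidate = ("track_2", "released_by", "label_1")
    print(candidate in set(triples))
```

A GNN applied to this view would aggregate, for each entity, the embeddings of the neighbours listed in its adjacency set.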
Pay More Attention to History: A Context Modeling Strategy for Conversational Text-to-SQL
Li, Yuntao, Zhang, Hanchu, Li, Yutian, Wang, Sirui, Wu, Wei, Zhang, Yan
Conversational text-to-SQL aims at converting multi-turn natural language queries into their corresponding SQL representations. One of the most intractable problems of conversational text-to-SQL is modeling the semantics of multi-turn queries and gathering the information required for the current query. This paper shows that explicitly modeling the semantic changes introduced by each added turn, together with a summarization of the whole context, improves performance on converting conversational queries into SQL. In particular, we propose two conversational modeling tasks, one at turn grain and one at conversation grain. These two tasks simply work as auxiliary training tasks to help with multi-turn conversational semantic parsing. We conducted empirical studies and achieved new state-of-the-art results on a large-scale open-domain conversational text-to-SQL dataset. The results demonstrate that the proposed mechanism significantly improves the performance of multi-turn semantic parsing.
The Rediscovery Hypothesis: Language Models Need to Meet Linguistics
Nikoulina, Vassilina | Tezekbayev, Maxat (Nazarbayev University) | Kozhakhmet, Nuradil (Nazarbayev University) | Babazhanova, Madina (Nazarbayev University) | Gallé, Matthias (Naver Labs Europe) | Assylbekov, Zhenisbek (Nazarbayev University)
There is an ongoing debate in the NLP community whether modern language models contain linguistic knowledge, recovered through so-called probes. In this paper, we study whether linguistic knowledge is a necessary condition for the good performance of modern language models, which we call the rediscovery hypothesis. In the first place, we show that language models that are significantly compressed but perform well on their pretraining objectives retain good scores when probed for linguistic structures. This result supports the rediscovery hypothesis and leads to the second contribution of our paper: an information-theoretic framework that relates language modeling objectives with linguistic information. This framework also provides a metric to measure the impact of linguistic information on the word prediction task. We reinforce our analytical results with various experiments, both on synthetic and on real NLP tasks in English.
Interscript: A dataset for interactive learning of scripts through error feedback
Tandon, Niket, Madaan, Aman, Clark, Peter, Sakaguchi, Keisuke, Yang, Yiming
How can an end-user provide feedback if a deployed structured prediction model generates inconsistent output, ignoring the structural complexity of human language? This is an emerging topic with recent progress in synthetic or constrained settings, and the next big leap would require testing and tuning models in real-world settings. We present a new dataset, Interscript, containing user feedback on a deployed model that generates complex everyday tasks. Interscript contains 8,466 data points -- the input is a possibly erroneous script and user feedback, and the output is a modified script. We posit two use-cases of Interscript that might significantly advance the state-of-the-art in interactive learning. The dataset is available at: https://github.com/allenai/interscript.
Maximum Bayes Smatch Ensemble Distillation for AMR Parsing
Lee, Young-Suk, Astudillo, Ramon Fernandez, Hoang, Thanh Lam, Naseem, Tahira, Florian, Radu, Roukos, Salim
AMR parsing has experienced an unprecedented increase in performance in the last three years, due to a mixture of effects including architecture improvements and transfer learning. Self-learning techniques have also played a role in pushing performance forward. However, for the most recent high-performing parsers, the effect of self-learning and silver data generation seems to be fading. In this paper we show that it is possible to overcome these diminishing returns of silver data by combining Smatch-based ensembling techniques with ensemble distillation. In an extensive experimental setup, we push single-model English parser performance above 85 Smatch for the first time and return to substantial gains. We also attain a new state-of-the-art for cross-lingual AMR parsing for Chinese, German, Italian and Spanish. Finally, we explore the impact of the proposed distillation technique on domain adaptation, and show that it can produce gains rivaling those of human-annotated data for QALD-9 and achieve a new state-of-the-art for BioAMR.
Text to SQL Queries
WikiSQL is one of the most popular benchmarks in semantic parsing. It is a supervised text-to-SQL dataset, beautifully hand-annotated by Amazon Mechanical Turk workers. Some of the early works on WikiSQL modeled this as a sequence generation problem using seq2seq, but we are moving away from it. The text has to be cleaned before it is passed to the model: expanding contractions, removing stop words, and removing non-alphanumeric characters from the corpus. Because the dataset consists of SQL queries and table headers, we featurize the text using a tokenizer from the nltk library and then concatenate the query and headers.
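The cleaning and concatenation steps above might look roughly like the following. This is a dependency-free sketch, not the post's actual code: it uses `str.split` where the text uses an nltk tokenizer (`nltk.word_tokenize` could be swapped in), and the contraction table, the small stop-word set, the `[SEP]` separator token, and the function names `clean` and `featurize` are all assumptions made for illustration.

```python
import re

# Ordered (pattern, replacement) pairs; specific cases come before the generic "n't".
# This table is a small hypothetical sample, not an exhaustive list.
CONTRACTIONS = [
    (r"won't", "will not"),
    (r"can't", "cannot"),
    (r"n't", " not"),
    (r"'re", " are"),
    (r"'s", " is"),
    (r"'ll", " will"),
]

# A tiny stand-in for a full stop-word list such as the one shipped with nltk.
STOP_WORDS = {"the", "a", "an", "of", "is", "in", "to"}


def clean(text):
    """Lowercase, expand contractions, drop non-alphanumerics and stop words."""
    text = text.lower()
    for pattern, replacement in CONTRACTIONS:
        text = re.sub(pattern, replacement, text)
    # Replace anything that is not a letter, digit or whitespace with a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return [token for token in text.split() if token not in STOP_WORDS]


def featurize(question, headers):
    """Concatenate the cleaned question tokens with the table headers.

    Each header is preceded by a "[SEP]" token (an assumed convention) so the
    model can tell where the question ends and each column name begins.
    """
    tokens = clean(question)
    for header in headers:
        tokens.append("[SEP]")
        tokens.extend(header.lower().split())
    return tokens


if __name__ == "__main__":
    print(featurize("What's the name of the player?", ["Player", "Team Name"]))
```

Keeping the headers in the same token sequence as the question lets the model attend to column names when predicting which columns a query refers to.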
Natural Answer Generation: From Factoid Answer to Full-length Answer using Grammar Correction
Jain, Manas, Saha, Sriparna, Bhattacharyya, Pushpak, Chinnadurai, Gladvin, Vatsa, Manish Kumar
Question Answering systems these days typically use template-based language generation. Though adequate for a domain-specific task, these systems are too restrictive and predefined for domain-independent systems. This paper proposes a system that outputs a full-length answer given a question and the extracted factoid answer (short spans such as named entities) as the input. Our system uses constituency and dependency parse trees of questions. A transformer-based Grammar Error Correction model, GECToR (2020), is used as a post-processing step for better fluency. We compare our system with (i) Modified Pointer Generator (SOTA) and (ii) Fine-tuned DialoGPT for factoid questions. We also test our approach on existential (yes-no) questions with better results. Our model generates more accurate and fluent answers than the state-of-the-art (SOTA) approaches. The evaluation is done on the NewsQA and SQuAD datasets, with increments of 0.4 and 0.9 percentage points in ROUGE-1 score respectively. Also, the inference time is reduced by 85% compared to the SOTA. The improved datasets used for our evaluation will be released as part of the research contribution.