 Grammars & Parsing


A Survey on Table Question Answering: Recent Advances

arXiv.org Artificial Intelligence

Table Question Answering (Table QA) refers to providing precise answers from tables in response to a user's question. In recent years, many works have addressed table QA, but comprehensive surveys on this research topic are lacking. Hence, we aim to provide an overview of available datasets and representative methods in table QA. We classify existing methods for table QA into five categories according to their techniques: semantic-parsing-based, generative, extractive, matching-based, and retriever-reader-based methods. Moreover, because table QA is still a challenging task for existing methods, we also identify and outline several key challenges and discuss potential future directions of table QA.


Learning grammar with a divide-and-concur neural network

arXiv.org Artificial Intelligence

We implement a divide-and-concur iterative projection approach to context-free grammar inference. Unlike most state-of-the-art models of natural language processing, our method requires a relatively small number of discrete parameters, making the inferred grammar directly interpretable -- one can read off from a solution how to construct grammatically valid sentences. Another advantage of our approach is the ability to infer meaningful grammatical rules from just a few sentences, compared to the hundreds of gigabytes of training data many other models employ. We demonstrate several ways of applying our approach: classifying words and inferring a grammar from scratch, taking an existing grammar and refining its categories and rules, and taking an existing grammar and expanding its lexicon as it encounters new words in new data.


Building a Relation Extraction Baseline for Gene-Disease Associations: A Reproducibility Study

arXiv.org Artificial Intelligence

Reproducibility is an important task in scientific research. It is crucial for researchers to compare newly developed systems with the state-of-the-art to assess whether they made a breakthrough. However, previous works may not be immediately reproducible, for example due to the lack of source code. In this work, we reproduce DEXTER, a system to automatically extract Gene-Disease Associations (GDAs) from biomedical abstracts.[1] The goal is to provide a benchmark for future works regarding Relation Extraction (RE), enabling researchers to test and compare their results.


AI

#artificialintelligence

The purposeful exchange of information caused by the creation and perception of signals drawn from a shared system of conventional signs is known as communication. Most animals employ signals to convey vital messages: there's food here, there's a predator nearby, approach, recede, and let's mate. Communication can help agents succeed in a partially visible world because they can learn knowledge that others have observed or inferred. Humans are the most talkative of all species; thus, computer agents will need to master language if they are to be useful. Language models for communication are examined in this chapter.


A Double-Graph Based Framework for Frame Semantic Parsing

arXiv.org Artificial Intelligence

Frame semantic parsing is a fundamental NLP task, which consists of three subtasks: frame identification, argument identification and role classification. Most previous studies tend to neglect relations between different subtasks and arguments and pay little attention to ontological frame knowledge defined in FrameNet. In this paper, we propose a Knowledge-guided Incremental semantic parser with Double-graph (KID). We first introduce Frame Knowledge Graph (FKG), a heterogeneous graph containing both frames and FEs (Frame Elements) built on the frame knowledge so that we can derive knowledge-enhanced representations for frames and FEs. Besides, we propose Frame Semantic Graph (FSG) to represent frame semantic structures extracted from the text with graph structures. In this way, we can transform frame semantic parsing into an incremental graph construction problem to strengthen interactions between subtasks and relations between arguments. Our experiments show that KID outperforms the previous state-of-the-art method by up to 1.7 F1-score on two FrameNet datasets. Our code is available at https://github.com/PKUnlp-icler/KID.


Natural Language Processing: Part of Speech Tagging - PythonAlgos

#artificialintelligence

Part of Speech (POS) Tagging is an integral part of Natural Language Processing (NLP). The first step in most state-of-the-art NLP pipelines is tokenization. Tokenization is the separation of text into "tokens". Tokens are generally regarded as individual pieces of language: words, whitespace, and punctuation. Once we tokenize our text, we can tag each token with its part of speech. Note that this article only covers the details of part of speech tagging for English.
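
The tokenization step described above can be illustrated with a minimal sketch. It is not the article's own code: the pattern `TOKEN_PATTERN` and the function `tokenize` are hypothetical names, and the regular expression is only one simple way to split text into words, whitespace, and punctuation. Real NLP libraries use more elaborate rules.

```python
import re

# Assumed, simplified token pattern (not from the article):
#   \w+      a run of word characters (a "word")
#   \s+      a run of whitespace
#   [^\w\s]  a single punctuation or symbol character
TOKEN_PATTERN = re.compile(r"\w+|\s+|[^\w\s]")


def tokenize(text):
    """Split text into word, whitespace, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize("Hello, world!")
print(tokens)  # ['Hello', ',', ' ', 'world', '!']
```

Because every character matches exactly one of the three alternatives, joining the tokens back together reproduces the input text, which makes the split easy to sanity-check before a tagger is applied.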


Stylistic Fingerprints, POS-tags and Inflected Languages: A Case Study in Polish

arXiv.org Artificial Intelligence

In stylometric investigations, frequencies of the most frequent words (MFWs) and character n-grams outperform other style-markers, even if their performance varies significantly across languages. In inflected languages, word endings play a prominent role, and hence different word forms cannot be recognized using generic text tokenization. Countless inflected word forms make frequencies sparse, making most statistical procedures complicated. Presumably, applying one of the NLP techniques, such as lemmatization and/or parsing, might increase the performance of classification. The aim of this paper is to examine the usefulness of grammatical features (as assessed via POS-tag n-grams) and lemmatized forms in recognizing authorial profiles, in order to address the underlying issue of the degree of freedom of choice within lexis and grammar. Using a corpus of Polish novels, we performed a series of supervised authorship attribution benchmarks, in order to compare the classification accuracy for different types of lexical and syntactic style-markers. Even if the performance of POS-tags as well as lemmatized forms was notoriously worse than that of lexical markers, the difference was not substantial and never exceeded ca. 15%.
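
As a rough illustration of what a POS-tag n-gram feature looks like, the sketch below counts n-grams over a sequence of tags. It is not the paper's pipeline: the function name `pos_ngrams` and the hand-written tag sequence are assumptions for illustration, and in practice the tags would come from a tagger run over each text.

```python
from collections import Counter


def pos_ngrams(tags, n):
    """Count contiguous n-grams in a sequence of POS tags."""
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))


# Hypothetical, hand-written tag sequence (not real tagger output).
tags = ["DET", "NOUN", "VERB", "DET", "NOUN"]
bigrams = pos_ngrams(tags, 2)
print(bigrams[("DET", "NOUN")])  # 2
```

Counts of this kind, normalized per text, are one form such style-markers can take as input to a classifier.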


Intel AVX-512 A Big Win For... JSON Parsing Performance

#artificialintelligence

In addition to the many HPC workloads and other scientific computing tasks where Intel's AVX-512 performance on their latest processor proves very beneficial, it turns out AVX-512 can also provide significant benefit to a much more mundane web server task: JSON parsing. The simdjson project, which is focused on "parsing gigabytes of JSON per second", this week issued simdjson 2.0, headlined by an Intel-led contribution of AVX-512 support. The JavaScript Object Notation (JSON) data interchange format is heavily used by practically all major websites and web applications in some capacity and can be handled by pretty much all programming languages. JSON really needs no introduction. For the past few years, simdjson has been an open-source (Apache 2.0 licensed) project aimed at delivering the highest-performance JSON parser, one that can parse "gigabytes of JSON per second" and claims to be 4 to 25x faster than alternatives.


Pre-training Data Quality and Quantity for a Low-Resource Language: New Corpus and BERT Models for Maltese

arXiv.org Artificial Intelligence

Multilingual language models such as mBERT have seen impressive cross-lingual transfer to a variety of languages, but many languages remain excluded from these models. In this paper, we analyse the effect of pre-training with monolingual data for a low-resource language that is not included in mBERT -- Maltese -- with a range of pre-training set ups. We conduct evaluations with the newly pre-trained models on three morphosyntactic tasks -- dependency parsing, part-of-speech tagging, and named-entity recognition -- and one semantic classification task -- sentiment analysis. We also present a newly created corpus for Maltese, and determine the effect that the pre-training data size and domain have on the downstream performance. Our results show that using a mixture of pre-training domains is often superior to using Wikipedia text only. We also find that a fraction of this corpus is enough to make significant leaps in performance over Wikipedia-trained models. We pre-train and compare two models on the new corpus: a monolingual BERT model trained from scratch (BERTu), and a further pre-trained multilingual BERT (mBERTu). The models achieve state-of-the-art performance on these tasks, despite the new corpus being considerably smaller than typically used corpora for high-resourced languages. On average, BERTu outperforms or performs competitively with mBERTu, and the largest gains are observed for higher-level tasks.


How Writing SQL Could Get a Whole Lot Easier With NLQ

#artificialintelligence

What is the most intuitive, efficient, and least mentally draining way to ask a question? It is using the simplest words possible in your own language. Modern search engines such as Google have made searching for information online using simple sentences commonplace. This has helped create our modern society and improved access to information globally; it's hard to overstate how transformational the advent of the search engine truly was. However, searching for information on the internet didn't truly become democratized and popular until we could ask the internet questions using natural language in the same way we would talk to another person.