Grammars & Parsing
Trading Syntax Trees for Wordpieces: Target-oriented Opinion Words Extraction with Wordpieces and Aspect Enhancement
Mensah, Samuel, Sun, Kai, Aletras, Nikolaos
State-of-the-art target-oriented opinion word extraction (TOWE) models typically use BERT-based text encoders that operate on the word level, along with graph convolutional networks (GCNs) that incorporate syntactic information extracted from syntax trees. These methods achieve limited gains with GCNs and have difficulty using BERT wordpieces. Meanwhile, BERT wordpieces are known to be effective at representing rare words or words with insufficient context information. To address this issue, this work trades syntax trees for BERT wordpieces by entirely removing the GCN component from the methods' architectures. To enhance TOWE performance, we tackle the issue of aspect representation loss during encoding. Instead of solely utilizing a sentence as the input, we use a sentence-aspect pair. Our relatively simple approach achieves state-of-the-art results on benchmark datasets and should serve as a strong baseline for further research.
On the Intersection of Context-Free and Regular Languages
Pasti, Clemente, Opedal, Andreas, Pimentel, Tiago, Vieira, Tim, Eisner, Jason, Cotterell, Ryan
The Bar-Hillel construction is a classic result in formal language theory. It shows, by a simple construction, that the intersection of a context-free language and a regular language is itself context-free. In the construction, the regular language is specified by a finite-state automaton. However, neither the original construction (Bar-Hillel et al., 1961) nor its weighted extension (Nederhof and Satta, 2003) can handle finite-state automata with $\varepsilon$-arcs. While it is possible to remove $\varepsilon$-arcs from a finite-state automaton efficiently without modifying the language, such an operation modifies the automaton's set of paths. We give a construction that generalizes the Bar-Hillel construction to the case where the desired automaton has $\varepsilon$-arcs, and further prove that our generalized construction leads to a grammar that encodes the structure of both the input automaton and grammar while retaining the asymptotic size of the original construction.
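For orientation, the classic construction that the abstract takes as its starting point can be sketched as follows. This is a minimal illustration of the textbook, $\varepsilon$-free case only, not the generalized construction the paper contributes. It assumes a grammar in Chomsky normal form and an unweighted automaton; the function name `bar_hillel` and the data layout (tuples for rules and arcs) are hypothetical choices made for this sketch.

```python
from itertools import product


def bar_hillel(binary_rules, lexical_rules, start_symbol,
               states, arcs, initial, final):
    """Classic Bar-Hillel construction (no epsilon arcs, CNF grammar).

    binary_rules:  set of (A, B, C) meaning A -> B C
    lexical_rules: set of (A, a)    meaning A -> a
    arcs:          set of (q, a, r) meaning q --a--> r
    Returns (new_start, rules); each rule is (lhs, rhs_tuple).
    Nonterminals of the new grammar are triples (q, A, r):
    "A derives a string that takes the automaton from q to r".
    """
    new_start = "S'"  # fresh start symbol (assumed unused in the grammar)
    rules = set()

    # S' -> (q0, S, qf) for every initial/final state pair.
    for q0, qf in product(initial, final):
        rules.add((new_start, ((q0, start_symbol, qf),)))

    # (q, A, r) -> (q, B, s) (s, C, r) for every state triple.
    for (A, B, C) in binary_rules:
        for q, s, r in product(states, repeat=3):
            rules.add(((q, A, r), ((q, B, s), (s, C, r))))

    # (q, A, r) -> a whenever A -> a and the automaton has q --a--> r.
    for (A, a) in lexical_rules:
        for (q, symbol, r) in arcs:
            if symbol == a:
                rules.add(((q, A, r), (a,)))

    return new_start, rules


# Toy input: grammar for {"ab"}, automaton accepting "ab".
start, rules = bar_hillel(
    binary_rules={("S", "A", "B")},
    lexical_rules={("A", "a"), ("B", "b")},
    start_symbol="S",
    states={0, 1, 2},
    arcs={(0, "a", 1), (1, "b", 2)},
    initial={0},
    final={2},
)
```

The binary-rule loop is what gives the construction its cubic dependence on the number of states: each grammar rule is copied once per state triple, whether or not the resulting nonterminals are reachable.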
Silver Syntax Pre-training for Cross-Domain Relation Extraction
Bassignana, Elisa, Ginter, Filip, Pyysalo, Sampo, van der Goot, Rob, Plank, Barbara
Relation Extraction (RE) remains a challenging task, especially when considering realistic out-of-domain evaluations. One of the main reasons for this is the limited training size of current RE datasets: obtaining high-quality (manually annotated) data is extremely expensive and cannot realistically be repeated for each new domain. An intermediate training step on data from related tasks has been shown to be beneficial across many NLP tasks. However, this setup still requires supplementary annotated data, which is often not available. In this paper, we investigate intermediate pre-training specifically for RE. We exploit the affinity between syntactic structure and semantic RE, and identify the syntactic relations which are closely related to RE by being on the shortest dependency path between two entities. We then take advantage of the high accuracy of current syntactic parsers in order to automatically obtain large amounts of low-cost pre-training data. By pre-training our RE model on the relevant syntactic relations, we are able to outperform the baseline in five out of six cross-domain setups, without any additional annotated data.
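The idea of collecting the syntactic relations on the shortest dependency path between two entities can be illustrated with a small sketch. It assumes the parse is already available as a list of (head, dependent, label) edges and treats the tree as an undirected graph for a breadth-first search; the function name and the hand-written toy parse are hypothetical, and the paper's actual pipeline may differ.

```python
from collections import deque


def shortest_dependency_path(edges, source, target):
    """Return the dependency labels on the shortest path source -> target.

    edges: list of (head_index, dependent_index, label).
    The tree is searched as an undirected graph, so the path may go
    up to a common ancestor and down again. Returns None if the two
    tokens are not connected.
    """
    neighbours = {}
    for head, dep, label in edges:
        neighbours.setdefault(head, []).append((dep, label))
        neighbours.setdefault(dep, []).append((head, label))

    # Breadth-first search, remembering how each token was reached.
    came_from = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            break
        for nxt, label in neighbours.get(node, []):
            if nxt not in came_from:
                came_from[nxt] = (node, label)
                queue.append(nxt)

    if target not in came_from:
        return None

    # Walk back from target to source, then reverse.
    labels = []
    node = target
    while came_from[node] is not None:
        node, label = came_from[node]
        labels.append(label)
    return labels[::-1]


# Hand-written toy parse of "Curie discovered polonium in 1898"
# (token indices: 0 Curie, 1 discovered, 2 polonium, 3 in, 4 1898).
toy_edges = [
    (1, 0, "nsubj"),
    (1, 2, "obj"),
    (1, 4, "obl"),
    (4, 3, "case"),
]
path_labels = shortest_dependency_path(toy_edges, 0, 2)
```

For the two entity tokens "Curie" and "polonium", the path runs through the verb, so the labels collected are the subject and object relations.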
EventNet-ITA: Italian Frame Parsing for Events
This paper introduces EventNet-ITA, a large, multi-domain corpus annotated with event frames for Italian, and presents an efficient approach for multi-label Frame Parsing. The approach is then evaluated on the dataset. Covering a wide range of individual, social and historical phenomena, the main contribution of EventNet-ITA is to provide the research community with a resource for textual event mining and a novel and extensive tool for Frame Parsing in Italian.
The Role of Semantic Parsing in Understanding Procedural Text
Faghihi, Hossein Rajaby, Kordjamshidi, Parisa, Teng, Choh Man, Allen, James
In this paper, we investigate whether symbolic semantic representations, extracted from deep semantic parsers, can help reasoning over the states of involved entities in a procedural text. We consider a deep semantic parser (TRIPS) and semantic role labeling as two sources of semantic parsing knowledge. First, we propose PROPOLIS, a symbolic parsing-based procedural reasoning framework. Second, we integrate semantic parsing information into state-of-the-art neural models to conduct procedural reasoning. Our experiments indicate that explicitly incorporating such semantic knowledge improves procedural understanding. This paper presents new metrics for evaluating procedural reasoning tasks that clarify the challenges and identify differences among neural, symbolic, and integrated models.
Accessible Interfaces for the Development and Deployment of Robotic Platforms
Accessibility is one of the most important features in the design of robots and their interfaces. This thesis proposes methods that improve the accessibility of robots for three different target audiences: consumers, researchers, and learners. In order for humans and robots to work together effectively, they both must be able to communicate with each other. We tackle the problem of generating route instructions that are readily understandable by novice humans for the navigation of a priori unknown indoor environments. We then move on to the related problem of enabling robots to understand natural language utterances in the context of learning to operate articulated objects (e.g., fridges, drawers) by leveraging kinematic models. Next, we turn our focus to the development of accessible and reproducible robotic platforms for scientific research. We propose a new concept for reproducible robotics research that integrates development and benchmarking, so that reproducibility is obtained "by design" from the beginning of the research and development process. We then propose a framework called SHARC (SHared Autonomy for Remote Collaboration), to improve accessibility for underwater robotic intervention operations. SHARC allows multiple remote scientists to efficiently plan and execute high-level sampling procedures using an underwater manipulator while deferring low-level control to the robot. Lastly, we developed the first hardware-based MOOC in AI and robotics. This course allows learners to study autonomy hands-on by making real robots make their own decisions and accomplish broadly defined tasks. We design a new robotic platform from the ground up to support this new learning experience. A fully browser-based interface, based on leading tools and technologies for code development, testing, validation, and deployment serves to maximize the accessibility of these educational resources.
Open Korean Corpora: A Practical Report
Cho, Won Ik, Moon, Sangwhan, Song, Youngsook
Korean is often referred to as a low-resource language in the research community. While this claim is partially true, the perception also arises because the availability of resources is inadequately advertised and curated. This work curates and reviews a list of Korean corpora, first describing institution-level resource development, then iterating through a list of current open datasets for different types of tasks. We then propose a direction on how open-source dataset construction and releases should be done for less-resourced languages to promote research.
QURG: Question Rewriting Guided Context-Dependent Text-to-SQL Semantic Parsing
Chai, Linzheng, Xiao, Dongling, Yang, Jian, Yang, Liqun, Zhang, Qian-Wen, Cao, Yunbo, Li, Zhoujun, Yan, Zhao
Context-dependent Text-to-SQL aims to translate multi-turn natural language questions into SQL queries. Although various methods have exploited context-dependence information implicitly for contextual SQL parsing, there are few attempts to explicitly address the dependencies between the current question and the question context. This paper presents QURG, a novel Question Rewriting Guided approach to help the models achieve adequate contextual understanding. Specifically, we first train a question rewriting model to complete the current question based on question context, and convert them into a rewriting edit matrix. We further design a two-stream matrix encoder to jointly model the rewriting relations between question and context, and the schema linking relations between natural language and structured schema. Experimental results show that QURG significantly improves performance on two large-scale context-dependent datasets, SParC and CoSQL, especially for hard and long-turn questions.
CQE: A Comprehensive Quantity Extractor
Almasian, Satya, Kazakova, Vivian, Göldner, Philip, Gertz, Michael
Quantities are essential in documents to describe factual information. They are ubiquitous in application domains such as finance, business, medicine, and science in general. Compared to other information extraction approaches, interestingly only a few works exist that describe methods for a proper extraction and representation of quantities in text. In this paper, we present such a comprehensive quantity extraction framework from text data. It efficiently detects combinations of values and units, the behavior of a quantity (e.g., rising or falling), and the concept a quantity is associated with. Our framework makes use of dependency parsing and a dictionary of units, and it provides for a proper normalization and standardization of detected quantities. Using a novel dataset for evaluation, we show that our open source framework outperforms other systems and -- to the best of our knowledge -- is the first to detect concepts associated with identified quantities. The code and data underlying our framework are available at https://github.com/vivkaz/CQE.
Frustratingly Easy Label Projection for Cross-lingual Transfer
Chen, Yang, Jiang, Chao, Ritter, Alan, Xu, Wei
Translating training data into many languages has emerged as a practical solution for improving cross-lingual transfer. For tasks that involve span-level annotations, such as information extraction or question answering, an additional label projection step is required to map annotated spans onto the translated texts. Recently, a few efforts have utilized a simple mark-then-translate method to jointly perform translation and projection by inserting special markers around the labeled spans in the original sentence. However, as far as we are aware, no empirical analysis has been conducted on how this approach compares to traditional annotation projection based on word alignment. In this paper, we present an extensive empirical study across 57 languages and three tasks (QA, NER, and Event Extraction) to evaluate the effectiveness and limitations of both methods, filling an important gap in the literature. Experimental results show that our optimized version of mark-then-translate, which we call EasyProject, is easily applied to many languages and works surprisingly well, outperforming the more complex word alignment-based methods. We analyze several key factors that affect the end-task performance, and show EasyProject works well because it can accurately preserve label span boundaries after translation. We will publicly release all our code and data.
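The mark-then-translate idea described above can be sketched in a few lines: wrap each labeled span in markers, send the marked sentence through a translation system, then strip the markers from the output and read off the new span offsets. The sketch below is an illustration under stated assumptions, not the EasyProject implementation: square brackets as markers, the function names, and the hand-written German "translation" are all hypothetical, and labels are matched to markers by order, which only holds when translation does not reorder the marked spans.

```python
import re

# A marked span: an opening bracket, text without brackets, a closing bracket.
MARKED = re.compile(r"\[([^\[\]]*)\]")


def insert_markers(text, spans):
    """Wrap each (start, end, label) character span in square brackets.

    Spans are assumed to be non-overlapping.
    """
    parts = []
    cursor = 0
    for start, end, _label in sorted(spans):
        parts.append(text[cursor:start])
        parts.append("[" + text[start:end] + "]")
        cursor = end
    parts.append(text[cursor:])
    return "".join(parts)


def recover_spans(marked_text, labels):
    """Strip markers and return (clean_text, spans).

    labels are assigned to markers in order of appearance
    (an assumption; translation may reorder spans in practice).
    """
    clean_parts, spans = [], []
    cursor = 0   # position in marked_text
    length = 0   # length of the clean text built so far
    for match, label in zip(MARKED.finditer(marked_text), labels):
        before = marked_text[cursor:match.start()]
        inner = match.group(1)
        clean_parts.extend([before, inner])
        start = length + len(before)
        spans.append((start, start + len(inner), label))
        length = start + len(inner)
        cursor = match.end()
    clean_parts.append(marked_text[cursor:])
    return "".join(clean_parts), spans


source = "Marie Curie was born in Warsaw."
source_spans = [(0, 11, "PER"), (24, 30, "LOC")]
marked = insert_markers(source, source_spans)

# Stand-in for a machine translation system's output (hand-written).
translated = "[Marie Curie] wurde in [Warschau] geboren."
target, target_spans = recover_spans(translated, ["PER", "LOC"])
```

Because the markers travel through translation with the text they enclose, no separate word alignment step is needed to locate the spans in the target sentence.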