AITopics

2305.16328

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > India (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
(6 more...)

Genre: Research Report (0.81)

Industry:

Education (0.67)
Health & Medicine > Therapeutic Area (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(2 more...)

arXiv.org Artificial Intelligence May-12-2023

Discourse Analysis via Questions and Answers: Parsing Dependency Structures of Questions Under Discussion

Ko, Wei-Jen, Wu, Yating, Dalton, Cutter, Srinivas, Dananjay, Durrett, Greg, Li, Junyi Jessy

Automatic discourse processing is bottlenecked by data: current discourse formalisms pose highly demanding annotation tasks involving large taxonomies of discourse relations, making them inaccessible to lay annotators. This work instead adopts the linguistic framework of Questions Under Discussion (QUD) for discourse analysis and seeks to derive QUD structures automatically. QUD views each sentence as an answer to a question triggered in prior context; thus, we characterize relationships between sentences as free-form questions, in contrast to exhaustive fine-grained taxonomies. We develop the first-of-its-kind QUD parser that derives a dependency structure of questions over full documents, trained using a large, crowdsourced question-answering dataset DCQA (Ko et al., 2022). Human evaluation results show that QUD dependency parsing is possible for language models trained with this crowdsourced, generalizable annotation scheme. We illustrate how our QUD structure is distinct from RST trees, and demonstrate the utility of QUD analysis in the context of document simplification. Our findings show that QUD parsing is an appealing alternative for automatic discourse processing.
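
As a rough illustration of the structure described in this abstract, the sketch below represents a QUD dependency structure as a list of edges, each linking a sentence to an earlier sentence through a free-form question. The class name `QudEdge`, the validity check, and the example questions are hypothetical; the rule that every sentence after the first has exactly one earlier anchor is an assumption drawn from the abstract's wording, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QudEdge:
    answer: int    # index of the sentence that answers the question
    anchor: int    # index of the earlier sentence that triggers the question
    question: str  # free-form question text (no fixed relation taxonomy)


def is_valid_qud_structure(edges, num_sentences):
    """Assumed constraint: each sentence after the first answers exactly one
    question, and that question is anchored in an earlier sentence."""
    answered = sorted(e.answer for e in edges)
    if answered != list(range(1, num_sentences)):
        return False
    return all(0 <= e.anchor < e.answer for e in edges)


# Hypothetical three-sentence document: sentences 1 and 2 both elaborate on sentence 0.
edges = [
    QudEdge(answer=1, anchor=0, question="Why did the city close the bridge?"),
    QudEdge(answer=2, anchor=0, question="How long will the bridge stay closed?"),
]
print(is_valid_qud_structure(edges, num_sentences=3))
```

Because every anchor precedes its answer, the edges form a dependency tree rooted at the first sentence.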

artificial intelligence, natural language, proceedings, (17 more...)

2210.05905

Country:

North America > United States > California (0.05)
North America > United States > Nevada (0.04)
Europe > Bosnia and Herzegovina > Federation of Bosnia and Herzegovina > Sarajevo Canton > Sarajevo (0.04)
(4 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Government > Regional Government > North America Government > United States Government (1.00)
Law (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)

Kalpakchi, Dmytro, Boye, Johan

Quinductor: a multilingual data-driven method for generating reading-comprehension questions using Universal Dependencies

arXiv.org Artificial Intelligence May-12-2023

We propose a multilingual data-driven method for generating reading comprehension questions using dependency trees. Our method provides a strong, mostly deterministic, and inexpensive-to-train baseline for less-resourced languages. While a language-specific corpus is still required, its size is nowhere near those required by modern neural question generation (QG) architectures. Our method surpasses QG baselines previously reported in the literature and shows good performance in terms of human evaluation.

1 Introduction

We are interested in question generation (QG) - the task of automatically generating reading comprehension questions and their correct answers from given declarative sentences. Numerous methods have been proposed for solving this task, most of which have been aimed at the English language. Recent methods are based on neural networks and rely on the availability of large-scale datasets, such as SQuAD (Rajpurkar et al. 2016) - a question-answering dataset repurposed for QG - or large-scale pretrained models, such as GPT-3 (Brown et al. 2020). Early methods, mostly based on context-free grammars, relied on the strict word order and the limited inflectional morphology of English. These traits made it relatively straightforward to craft handwritten templates based on these grammars. The above-mentioned idiosyncrasies and the unique availability of large-scale resources for English leave a number of open challenges for developing QG methods applicable to languages other than English. The first challenge is the lack of large-scale training datasets and the prohibitively high cost of obtaining such resources. State-of-the-art QG methods for English train their models on the previously mentioned SQuAD dataset, which contains more than 100,000 questions. Obtaining a good-quality dataset of a similar size is very expensive, especially for languages with fewer native speakers around the world.
The second challenge is knowing how well available methods developed for English would generalize to other languages, especially synthetic ones with richer inflectional morphology and less strict word order (e.g., Finnish, Turkish or Russian). To the best of our knowledge, not much research has been done on QG for these kinds of languages. The third challenge is assessing the obtained performance results.

machine learning, natural language, question answering, (23 more...)

2103.10121

Country:

South America > Brazil (0.04)
Europe > Sweden > Stockholm > Stockholm (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
(2 more...)

Genre:

Workflow (0.93)
Research Report (0.63)

Industry: Education > Assessment & Standards > Student Performance (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.92)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.91)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

arXiv.org Artificial Intelligence May-11-2023

The Pipeline System of ASR and NLU with MLM-based Data Augmentation toward STOP Low-resource Challenge

Futami, Hayato, Huynh, Jessica, Arora, Siddhant, Wu, Shih-Lun, Kashiwagi, Yosuke, Peng, Yifan, Yan, Brian, Tsunoo, Emiru, Watanabe, Shinji

This paper describes our system for the low-resource domain adaptation track (Track 3) in the Spoken Language Understanding Grand Challenge, which is part of the ICASSP Signal Processing Grand Challenge 2023. In this track, we adopt a pipeline approach of ASR and NLU. For ASR, we fine-tune Whisper for each domain with upsampling. For NLU, we fine-tune BART on all the Track 3 data and then on low-resource domain data. We apply masked LM (MLM)-based data augmentation, where some of the input tokens and their corresponding target labels are replaced using MLM. We also apply a retrieval-based approach, where the model input is augmented with similar training samples. As a result, we achieved exact match (EM) accuracy of 63.3/75.0 (average: 69.15) for the reminder/weather domains and won first place in the challenge.
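
The retrieval-based step mentioned in this abstract can be pictured with a toy sketch like the one below. Everything in it is an assumption made for illustration: the word-overlap (Jaccard) similarity, the `=>` and `|` separators, the function names, and the example utterances and labels. The abstract does not say how similarity is computed or how retrieved samples are formatted.

```python
def jaccard(a, b):
    """Word-overlap similarity between two utterances (assumed metric)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def augment_input(utterance, train_pairs, k=1, sep=" | "):
    """Append the k most similar (text, label) training pairs to the model input."""
    ranked = sorted(train_pairs, key=lambda p: jaccard(utterance, p[0]), reverse=True)
    retrieved = [f"{text} => {label}" for text, label in ranked[:k]]
    return sep.join([utterance] + retrieved)


# Hypothetical training samples; the labels are placeholders, not real parses.
train_pairs = [
    ("remind me to call mom", "REMINDER_PARSE"),
    ("what is the weather in boston", "WEATHER_PARSE"),
]
print(augment_input("what is the weather today", train_pairs))
```

Under these assumptions the query shares four words with the weather sample and none with the reminder sample, so the weather sample is the one appended.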

artificial intelligence, data augmentation, natural language, (16 more...)

2305.01194

Country: North America > United States (0.15)

Genre: Research Report (0.40)

Industry: Energy > Oil & Gas > Midstream (0.41)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.51)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.35)

McPheat, Lachlan, Sadrzadeh, Mehrnoosh, Wazni, Hadi, Wijnholds, Gijs

Categorical Vector Space Semantics for Lambek Calculus with a Relevant Modality

arXiv.org Artificial Intelligence May-11-2023

We develop a categorical compositional distributional semantics for Lambek Calculus with a Relevant Modality !L*, which has a limited version of the contraction and permutation rules. The categorical part of the semantics is a monoidal biclosed category with a coalgebra modality, very similar to the structure of a Differential Category. We instantiate this category to finite dimensional vector spaces and linear maps via "quantisation" functors and work with three concrete interpretations of the coalgebra modality. We apply the model to construct categorical and concrete semantic interpretations for the motivating example of !L*: the derivation of a phrase with a parasitic gap. The effectiveness of the concrete interpretations is evaluated via a disambiguation task, on an extension of a sentence disambiguation dataset to parasitic gap phrases, using BERT, Word2Vec, and FastText vectors and Relational tensors.

artificial intelligence, machine learning, natural language, (18 more...)

doi: 10.32408/compositionality-5-2

2005.03074

Country:

North America > United States > New York > New York County > New York City (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
(8 more...)

Genre: Research Report (0.63)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Supervised Learning > Representation Of Examples (0.63)

arXiv.org Artificial Intelligence May-10-2023

SPSQL: Step-by-step Parsing Based Framework for Text-to-SQL Generation

Shen, Ran, Sun, Gang, Shen, Hao, Li, Yiling, Jin, Liangfeng, Jiang, Han

Converting text into structured query language (Text2SQL) is a research hotspot in natural language processing (NLP) with broad application prospects. In the era of big data, databases are used in all walks of life, and the collected data is large in scale, diverse in variety, and wide in scope, which makes querying cumbersome and inefficient and places higher demands on Text2SQL models. In practical applications, the current mainstream end-to-end Text2SQL model is not only difficult to build, due to its complex structure and high requirements for training data, but also difficult to adjust, due to its massive number of parameters. In addition, its accuracy often falls short of the desired result. To address this, this paper proposes a pipelined Text2SQL method: SPSQL. The method decomposes the Text2SQL task into four subtasks (table selection, column selection, SQL generation, and value filling), which are cast as a text classification problem, a sequence labeling problem, and two text generation problems, respectively. We then construct data formats for the different subtasks based on existing data and improve the accuracy of the overall model by improving the accuracy of each submodel. We also use a named entity recognition module and data augmentation to optimize the overall model. We construct the dataset from the marketing business data of the State Grid Corporation of China. Experiments demonstrate that our proposed method achieves the best performance compared with the end-to-end method and other pipeline methods.
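
The four-stage decomposition named in this abstract can be sketched as a chain of small functions. This is a toy stand-in under loud assumptions: the schema, the keyword-matching rules, the `[VALUE]` placeholder, and the fixed `filter_column` are all hypothetical, and each rule-based function only marks the place where the paper would use a trained classifier, sequence labeler, or text generator.

```python
# Hypothetical schema used only for this illustration.
SCHEMA = {"customers": ["name", "city"], "orders": ["amount", "date"]}


def select_table(question):
    """Stage 1 (text classification in the paper): pick one table."""
    tokens = question.lower().split()
    for table in SCHEMA:
        if table in tokens:
            return table
    return next(iter(SCHEMA))  # fall back to the first table


def select_columns(question, table):
    """Stage 2 (sequence labeling in the paper): pick the mentioned columns."""
    tokens = question.lower().split()
    return [col for col in SCHEMA[table] if col in tokens]


def generate_sql_skeleton(table, columns, filter_column="city"):
    """Stage 3 (text generation in the paper): SQL with a value placeholder."""
    return f"SELECT {', '.join(columns)} FROM {table} WHERE {filter_column} = '[VALUE]'"


def fill_values(skeleton, question):
    """Stage 4 (text generation in the paper): toy rule that uses the last word."""
    return skeleton.replace("[VALUE]", question.split()[-1])


def text_to_sql(question):
    table = select_table(question)
    columns = select_columns(question, table)
    return fill_values(generate_sql_skeleton(table, columns), question)


print(text_to_sql("show the name of customers in Hangzhou"))
```

Splitting the task this way means each stage can be evaluated and improved on its own, which is the motivation the abstract gives for the pipeline design.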

artificial intelligence, machine learning, natural language, (18 more...)

2305.11061

Country: Asia > China > Zhejiang Province > Hangzhou (0.05)

Genre: Research Report (0.64)

Industry: Energy (0.49)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.69)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.65)

Fernández-González, Daniel

Structured Sentiment Analysis as Transition-based Dependency Parsing

Structured sentiment analysis (SSA) aims to automatically extract people's opinions from a text in natural language and adequately represent that information in a graph structure. One of the most accurate methods for performing SSA was recently proposed and consists of approaching it as a dependency parsing task. Although the literature shows that transition-based algorithms excel in dependency parsing in terms of accuracy and efficiency, all proposed attempts to tackle SSA following that approach were based on graph-based models. In this article, we present the first transition-based method to address SSA as dependency parsing. Specifically, we design a transition system that processes the input text in a left-to-right pass, incrementally generating the graph structure containing all identified opinions. To effectively implement our final transition-based model, we resort to a Pointer Network architecture as a backbone. From an extensive evaluation, we demonstrate that our model offers the best performance to date in practically all cases among prior dependency-based methods, and surpasses recent task-specific techniques on the most challenging datasets. We additionally include an in-depth analysis and empirically show that the overall time-complexity cost of our approach is quadratic in the sentence length, being more efficient than top-performing graph-based parsers.
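
For readers unfamiliar with transition-based parsing, the sketch below runs the classic arc-standard transition system over a fixed action sequence. It is not the transition system proposed in this article, which the abstract describes only at a high level. It is a textbook stand-in meant to show how a left-to-right pass can build arcs incrementally; the function name and the example sentence are hypothetical.

```python
def run_arc_standard(num_words, actions):
    """Apply arc-standard transitions and return (head, dependent) arcs."""
    stack, buffer, arcs = [], list(range(num_words)), []
    for action in actions:
        if action == "SHIFT":
            # Move the next unread word onto the stack.
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":
            # The second-from-top word becomes a dependent of the top word.
            dependent = stack.pop(-2)
            arcs.append((stack[-1], dependent))
        elif action == "RIGHT-ARC":
            # The top word becomes a dependent of the word below it.
            dependent = stack.pop()
            arcs.append((stack[-1], dependent))
        else:
            raise ValueError(f"unknown action: {action}")
    return arcs


# Hypothetical sentence "She likes it" with word indices 0, 1, 2; "likes" is the head.
actions = ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "RIGHT-ARC"]
print(run_arc_standard(3, actions))
```

In a learned parser the action sequence would be predicted by a model at each step rather than supplied by hand.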

computational linguistic, machine learning, natural language, (19 more...)

2305.05311

Country:

Europe (1.00)
North America > United States > Minnesota (0.28)

Genre: Research Report (1.00)

Industry:

Materials > Chemicals > Industrial Gases > Liquified Gas (0.67)
Materials > Chemicals > Commodity Chemicals > Petrochemicals > LNG (0.67)
Energy > Oil & Gas > Midstream (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Chen, Le, Mahmud, Quazi Ishtiaque, Phan, Hung, Ahmed, Nesreen K., Jannesari, Ali

Learning to Parallelize with OpenMP by Augmented Heterogeneous AST Representation

Detecting parallelizable code regions is a challenging task, even for experienced developers. Numerous recent studies have explored the use of machine learning for code analysis and program synthesis, including parallelization, in light of the success of machine learning in natural language processing. However, applying machine learning techniques to parallelism detection presents several challenges, such as the lack of an adequate dataset for training, an effective code representation with rich information, and a suitable machine learning model to learn the latent features of code for diverse analyses. To address these challenges, we propose a novel graph-based learning approach called Graph2Par that utilizes a heterogeneous augmented abstract syntax tree (Augmented-AST) representation for code. The proposed approach focuses primarily on loop-level parallelization with OpenMP. Moreover, we create an OMP_Serial dataset with 18598 parallelizable and 13972 non-parallelizable loops to train the machine learning models. Our results show that our proposed approach detects parallelizable code regions with 85% accuracy and outperforms the state-of-the-art token-based machine learning approach. These results indicate that our approach is competitive with state-of-the-art tools and capable of handling loops with complex structures that other tools may overlook.
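
To put the reported accuracy in context, the short calculation below derives the class balance implied by the two loop counts in the abstract. The majority-class baseline is a reference point computed here on the assumption that accuracy is measured over a split with the same class balance; it is not a figure reported by the authors.

```python
parallelizable = 18598      # loop count stated in the abstract
non_parallelizable = 13972  # loop count stated in the abstract

total = parallelizable + non_parallelizable
# Accuracy of a trivial classifier that always predicts the larger class.
majority_baseline = max(parallelizable, non_parallelizable) / total

print(total)
print(f"{majority_baseline:.3f}")
```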

artificial intelligence, machine learning, natural language, (15 more...)

2305.05779

Country:

Asia > Middle East > Iran > Tehran Province > Tehran (0.04)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
North America > United States > Iowa (0.04)
(3 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.89)

Cascading and Direct Approaches to Unsupervised Constituency Parsing on Spoken Sentences

Tseng, Yuan, Lai, Cheng-I, Lee, Hung-yi

Past work on unsupervised parsing is constrained to written form. In this paper, we present the first study on unsupervised spoken constituency parsing given unlabeled spoken sentences and unpaired textual data. The goal is to determine the spoken sentences' hierarchical syntactic structure in the form of constituency parse trees, such that each node is a span of audio that corresponds to a constituent. We compare two approaches: (1) cascading an unsupervised automatic speech recognition (ASR) model and an unsupervised parser to obtain parse trees on ASR transcripts, and (2) directly training an unsupervised parser on continuous word-level speech representations. The latter is done by first splitting utterances into sequences of word-level segments, and then aggregating self-supervised speech representations within segments to obtain segment embeddings. We find that separately training a parser on the unpaired text and directly applying it to ASR transcripts for inference produces better results for unsupervised parsing. Additionally, our results suggest that accurate segmentation alone may be sufficient to parse spoken sentences accurately. Finally, we show the direct approach may learn head-directionality correctly for both head-initial and head-final languages without any explicit inductive bias.
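
The segment-embedding step of the direct approach can be sketched as follows. Mean pooling is an assumption, since the abstract says only that representations are "aggregated" within segments; the function name, the boundary format (start inclusive, end exclusive), and the tiny feature vectors are likewise hypothetical.

```python
def segment_embeddings(frames, boundaries):
    """Mean-pool frame-level feature vectors inside each word-level segment.

    frames: list of equal-length feature vectors, one per audio frame.
    boundaries: list of (start, end) frame indices, with end exclusive.
    """
    embeddings = []
    for start, end in boundaries:
        segment = frames[start:end]
        if not segment:
            raise ValueError(f"empty segment: ({start}, {end})")
        dim = len(segment[0])
        embeddings.append(
            [sum(frame[d] for frame in segment) / len(segment) for d in range(dim)]
        )
    return embeddings


# Three hypothetical 2-dimensional frames split into two word-level segments.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(segment_embeddings(frames, [(0, 2), (2, 3)]))
```

The resulting sequence has one vector per segment, so it can be fed to a parser in place of word embeddings.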

artificial intelligence, natural language, transcript, (18 more...)

2303.08809

Country:

North America > United States (0.04)
North America > Canada > Quebec > Montreal (0.04)
Asia > Taiwan (0.04)

Genre: Research Report > New Finding (0.68)

Technology: Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)

GAP-Gen: Guided Automatic Python Code Generation

Zhao, Junchen, Song, Yurun, Wang, Junlin, Harris, Ian G.

Automatic code generation from natural language descriptions can be highly beneficial during the process of software development. In this work, we propose GAP-Gen, a Guided Automatic Python Code Generation method based on Python syntactic constraints and semantic constraints. We first introduce Python syntactic constraints in the form of Syntax-Flow, which is a simplified version of the Abstract Syntax Tree (AST) that reduces its size and complexity while maintaining crucial syntactic information of Python code. In addition to Syntax-Flow, we introduce Variable-Flow, which abstracts variable and function names consistently throughout the code. In our work, rather than pretraining, we focus on modifying the fine-tuning process, which reduces computational requirements but retains high generation performance on the automatic Python code generation task. GAP-Gen fine-tunes the transformer-based language models T5 and CodeT5 using the Code-to-Docstring datasets CodeSearchNet, CodeSearchNet AdvTest and Code-Docstring Corpus from EdinburghNLP. Our experiments show that GAP-Gen achieves better results on the automatic Python code generation task than previous works.
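
The idea of abstracting names consistently can be illustrated with Python's standard `ast` module. The sketch below is not GAP-Gen's Variable-Flow implementation. The `func_N` / `var_N` naming scheme, the class name `NameAbstractor`, and the choice to leave unassigned names such as builtins untouched are all assumptions made for this example. It requires Python 3.9 or later for `ast.unparse`.

```python
import ast


class NameAbstractor(ast.NodeTransformer):
    """Replace function, argument, and assigned variable names with placeholders."""

    def __init__(self):
        self.mapping = {}

    def _alias(self, name, prefix):
        # Reuse the same placeholder every time a name reappears.
        if name not in self.mapping:
            self.mapping[name] = f"{prefix}_{len(self.mapping)}"
        return self.mapping[name]

    def visit_FunctionDef(self, node):
        node.name = self._alias(node.name, "func")
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = self._alias(node.arg, "var")
        return node

    def visit_Name(self, node):
        # Rename only names that are assigned here or already known,
        # so builtins such as print or len are left alone.
        if isinstance(node.ctx, ast.Store) or node.id in self.mapping:
            node.id = self._alias(node.id, "var")
        return node


def abstract_names(source):
    tree = NameAbstractor().visit(ast.parse(source))
    return ast.unparse(tree)


example = "def add(a, b):\n    total = a + b\n    return total\n"
print(abstract_names(example))
```

Because the mapping is shared across the whole tree, every occurrence of a name receives the same placeholder, which is the consistency property the abstract highlights.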

artificial intelligence, machine learning, natural language, (20 more...)

2201.0881

Country: North America > United States (0.68)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Automatic Programming (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)