subordinate clause
That's Optional: A Contemporary Exploration of "that" Omission in English Subordinate Clauses
[...] effectiveness of their utterances when faced with multiple options for structuring a message. The UID hypothesis (Frank and Jaeger, 2008; Collins, 2014; Hahn et al., 2020) suggests that speakers tend to spread information evenly throughout an utterance, avoiding large fluctuations in the per-unit [...] First, we extend the investigation to a much larger corpus of informal written English collected from social media. Second, we use contemporary large language models (LLMs) to estimate the operationalizations of information uniformity in syntactic reduction, suggesting the robustness of our findings.
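As a rough illustration of what "spreading information evenly" could mean in practice, the sketch below converts per-word probabilities into surprisal (in bits) and measures how much it fluctuates across an utterance. The function names, the use of variance as the uniformity measure, and the probabilities themselves are assumptions made for illustration; the excerpt does not say which operationalizations or which LLM the authors used.

```python
import math
from statistics import pvariance


def surprisals(probabilities):
    """Return the surprisal -log2(p), in bits, for each per-word probability."""
    for p in probabilities:
        if not 0.0 < p <= 1.0:
            raise ValueError("each probability must lie in (0, 1]")
    return [-math.log2(p) for p in probabilities]


def surprisal_variance(probabilities):
    """Population variance of surprisal: one possible (assumed) uniformity score.

    Lower values mean information is spread more evenly across the words.
    """
    return pvariance(surprisals(probabilities))


# Hypothetical per-word probabilities for two phrasings of the same message.
even_phrasing = [0.25, 0.25, 0.25, 0.25]
spiky_phrasing = [0.5, 0.5, 0.5, 0.03125]

print(surprisal_variance(even_phrasing))
print(surprisal_variance(spiky_phrasing))
```

Under this assumed measure, the phrasing with the lower variance would be the more uniform one; in a real study the probabilities would come from a language model rather than being hand-picked.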
Distinguishing Translations by Human, NMT, and ChatGPT: A Linguistic and Statistical Approach
Jiang, Zhaokun, Lv, Qianxi, Zhang, Ziyin
The growing popularity of neural machine translation (NMT) and LLMs represented by ChatGPT underscores the need for a deeper understanding of their distinct characteristics and relationships. Such understanding is crucial for language professionals and researchers to make informed decisions and tactful use of these cutting-edge translation technologies, but remains underexplored. This study aims to fill this gap by investigating three key questions: (1) the distinguishability of ChatGPT-generated translations from NMT and human translation (HT), (2) the linguistic characteristics of each translation type, and (3) the degree of resemblance between ChatGPT-produced translations and HT or NMT. To achieve these objectives, we employ statistical testing, machine learning algorithms, and multidimensional analysis (MDA) to analyze Spokesperson's Remarks and their translations. After extracting a wide range of linguistic features, supervised classifiers demonstrate high accuracy in distinguishing the three translation types, whereas unsupervised clustering techniques do not yield satisfactory results. Another major finding is that ChatGPT-produced translations exhibit greater similarity with NMT than HT in most MDA dimensions, which is further corroborated by distance computing and visualization. These novel insights shed light on the interrelationships among the three translation types and have implications for the future advancements of NMT and generative AI.
Traditional Readability Formulas Compared for English
Lee, Bruce W., Lee, Jason Hyung-Jong
Traditional English readability formulas, or equations, were largely developed in the 20th century. Nonetheless, many researchers still rely on them for various NLP applications. This phenomenon is presumably due to the convenience and straightforwardness of readability formulas. In this work, we contribute to the NLP community by (1) introducing the New English Readability Formula (NERF), (2) recalibrating the coefficients of old readability formulas (Flesch-Kincaid Grade Level, Fog Index, SMOG Index, Coleman-Liau Index, and Automated Readability Index), (3) evaluating the readability formulas for use in text simplification studies and medical texts, and (4) developing a Python-based program for wide application in various NLP projects.
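To make the notion of a formula with coefficients concrete, here is a minimal sketch of one of the traditional formulas named above, the Automated Readability Index, using its commonly published coefficients (4.71, 0.5, 21.43). These are not the recalibrated values from the paper, and this is not the authors' Python program; the function name and the crude sentence and word splitting are assumptions made for illustration.

```python
import re


def automated_readability_index(text: str) -> float:
    """Approximate the traditional Automated Readability Index of a text.

    ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43
    """
    # Crude segmentation (an assumption): sentences end at ., ! or ?;
    # words are whitespace-separated; characters are letters and digits only.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = text.split()
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    characters = sum(ch.isalnum() for ch in text)
    return (
        4.71 * characters / len(words)
        + 0.5 * len(words) / len(sentences)
        - 21.43
    )


simple = "The cat sat. The dog ran fast."
dense = "Multidimensional readability recalibration necessitates comprehensive experimentation."
print(automated_readability_index(simple))
print(automated_readability_index(dense))
```

Recalibrating such a formula amounts to refitting the three constants against a graded corpus, which is why the coefficients are worth keeping as named parameters in any fuller implementation.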
A description of Turkish Discourse Bank 1.2 and an examination of common dependencies in Turkish discourse
Zeyrek, Deniz, Er, Mustafa Erolcan
We describe Turkish Discourse Bank 1.2, the latest version of a discourse corpus annotated for explicitly or implicitly conveyed discourse relations, their constitutive units, and senses in the Penn Discourse Treebank style. We present an evaluation of the recently added tokens and examine three commonly occurring dependency patterns that hold among the constitutive units of a pair of adjacent discourse relations, namely, shared arguments, full embedding and partial containment of a discourse relation. We present three major findings: (a) implicitly conveyed relations occur more often than explicitly conveyed relations in the data; (b) it is much more common for two adjacent implicit discourse relations to share an argument than for two adjacent explicit relations to do so; (c) both full embedding and partial containment of discourse relations are pervasive in the corpus, which can be partly due to subordinator connectives whose preposed subordinate clause tends to be selected together with the matrix clause rather than being selected alone. Finally, we briefly discuss the implications of our findings for Turkish discourse parsing.
The Rise and Fall of the English Sentence - Issue 54: The Unspoken
"[[[When in the course of human events it becomes necessary for one people [to dissolve the political bands [which have connected them with another]] and [to assume among the powers of the earth, the separate and equal station [to which the laws of Nature and of Nature's God entitle them]]], a decent respect to the opinions of mankind requires [that they should declare the causes [which impel them to the separation]]]." But how did it ever make its way into the world? At 71 words, it is composed of eight separate clauses, each anchored by its own verb, nested within one another in various arrangements. The main clause (a decent respect to the opinions of mankind requires …) hangs suspended above a 50-word subordinate clause that must first be unfurled. To some linguists, Noam Chomsky among them, sentences like these illustrate an essential property of human language. These scientists have argued that recursion, a technique that allows chunks of language such as sentences to be embedded inside each other (with no hard limit on the number of nestings) is a universal human ability, perhaps even the one uniquely human ability that supports language. It's what allows us to create--literally--an infinite variety of novel sentences out of a limited inventory of words.
Learning Sentence-internal Temporal Relations
In this paper we propose a data intensive approach for inferring sentence-internal temporal relations. Temporal inference is relevant for practical NLP applications which either extract or synthesize temporal information (e.g., summarisation, question answering). Our method bypasses the need for manual coding by exploiting the presence of markers like "after", which overtly signal a temporal relation. We first show that models trained on main and subordinate clauses connected with a temporal marker achieve good performance on a pseudo-disambiguation task simulating temporal inference (during testing the temporal marker is treated as unseen and the models must select the right marker from a set of possible candidates). Secondly, we assess whether the proposed approach holds promise for the semi-automatic creation of temporal annotations. Specifically, we use a model trained on noisy and approximate data (i.e., main and subordinate clauses) to predict intra-sentential relations present in TimeBank, a corpus annotated with rich temporal information. Our experiments compare and contrast several probabilistic models differing in their feature space, linguistic assumptions and data requirements. We evaluate performance against gold standard corpora and also against human subjects.