stylometry
- North America > Canada > Ontario > Toronto (0.14)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- (3 more...)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Speech (0.95)
- North America > Canada > Ontario > Toronto (0.14)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- (3 more...)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Speech (0.96)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
ChatGPT-generated texts show authorship traits that identify them as non-human
Dentella, Vittoria, Huang, Weihang, Mansi, Silvia Angela, Grieve, Jack, Leivada, Evelina
Large Language Models can emulate different writing styles, ranging from composing poetry that appears indistinguishable from that of famous poets to using slang that can convince people that they are chatting with a human online. While differences in style may not always be visible to the untrained eye, we can generally distinguish the writing of different people, like a linguistic fingerprint. This work examines whether a language model can also be linked to a specific fingerprint. Through stylometric and multidimensional register analyses, we compare human-authored and model-authored texts from different registers. We find that the model can successfully adapt its style depending on whether it is prompted to produce a Wikipedia entry vs. a college essay, but not in a way that makes it indistinguishable from humans. Concretely, the model shows more limited variation when producing outputs in different registers. Our results suggest that the model prefers nouns to verbs, thus showing a distinct linguistic backbone from humans, who tend to anchor language in the highly grammaticalized dimensions of tense, aspect, and mood. It is possible that the more complex domains of grammar reflect a mode of thought unique to humans, thus acting as a litmus test for Artificial Intelligence.
Introduction
Scholars from different disciplines have been addressing the question of what makes us human for centuries. For Nobel laureate Bertrand Russell, the answer is language, for "no matter how eloquently a dog may bark, he cannot tell you that his parents were poor but honest". Human language is both flexible and constrained at the same time, and this is why the Turing Test, described as a litmus test for Artificial Intelligence [Shieber 1994, French 2000], is linked to achieving a level of conversational proficiency that is highly complex, akin to that of a human [Turing 1950].
Human language is flexible in the sense that we all make different choices when conversing. Every human is thought to have a distinct linguistic fingerprint called idiolect [Halliday et al. 1964, Coulthard 2004]. This idiolect, which can be defined as an individual's unique use of linguistic forms (including lexical choices, collocations and fixed expressions, punctuation patterns, misspellings, and grammatical style), is critical for authorship attribution in a range of situations: from identifying that a poem with dashes, elliptical syntax, and unconventional capitalization is more likely authored by Emily Dickinson and not by William Shakespeare, to pinning down a person of interest in the course of a criminal investigation, as happened in the Unabomber case.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.05)
- Europe > Italy (0.04)
- North America > United States > Nebraska (0.04)
- (6 more...)
- Law (0.48)
- Health & Medicine (0.46)
- North America > Canada > Ontario > Toronto (0.28)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- (4 more...)
Stylometry recognizes human and LLM-generated texts in short samples
Przystalski, Karol, Argasiński, Jan K., Grabska-Gradzińska, Iwona, Ochab, Jeremi K.
The paper explores stylometry as a method to distinguish between texts created by Large Language Models (LLMs) and humans, addressing issues of model attribution, intellectual property, and ethical AI use. Stylometry has been used extensively to characterise the style and attribute authorship of texts. By applying it to LLM-generated texts, we identify their emergent writing patterns. The paper involves creating a benchmark dataset based on Wikipedia, with (a) human-written term summaries, (b) texts generated purely by LLMs (GPT-3.5/4, LLaMa 2/3, Orca, and Falcon), (c) processed through multiple text summarisation methods (T5, BART, Gensim, and Sumy), and (d) rephrasing methods (Dipper, T5). The 10-sentence long texts were classified by tree-based models (decision trees and LightGBM) using human-designed (StyloMetrix) and n-gram-based (our own pipeline) stylometric features that encode lexical, grammatical, syntactic, and punctuation patterns. The cross-validated results reached a performance of up to .87 Matthews correlation coefficient in the multiclass scenario with 7 classes, and accuracy between .79 and 1. in binary classification, with the particular example of Wikipedia and GPT-4 reaching up to .98 accuracy on a balanced dataset. Shapley Additive Explanations pinpointed features characteristic of the encyclopaedic text type, individual overused words, as well as a greater grammatical standardisation of LLMs with respect to human-written texts. These results show -- crucially, in the context of the increasingly sophisticated LLMs -- that it is possible to distinguish machine- from human-generated texts at least for a well-defined text type.
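The evaluation metric and the n-gram features mentioned in the abstract above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names `char_ngram_profile` and `matthews_corrcoef` are hypothetical, character trigrams are an assumed choice of n-gram, and the Matthews correlation coefficient is shown only in its binary form, whereas the paper reports a multiclass score.

```python
import math
from collections import Counter


def char_ngram_profile(text, n=3):
    """Relative frequencies of character n-grams (assumed feature type)."""
    text = text.lower()
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return {}
    counts = Counter(grams)
    return {gram: count / len(grams) for gram, count in counts.items()}


def matthews_corrcoef(y_true, y_pred):
    """Binary Matthews correlation coefficient for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the score is 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom
```

Unlike plain accuracy, the coefficient is 0 for a classifier that does no better than chance, which makes it a stricter summary when classes are imbalanced.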
- Europe > Ukraine > Sumy Oblast > Sumy (0.25)
- Europe > Poland > Lesser Poland Province > Kraków (0.04)
- Europe > Switzerland (0.04)
- (9 more...)
- Research Report > New Finding (1.00)
- Overview (1.00)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Government (1.00)
Generative Modeling of Individual Behavior at Scale
Omi, Nabil, Caccia, Lucas, Sarkar, Anurag, Ash, Jordan T., Sen, Siddhartha
There has been a growing interest in using AI to model human behavior, particularly in domains where humans interact with this technology. While most existing work models human behavior at an aggregate level, our goal is to model behavior at the individual level. Recent approaches to behavioral stylometry -- or the task of identifying a person from their actions alone -- have shown promise in domains like chess, but these approaches are either not scalable (e.g., fine-tune a separate model for each person) or not generative, in that they cannot generate actions. We address these limitations by framing behavioral stylometry as a multi-task learning problem -- where each task represents a distinct person -- and use parameter-efficient fine-tuning (PEFT) methods to learn an explicit style vector for each person. Style vectors are generative: they selectively activate shared "skill" parameters to generate actions in the style of each person. They also induce a latent space that we can interpret and manipulate algorithmically. In particular, we develop a general technique for style steering that allows us to steer a player's style vector towards a desired property. We apply our approach to two very different games, at unprecedented scales: chess (47,864 players) and Rocket League (2,000 players). We also show generality beyond gaming by applying our method to image generation, where we learn style vectors for 10,177 celebrities and use these vectors to steer their images.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Croatia > Dubrovnik-Neretva County > Dubrovnik (0.04)
- North America > Dominican Republic (0.04)
- (3 more...)
Stylomech: Unveiling Authorship via Computational Stylometry in English and Romanized Sinhala
Faumi, Nabeelah, Gunathilake, Adeepa, Wickramanayake, Benura, Dias, Deelaka, Sumanathilaka, TGDK
With the advent of Web 2.0, developments in social technology coupled with global communication have brought both positive and negative impacts to society. Copyright claims and author identification are deemed crucial, as content violations have increased considerably owing to a lack of proper ethics in society. Author attribution in both English and Romanized Sinhala has become a major requirement in the last few decades. As the area is largely unexplored, particularly within the context of Romanized Sinhala, the research contributes significantly to the field of computational linguistics. The proposed author attribution system offers a unique approach, allowing for the comparison of only two sets of text, a suspect author's text and an anonymous text, in a departure from traditional methodologies, which often rely on larger corpora. This work uses numerical representations of pairs of texts by the same and by different authors, so that the model trains on these representations rather than on the text itself. As a result, it can be applied to a multitude of authors and contexts, provided that the suspected author's text and the anonymous text are of reasonable quality. By expanding the scope of authorship attribution to encompass diverse linguistic contexts, the work contributes to fostering trust and accountability in digital communication, especially in Sri Lanka. This research presents a pioneering approach to author attribution in both English and Romanized Sinhala, addressing a critical need for content verification and intellectual property rights enforcement in the digital age.
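The pairwise setup described above, in which a model sees a numerical representation of two texts rather than the texts themselves, can be sketched as follows. The abstract does not specify the features or how a pair is encoded, so everything here is an assumption for illustration: the names `style_vector` and `pair_features`, the small `FUNCTION_WORDS` list, and the element-wise absolute difference used as the pair representation.

```python
import string

# Illustrative function words only; a real system would use a far larger,
# language-appropriate inventory.
FUNCTION_WORDS = ("the", "and", "of", "to", "in")


def style_vector(text):
    """Map one text to a small vector of surface style features."""
    words = [w.strip(string.punctuation) for w in text.lower().split()]
    words = [w for w in words if w]
    n_words = len(words) or 1  # avoid division by zero on empty input
    avg_word_len = sum(len(w) for w in words) / n_words
    punct_rate = sum(ch in string.punctuation for ch in text) / (len(text) or 1)
    function_rates = [words.count(fw) / n_words for fw in FUNCTION_WORDS]
    return [avg_word_len, punct_rate, *function_rates]


def pair_features(suspect_text, anonymous_text):
    """Assumed pair encoding: element-wise absolute difference of vectors."""
    a = style_vector(suspect_text)
    b = style_vector(anonymous_text)
    return [abs(x - y) for x, y in zip(a, b)]
```

A binary classifier trained on such vectors, labelled same-author or different-author, never memorises a particular writer, which is what would let it apply to authors unseen during training.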
- Asia > Sri Lanka (0.25)
- North America > United States > California > Santa Clara County > San Jose (0.04)
- North America > United States > California > Orange County > Anaheim (0.04)
- (2 more...)
- Research Report > Promising Solution (0.34)
- Overview > Innovation (0.34)
PetKaz at SemEval-2024 Task 8: Can Linguistics Capture the Specifics of LLM-generated Text?
Petukhova, Kseniia, Kazakov, Roman, Kochmar, Ekaterina
In this paper, we present our submission to the SemEval-2024 Task 8 "Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection", focusing on the detection of machine-generated texts (MGTs) in English. Specifically, our approach relies on combining embeddings from RoBERTa-base with diversity features and uses a resampled training set. We rank 12th out of 124 in Subtask A (monolingual track), and our results show that our approach is generalizable across unseen models and domains, achieving an accuracy of 0.91.
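The abstract above does not say which diversity features are used. As a hedged illustration of what such features can look like, the sketch below computes two common lexical diversity measures, type-token ratio and hapax ratio; both the choice of measures and the function name `diversity_features` are assumptions, not the submission's actual feature set.

```python
import re
from collections import Counter


def diversity_features(text):
    """Two simple lexical diversity measures over lowercase word tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {"ttr": 0.0, "hapax_ratio": 0.0}
    counts = Counter(tokens)
    # Type-token ratio: distinct words divided by total words.
    ttr = len(counts) / len(tokens)
    # Hapax ratio: words occurring exactly once, divided by total words.
    hapax_ratio = sum(1 for c in counts.values() if c == 1) / len(tokens)
    return {"ttr": ttr, "hapax_ratio": hapax_ratio}
```

Features like these would typically be concatenated with the sentence embedding before classification; both measures depend on text length, so comparing texts of similar length is advisable.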
- North America > United States > Michigan > Washtenaw County > Ann Arbor (0.04)
- North America > Mexico > Mexico City > Mexico City (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- (2 more...)
Defending Against Authorship Identification Attacks
Authorship identification has proven unsettlingly effective in inferring the identity of the author of an unsigned document, even when sensitive personal information has been carefully omitted. In the digital era, individuals leave a lasting digital footprint through their written content, whether it is posted on social media, stored on their employer's computers, or located elsewhere. When individuals need to communicate publicly yet wish to remain anonymous, there is little available to protect them from unwanted authorship identification. This unprecedented threat to privacy is evident in scenarios such as whistle-blowing. Proposed defenses against authorship identification attacks primarily aim to obfuscate one's writing style, thereby making it unlinkable to their pre-existing writing, while concurrently preserving the original meaning and grammatical integrity. The presented work offers a comprehensive review of the advancements in this research area spanning over the past two decades and beyond. It emphasizes the methodological frameworks of modification and generation-based strategies devised to evade authorship identification attacks, highlighting joint efforts from the differential privacy community. Limitations of current research are discussed, with a spotlight on open challenges and potential research avenues.
- North America > United States > Washington > King County > Seattle (0.14)
- Europe > Austria > Vienna (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- (32 more...)
- Research Report > New Finding (1.00)
- Overview (1.00)
- Research Report > Experimental Study (0.67)
- Information Technology > Security & Privacy (1.00)
- Government > Regional Government > North America Government > United States Government (1.00)
- Law > Civil Rights & Constitutional Law (0.92)
- Media (0.92)