Do Language Models Plagiarize?

Lee, Jooyoung, Le, Thai, Chen, Jinghui, Lee, Dongwon

arXiv.org Artificial Intelligence 

In this work, therefore, we study three types of plagiarism (i.e., verbatim, paraphrase, and idea) among GPT-2 generated texts, in comparison to its training data, and further analyze the plagiarism patterns of fine-tuned LMs with domain-specific corpora which are extensively used in practice. Our results suggest that (1) three types of plagiarism widely exist in LMs beyond memorization, (2) both size and decoding methods of LMs are strongly associated with the degrees of plagiarism they exhibit, and (3) fine-tuned LMs' plagiarism patterns vary based on their corpus similarity and homogeneity. Given that a majority of LMs' training data is scraped from the Web without informing content owners, their reiteration of words, phrases, and even core ideas from training sets into generated texts has ethical implications. Their patterns are likely to exacerbate as both

Language Models (LMs) have become core elements of Natural Language Processing (NLP) solutions, excelling in a wide range of tasks such as natural language generation (NLG), speech recognition, machine translation, and question answering. The development of large-scale text corpora (generally scraped from the Web) has enabled researchers to train increasingly large-scale LMs. In particular, large-scale LMs have demonstrated unprecedented performance on NLG such that LM-generated texts routinely show more novel and interesting stories than human writings do [35], and the distinction between machine-authored and human-written texts has become non-trivial [52, 53]. As a result, there has been a significant increase in the use of LMs in user-facing products and critical applications.
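Of the three plagiarism types the abstract names, verbatim reuse is the most mechanical to check. As a minimal illustrative sketch (not the paper's actual detection pipeline, and with hypothetical inputs), one can flag word n-grams that a generated text shares with any training document:

```python
# Illustrative sketch of verbatim-overlap detection: report every word
# n-gram of a generated text that also appears in a training corpus.
# The corpus and generation below are made-up examples, not paper data.

def ngrams(tokens, n):
    """Return the set of word n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def verbatim_overlap(generated, corpus_docs, n=6):
    """Return the n-grams shared between `generated` and any corpus doc."""
    gen_grams = ngrams(generated.lower().split(), n)
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc.lower().split(), n)
    return gen_grams & corpus_grams

corpus = ["the quick brown fox jumps over the lazy dog near the river bank"]
generated = "yesterday the quick brown fox jumps over the lazy dog again"
overlap = verbatim_overlap(generated, corpus, n=6)  # 4 shared 6-grams
```

Paraphrase and idea plagiarism are harder precisely because this kind of exact matching misses them, which is why the paper treats them as distinct categories beyond memorization.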
