Overview of the Plagiarism Detection Task at PAN 2025

Greiner-Petter, André, Fröbe, Maik, Wahle, Jan Philip, Ruas, Terry, Gipp, Bela, Aizawa, Akiko, Potthast, Martin

arXiv.org Artificial Intelligence

The generative plagiarism detection task at PAN 2025 aims to identify automatically generated textual plagiarism in scientific articles and to align the plagiarized passages with their respective sources. We created a novel large-scale dataset of automatically generated plagiarism using three large language models: Llama, DeepSeek-R1, and Mistral. In this task overview paper, we outline the creation of this dataset, summarize and compare the results of all participants and four baselines, and evaluate the submissions on the most recent previous plagiarism detection task, from PAN 2015, in order to gauge the robustness of the proposed approaches. We found that the current iteration does not invite a large variety of approaches, as naive semantic similarity methods based on embedding vectors already provide promising results of up to 0.8 recall and 0.5 precision. In contrast, most of these approaches underperform significantly on the 2015 dataset, indicating a lack of generalizability.
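The "naive semantic similarity" baseline the overview describes can be sketched roughly as follows. This is an illustrative toy, not a participant's system: the bag-of-words `embed` function stands in for a real sentence-embedding model, and the threshold is an assumption, not a value used at PAN.

```python
from collections import Counter
import math

def embed(text):
    # Toy stand-in for a sentence-embedding model: a bag-of-words count vector.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def align(suspicious_sents, source_sents, threshold=0.5):
    # Flag (suspicious, source) sentence pairs whose embedding similarity
    # exceeds the threshold -- the naive semantic-similarity baseline.
    pairs = []
    for i, s in enumerate(suspicious_sents):
        es = embed(s)
        for j, t in enumerate(source_sents):
            if cosine(es, embed(t)) >= threshold:
                pairs.append((i, j))
    return pairs
```

With a real embedding model in place of `embed`, this pairwise alignment is essentially the kind of approach the overview reports reaching up to 0.8 recall and 0.5 precision.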


AI is infiltrating the classroom. Here's how teachers and students say they use it

Los Angeles Times

A ChatGPT logo is seen in December 2023.


The Provenance Problem: LLMs and the Breakdown of Citation Norms

Earp, Brian D., Yuan, Haotian, Koplin, Julian, Mann, Sebastian Porsdam

arXiv.org Artificial Intelligence

The increasing use of generative AI in scientific writing raises urgent questions about attribution and intellectual credit. When a researcher employs ChatGPT to draft a manuscript, the resulting text may echo ideas from sources the author has never encountered. If an AI system reproduces insights from, for example, an obscure 1975 paper without citation, does this constitute plagiarism? We argue that such cases exemplify the 'provenance problem': a systematic breakdown in the chain of scholarly credit. Unlike conventional plagiarism, this phenomenon does not involve intent to deceive (researchers may disclose AI use and act in good faith), yet the author still benefits from the uncredited intellectual contributions of others. This dynamic creates a novel category of attributional harm that current ethical and professional frameworks fail to address. As generative AI becomes embedded across disciplines, the risk that significant ideas will circulate without recognition threatens both the reputational economy of science and the demands of epistemic justice. This Perspective analyzes how AI challenges established norms of authorship, introduces conceptual tools for understanding the provenance problem, and proposes strategies to preserve integrity and fairness in scholarly communication.


Matrix-Driven Instant Review: Confident Detection and Reconstruction of LLM Plagiarism on PC

Zhang, Ruichong

arXiv.org Artificial Intelligence

In recent years, concerns about intellectual property (IP) in large language models (LLMs) have grown significantly. Plagiarizing other LLMs (through direct weight copying, upcycling, pruning, or continual pretraining) and claiming authorship without proper attribution under the original license is serious misconduct that can cause significant financial and reputational harm to the original developers. However, existing methods for detecting LLM plagiarism fall short in key areas. They fail to accurately reconstruct weight correspondences, lack the ability to compute statistical significance measures such as $p$-values, and may mistakenly flag models trained on similar data as being related. To address these limitations, we propose Matrix-Driven Instant Review (MDIR), a novel method that leverages matrix analysis and Large Deviation Theory. MDIR achieves accurate reconstruction of weight relationships, provides rigorous $p$-value estimation, and focuses exclusively on weight similarity without requiring full model inference. Experimental results demonstrate that MDIR reliably detects plagiarism even after extensive transformations, such as random permutations and continual pretraining with trillions of tokens. Moreover, all detections can be performed on a single PC within an hour, making MDIR both efficient and accessible.
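MDIR itself rests on matrix analysis and Large Deviation Theory; as a much simpler illustration of why weight similarity can survive transformations (this sketch is not the authors' method), the code below checks whether two weight matrices share a singular-value spectrum, which is unchanged by row or column permutations:

```python
import numpy as np

def singular_spectrum(w):
    # Singular values are invariant under row/column permutations,
    # so they survive the weight shuffling a plagiarist might apply.
    return np.linalg.svd(w, compute_uv=False)

def spectra_match(w1, w2, rtol=1e-6):
    # Crude relatedness signal: matching spectra after an orthogonal
    # transformation such as a permutation of neurons.
    s1, s2 = singular_spectrum(w1), singular_spectrum(w2)
    return s1.shape == s2.shape and np.allclose(s1, s2, rtol=rtol)
```

Because permutation matrices are orthogonal, a permuted copy `P @ W @ Q` keeps the singular values of `W` exactly, whereas an independently trained matrix almost surely does not; MDIR goes much further by also reconstructing the weight correspondence and attaching a $p$-value.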


BMDetect: A Multimodal Deep Learning Framework for Comprehensive Biomedical Misconduct Detection

Zhou, Yize, Zhang, Jie, Wang, Meijie, Yu, Lun

arXiv.org Artificial Intelligence

Academic misconduct detection in biomedical research remains challenging due to algorithmic narrowness in existing methods and fragmented analytical pipelines. We present BMDetect, a multimodal deep learning framework that integrates journal metadata (SJR, institutional data), semantic embeddings (PubMedBERT), and GPT-4o-mined textual attributes (methodological statistics, data anomalies) for holistic manuscript evaluation. Key innovations include: (1) multimodal fusion of domain-specific features to reduce detection bias; (2) quantitative evaluation of feature importance, identifying journal authority metrics (e.g., SJR-index) and textual anomalies (e.g., statistical outliers) as dominant predictors; and (3) the BioMCD dataset, a large-scale benchmark with 13,160 retracted articles and 53,411 controls. BMDetect achieves 74.33% AUC, outperforming single-modality baselines by 8.6%, and demonstrates transferability across biomedical subfields. This work advances scalable, interpretable tools for safeguarding research integrity.
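The multimodal fusion the abstract describes can be pictured as feature concatenation ahead of a classifier. This is a hypothetical sketch; the function name, normalization, and feature shapes are assumptions, not the BMDetect implementation:

```python
import numpy as np

def fuse_features(journal_feats, text_embedding, mined_attrs):
    # Hypothetical late-fusion step: z-score the journal metadata
    # (e.g., SJR-index), then concatenate it with the text embedding
    # and the mined attribute vector into one classifier input.
    j = np.asarray(journal_feats, dtype=float)
    j = (j - j.mean()) / (j.std() + 1e-8)
    return np.concatenate([j, np.asarray(text_embedding, dtype=float),
                           np.asarray(mined_attrs, dtype=float)])
```

The fused vector would then feed a standard classifier; the paper's contribution lies in which modalities are fused and in quantifying their importance, not in the concatenation itself.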


Assessing the Prevalence of AI-assisted Cheating in Programming Courses: A Pilot Study

Delphino, Kaléu

arXiv.org Artificial Intelligence

Abstract: Tools that can generate computer code in response to inputs written in natural language, such as ChatGPT, pose an existential threat to Computer Science education in its current form, since students can now use these tools to solve assignments without much effort. While that risk has already been recognized by scholars, the proportion of the student body engaging in this new kind of plagiarism is still an open problem. We conducted a pilot study in a large CS class (n=120) to assess the feasibility of estimating AI plagiarism through anonymous surveys and interviews. More than 25% of the survey respondents admitted to committing AI plagiarism. Conversely, only one student agreed to be interviewed. Given the high levels of misconduct acknowledgment, we conclude that surveys are an effective method for studies on the matter, while interviews should be avoided or designed in a way that entices participation.

1 INTRODUCTION

Generative artificial intelligence (GenAI, not to be confused with general artificial intelligence) refers to models that generate new content. The generation is usually guided by an input text known as the "prompt". For example, giving the prompt "a vase of red flowers" to a GenAI model would generate an image depicting red flowers in a vase. Practical applications of GenAI are now mainstream thanks to advances in neural networks. In particular, the clever use of attention mechanisms and the subsequent development of the transformer architecture made efficient learning possible over large text corpora (Vaswani et al., 2023). ChatGPT, an AI application based on an LLM, can convincingly engage in a conversation and answer questions across multiple subjects (OpenAI, 2022). Research on applications of LLMs in education is still in its infancy, but looks promising. Personal tutoring systems (Chang, 2022), content explanation (Leinonen et al., 2023), and assignment generation (Jury et al., 2024) are a few of the ideas that have been explored. From another perspective, LLMs are already a reality in schools.


Revealed: Thousands of UK university students caught cheating using AI

The Guardian

Thousands of university students in the UK have been caught misusing ChatGPT and other artificial intelligence tools in recent years, while traditional forms of plagiarism show a marked decline, a Guardian investigation can reveal. A survey of academic integrity violations found almost 7,000 proven cases of cheating using AI tools in 2023-24, equivalent to 5.1 for every 1,000 students. That was up from 1.6 cases per 1,000 in 2022-23. Figures up to May suggest that number will increase again this year to about 7.5 proven cases per 1,000 students – but recorded cases represent only the tip of the iceberg, according to experts. The data highlights a rapidly evolving challenge for universities: trying to adapt assessment methods to the advent of technologies such as ChatGPT and other AI-powered writing tools.


Fox News AI Newsletter: Hollywood studios sue 'bottomless pit of plagiarism'

FOX News

The Minions pose during the world premiere of the film "Despicable Me 4" in New York City, June 9, 2024. The website of Midjourney, an artificial intelligence (AI) capable of creating AI art, is seen on a smartphone on April 3, 2023, in Berlin, Germany. 'PIRACY IS PIRACY': Two major Hollywood studios are suing Midjourney, a popular AI image generator, over its use and distribution of intellectual property. AI RACE: Meta CEO Mark Zuckerberg is reportedly building a team of experts to develop artificial general intelligence (AGI) that can meet or exceed human capabilities. TECH HUB: New York is poised to play a central role in the development of artificial intelligence (AI), OpenAI executives told key business and civic leaders on Tuesday.


Disney and Universal sue AI image creator Midjourney, alleging copyright infringement

The Guardian

In their lawsuit, the entertainment giants called Midjourney's popular AI-powered image generator a "bottomless pit of plagiarism" for its alleged reproductions of the studios' best-known characters. The suit, filed in federal court in Los Angeles, claims Midjourney pirated the libraries of the two Hollywood studios, making and distributing without permission "innumerable" copies of their marquee characters such as Darth Vader from Star Wars, Elsa from Frozen, and the Minions from Despicable Me. Midjourney did not immediately respond to a request for comment. Horacio Gutierrez, Disney's chief legal officer, said in a statement: "We are bullish on the promise of AI technology and optimistic about how it can be used responsibly as a tool to further human creativity, but piracy is piracy, and the fact that it's done by an AI company does not make it any less infringing." NBCUniversal's executive vice-president and general counsel, Kim Harris, said the company was suing to "protect the hard work of all the artists whose work entertains and inspires us and the significant investment we make in our content". The studios further argue that Midjourney continued to release new versions of its AI image service that boast higher-quality infringing images.


All That Glitters is Not Novel: Plagiarism in AI Generated Research

Gupta, Tarun, Pruthi, Danish

arXiv.org Artificial Intelligence

Automating scientific research is considered the final frontier of science. Recently, several papers claim autonomous research agents can generate novel research ideas. Amidst the prevailing optimism, we document a critical concern: a considerable fraction of such research documents are smartly plagiarized. Unlike past efforts where experts evaluate the novelty and feasibility of research ideas, we request 13 experts to operate under a different situational logic: to identify similarities between LLM-generated research documents and existing work. Concerningly, the experts identify 24% of the 50 evaluated research documents to be either paraphrased (with one-to-one methodological mapping), or significantly borrowed from existing work. These reported instances are cross-verified by authors of the source papers. Problematically, these LLM-generated research documents do not acknowledge original sources, and bypass inbuilt plagiarism detectors. Lastly, through controlled experiments we show that automated plagiarism detectors are inadequate at catching deliberately plagiarized ideas from an LLM. We recommend a careful assessment of LLM-generated research, and discuss the implications of our findings on research and academic publishing.