copying
New York Times sues AI startup for 'illegal' copying of millions of articles
New York Times newspaper office building is seen in Manhattan on 26 October 2022. The New York Times sued an embattled artificial intelligence startup on Friday, accusing the firm of illegally copying millions of articles. The newspaper alleged Perplexity AI had distributed and displayed journalists' work without permission en masse. The Times said that Perplexity AI was also violating its trademarks under the Lanham Act, claiming the startup's generative AI products create fabricated content, or "hallucinations", and falsely attribute it to the newspaper by displaying it alongside its registered trademarks.
- North America > United States > New York (0.07)
- Europe > Ukraine (0.07)
- Oceania > Australia (0.05)
- (2 more...)
- Media > News (1.00)
- Law > Intellectual Property & Technology Law (1.00)
- Government > Regional Government > North America Government > United States Government (0.53)
Blameless Users in a Clean Room: Defining Copyright Protection for Generative Models
Are there any conditions under which a generative model's outputs are guaranteed not to infringe the copyrights of its training data? This is the question of "provable copyright protection" first posed by Vyas, Kakade, and Barak (ICML 2023). They define near access-freeness (NAF) and propose it as sufficient for protection. This paper revisits the question and establishes new foundations for provable copyright protection -- foundations that are firmer both technically and legally. First, we show that NAF alone does not prevent infringement. In fact, NAF models can enable verbatim copying, a blatant failure of copy protection that we dub being tainted. Then, we introduce our blameless copy protection framework for defining meaningful guarantees, and instantiate it with clean-room copy protection. Clean-room copy protection allows a user to control their risk of copying by behaving in a way that is unlikely to copy in a counterfactual clean-room setting. Finally, we formalize a common intuition about differential privacy and copyright by proving that DP implies clean-room copy protection when the dataset is golden, a copyright deduplication requirement.
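The NAF condition the abstract critiques bounds how much more likely the model is than a "safe" model to assign probability to any output. A minimal sketch of computing that bound for toy distributions over outputs, using the max-divergence instantiation (function name and dict representation are ours, not the paper's):

```python
import math

def naf_bound(model_probs, safe_probs):
    """Max-divergence log2 max_y p(y)/safe(y): a toy version of the NAF
    parameter k for a single prompt. Distributions are dicts mapping
    outputs to probabilities (our own simplified representation)."""
    k = 0.0
    for y, p in model_probs.items():
        q = safe_probs.get(y, 0.0)
        if p > 0 and q == 0:
            # The model can emit an output the safe model never would:
            # no finite k satisfies the bound.
            return math.inf
        if p > 0:
            k = max(k, math.log2(p / q))
    return k

# Toy example: the model upweights one output relative to the safe model.
model = {"a": 0.5, "b": 0.5}
safe = {"a": 0.25, "b": 0.75}
print(naf_bound(model, safe))  # log2(0.5/0.25) = 1.0
```

Note the failure mode the paper exploits is not visible at this level: a model can satisfy a finite bound for every prompt and still emit verbatim training text on some of them.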
- North America > United States > Pennsylvania (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
Understanding Cross Task Generalization in Handwriting-Based Alzheimer's Screening via Vision Language Adaptation
Gong, Changqing, Qin, Huafeng, El-Yacoubi, Mounim A.
Alzheimer's disease (AD) is a prevalent neurodegenerative disorder for which early detection is critical. Handwriting, often disrupted in prodromal AD, provides a non-invasive and cost-effective window into subtle motor and cognitive decline. Existing handwriting-based AD studies, mostly relying on online trajectories and hand-crafted features, have not systematically examined how task type influences diagnostic performance and cross-task generalization. Meanwhile, large-scale vision-language models have demonstrated remarkable zero- and few-shot anomaly detection in natural images and strong adaptability across medical modalities such as chest X-ray and brain MRI. However, handwriting-based disease detection remains largely unexplored within this paradigm. To close this gap, we introduce a lightweight Cross-Layer Fusion Adapter (CLFA) framework that repurposes CLIP for handwriting-based AD screening. CLFA implants multi-level fusion adapters within the visual encoder to progressively align representations toward handwriting-specific medical cues, enabling prompt-free and efficient zero-shot inference. Using this framework, we systematically investigate cross-task generalization (training on a specific handwriting task and evaluating on unseen ones) to reveal which task types and writing patterns most effectively discriminate AD. Extensive analyses further highlight characteristic stroke patterns and task-level factors that contribute to early AD identification, offering both diagnostic insights and a benchmark for handwriting-based cognitive assessment.
- Asia > China > Chongqing Province > Chongqing (0.04)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- Europe > France (0.04)
- Asia > South Korea (0.04)
Born a Transformer -- Always a Transformer? On the Effect of Pretraining on Architectural Abilities
Jobanputra, Mayank, Veitsman, Yana, Sarrof, Yash, Bakalova, Aleksandra, Demberg, Vera, Pavlick, Ellie, Hahn, Michael
Transformers have theoretical limitations in modeling certain sequence-to-sequence tasks, yet it remains largely unclear whether these limitations play a role in large-scale pretrained LLMs, or whether LLMs effectively overcome these constraints in practice due to the scale of both the models themselves and their pretraining data. We explore how these architectural constraints manifest after pretraining, by studying a family of $\textit{retrieval}$ and $\textit{copying}$ tasks inspired by Liu et al. [2024a]. We use a recently proposed framework for studying length generalization [Huang et al., 2025] to provide guarantees for each of our settings. Empirically, we observe an $\textit{induction-versus-anti-induction}$ asymmetry: pretrained models are better at retrieving tokens to the right (induction) than to the left (anti-induction) of a query token. This asymmetry disappears upon targeted fine-tuning when length generalization is guaranteed by theory. Mechanistic analysis reveals that the asymmetry is connected to differences in the strength of induction versus anti-induction circuits within pretrained transformers. We validate our findings through practical experiments on real-world tasks demonstrating reliability risks. Our results highlight that pretraining selectively enhances certain transformer capabilities, but does not overcome fundamental length-generalization limits.
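The induction-versus-anti-induction asymmetry described above can be probed with simple synthetic sequences. A toy construction of one probe, under our own naming (the paper's exact task format may differ): the context contains a unique query token, and the target is the token to its right (induction) or left (anti-induction).

```python
import random

def make_retrieval_example(vocab, length, direction, seed=None):
    """Build one retrieval probe (toy sketch). The context contains a
    unique query token 'Q'; the prompt repeats 'Q' at the end, and the
    target is the earlier occurrence's right neighbour (induction) or
    left neighbour (anti-induction)."""
    rng = random.Random(seed)
    tokens = [rng.choice(vocab) for _ in range(length)]
    pos = rng.randrange(1, length - 1)  # keep both neighbours in range
    tokens[pos] = "Q"
    target = tokens[pos + 1] if direction == "induction" else tokens[pos - 1]
    prompt = tokens + ["Q"]  # the model must retrieve from the earlier 'Q'
    return prompt, target

prompt, target = make_retrieval_example(list("abcde"), 10, "induction", seed=0)
```

A model that only forms induction circuits will score well when `direction` is `"induction"` and poorly otherwise, which is the asymmetry the abstract reports.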
- Europe > Austria > Vienna (0.14)
- North America > United States (0.14)
- Europe > Germany > Saarland (0.04)
- (3 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Unfair Learning: GenAI Exceptionalism and Copyright Law
This paper examines fair use legal arguments and eight distinct substantive arguments, contending that every legal and substantive argument favoring fair use for GenAI applies equally, if not more so, to humans. Therefore, granting GenAI exceptional privileges in this domain is legally and logically inconsistent with withholding broad fair use exemptions from individual humans.
- North America > United States > Texas > Travis County > Austin (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Law > Intellectual Property & Technology Law (1.00)
- Government > Regional Government > North America Government > United States Government (1.00)
- Media > Film (0.93)
- Law > Litigation (0.93)
Mimetic Initialization Helps State Space Models Learn to Recall
Trockman, Asher, Harutyunyan, Hrayr, Kolter, J. Zico, Kumar, Sanjiv, Bhojanapalli, Srinadh
Recent work has shown that state space models such as Mamba are significantly worse than Transformers on recall-based tasks because their state size is constant with respect to input sequence length. In practice, however, state space models have fairly large state sizes, and we conjecture that they should be able to perform much better at these tasks than previously reported. We investigate whether their poor copying and recall performance could be due in part to training difficulties rather than fundamental capacity constraints. Based on observations of their "attention" maps, we propose a structured initialization technique that allows state space layers to more readily mimic attention. Across a variety of architecture settings, our initialization makes it substantially easier for Mamba to learn to copy and perform associative recall from scratch.
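The associative-recall task mentioned above can be generated with a few lines of code. A toy construction (our own naming and format, not the paper's benchmark): a sequence of key-value pairs followed by a query key, where the target is that key's value.

```python
import random

def associative_recall_example(keys, values, n_pairs, seed=None):
    """One associative-recall probe (toy sketch): emit key-value pairs
    as a flat token sequence, then append a query key; the target is
    the value originally paired with that key. Keys and values should
    be disjoint vocabularies so the query is unambiguous."""
    rng = random.Random(seed)
    ks = rng.sample(keys, n_pairs)           # unique keys
    pairs = [(k, rng.choice(values)) for k in ks]
    seq = [tok for kv in pairs for tok in kv]
    query, target = rng.choice(pairs)
    return seq + [query], target

prompt, target = associative_recall_example(list("abcdef"), list("123"), 3, seed=0)
```

A model solves the task exactly when it can bind each key to the value that followed it, which is the capacity-versus-trainability question the abstract raises for fixed-size states.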
Language Models "Grok" to Copy
Lv, Ang, Xie, Ruobing, Sun, Xingwu, Kang, Zhanhui, Yan, Rui
We examine the pre-training dynamics of language models, focusing on their ability to copy text from preceding context--a fundamental skill for various LLM applications, including in-context learning (ICL) and retrieval-augmented generation (RAG). We propose a novel perspective that Transformer-based language models develop copying abilities similarly to grokking, which refers to sudden generalization on the test set long after the model has fit the training set. Our experiments yield three observations: (1) the pre-training loss decreases rapidly, while the context copying ability of models initially lags and then abruptly saturates; (2) the speed at which copying ability develops is independent of the number of tokens trained, much as grokking speed is unaffected by dataset size as long as the data distribution is preserved; (3) induction heads, the attention heads responsible for copying, form from shallow to deep layers during training, mirroring the development of circuits in deeper layers during grokking. We contend that the connection between grokking and context copying can provide valuable insights for more effective language model training, ultimately improving in-context performance. For example, we demonstrate that techniques that enhance grokking, such as regularization, either accelerate or enhance the development of context copying.
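The copying rule that induction heads implement can be emulated in one function: emit the token that followed the most recent earlier occurrence of the current token. A hedged sketch of that rule as a reference predictor (our own simplification, not the paper's measurement code):

```python
def induction_predict(tokens):
    """Predict the next token the way an induction head would: scan
    backwards for the most recent earlier occurrence of the last token
    and emit the token that followed it. Returns None when the last
    token has not been seen before (no copy target exists)."""
    last = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == last:
            return tokens[i + 1]
    return None

# In "a b c a b", the previous 'b' was followed by 'c'.
print(induction_predict(list("abcab")))  # 'c'
```

Comparing a model's next-token choices against this reference on repeated-context prompts gives a simple context-copying accuracy of the kind the abstract tracks over pre-training.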
Towards a Cyber Information Ontology
Limbaugh, David, Jensen, Mark, Beverley, John
This paper introduces a set of terms that are intended to act as an interface between cyber ontologies (like a file system ontology or a data fusion ontology) and top- and mid-level ontologies, specifically Basic Formal Ontology and the Common Core Ontologies. These terms center on what makes cyberinformation management unique: numerous acts of copying items of information, the aggregates of copies that result from those acts, and the faithful members of those aggregates that represent all other members.
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Virginia > Fairfax County > Fairfax (0.04)
- North America > United States > New York > Erie County > Buffalo (0.04)
- (2 more...)
CopyBench: Measuring Literal and Non-Literal Reproduction of Copyright-Protected Text in Language Model Generation
Chen, Tong, Asai, Akari, Mireshghallah, Niloofar, Min, Sewon, Grimmelmann, James, Choi, Yejin, Hajishirzi, Hannaneh, Zettlemoyer, Luke, Koh, Pang Wei
Evaluating the degree of reproduction of copyright-protected content by language models (LMs) is of significant interest to the AI and legal communities. Although both literal and non-literal similarities are considered by courts when assessing the degree of reproduction, prior research has focused only on literal similarities. To bridge this gap, we introduce CopyBench, a benchmark designed to measure both literal and non-literal copying in LM generations. Using copyrighted fiction books as text sources, we provide automatic evaluation protocols to assess literal and non-literal copying, balanced against the model utility in terms of the ability to recall facts from the copyrighted works and generate fluent completions. We find that, although literal copying is relatively rare, two types of non-literal copying -- event copying and character copying -- occur even in models as small as 7B parameters. Larger models demonstrate significantly more copying, with literal copying rates increasing from 0.2% to 10.5% and non-literal copying from 2.3% to 6.9% when comparing Llama3-8B and 70B models, respectively. We further evaluate the effectiveness of current strategies for mitigating copying and show that (1) training-time alignment can reduce literal copying but may increase non-literal copying, and (2) current inference-time mitigation methods primarily reduce literal but not non-literal copying.
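One simple proxy for the literal copying the benchmark measures is the longest contiguous character span shared between a generation and its source text. A minimal dynamic-programming sketch (our own proxy; CopyBench's actual evaluation protocol may differ):

```python
def longest_common_substring(a, b):
    """Length of the longest contiguous substring shared by a and b,
    computed with a rolling DP row in O(len(a) * len(b)) time."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0] * (len(b) + 1)
        for j, cb in enumerate(b, 1):
            if ca == cb:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

source = "It was the best of times, it was the worst of times"
gen = "the worst of times indeed"
print(longest_common_substring(source, gen))  # 18 ("the worst of times")
```

Non-literal copying (reused events or characters under different surface wording) is exactly what a metric like this misses, which is why the benchmark evaluates both kinds separately.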
- Asia > Singapore (0.04)
- North America > United States > Alabama (0.04)
- North America > United States > Texas > Travis County > Austin (0.04)
- (4 more...)
- Law > Intellectual Property & Technology Law (1.00)
- Leisure & Entertainment (0.94)
The Flaw That Could Ruin Generative AI
And because an LLM doesn't "know" when it's quoting from training data, there's no obvious way to prevent the behavior. I spoke with Florian Tramèr, a prominent AI-security researcher and co-author of some of the above studies. It's "an extremely tricky problem to study," he told me. "It's very, very hard to pin down a good definition of memorization." One way to understand the concept is to think of an LLM as an enormous decision tree in which each node is an English word. From a given starting word, an LLM chooses the next word from the entire English vocabulary.
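The decision-tree picture in the excerpt can be made concrete with a toy bigram model: at each node (the current word), the model ranks every candidate next word and picks one. This is our own illustrative example, not how an LLM is actually built; real models condition on far more than the previous word.

```python
from collections import defaultdict

# Count word-to-next-word transitions over a tiny corpus.
corpus = "the cat sat on the mat and the cat slept".split()
counts = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_word(word):
    """Greedy choice at one 'node' of the tree: the most frequent
    continuation of `word` in the corpus, or None if unseen."""
    options = counts[word]
    return max(options, key=options.get) if options else None

print(next_word("the"))  # 'cat' -- seen twice, vs 'mat' once
```

Memorization in this picture is a path through the tree that the training data makes so probable that the model reproduces it verbatim.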
- Law > Litigation (1.00)
- Law > Intellectual Property & Technology Law (0.96)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.60)