licensing
Professional PDF Solutions for Teams: Scale Without Breaking Your Budget
Cut complexity, control costs, and boost productivity with powerful PDF and eSign solutions. Discover how AI-powered PDF solutions help teams reduce costs, automate workflows, and scale document management with predictable pricing and flexible licensing. Why do so many PDF editing and eSignature tools fail to scale across teams? The good news is that even teams that process a high volume of documents can reduce their costs and get more value from their PDF solutions by selecting solutions with built-in AI, predictable pricing, flexible licensing, and scalable, automated document workflows. Many PDF tools were designed for individuals or small teams rather than scaling teams.
No Language Data Left Behind: A Comparative Study of CJK Language Datasets in the Hugging Face Ecosystem
Choi, Dasol, Park, Woomyoung, Song, Youngsook
Recent advances in Natural Language Processing (NLP) have underscored the crucial role of high-quality datasets in building large language models (LLMs). However, while extensive resources and analyses exist for English, the landscape for East Asian languages - particularly Chinese, Japanese, and Korean (CJK) - remains fragmented and underexplored, despite these languages together serving over 1.6 billion speakers. To address this gap, we investigate the HuggingFace ecosystem from a cross-linguistic perspective, focusing on how cultural norms, research environments, and institutional practices shape dataset availability and quality. Drawing on more than 3,300 datasets, we employ quantitative and qualitative methods to examine how these factors drive distinct creation and curation patterns across Chinese, Japanese, and Korean NLP communities. Our findings highlight the large-scale and often institution-driven nature of Chinese datasets, grassroots community-led development in Korean NLP, and an entertainment- and subculture-focused emphasis on Japanese collections. By uncovering these patterns, we reveal practical strategies for enhancing dataset documentation, licensing clarity, and cross-lingual resource sharing - ultimately guiding more effective and culturally attuned LLM development in East Asia. We conclude by discussing best practices for future dataset curation and collaboration, aiming to strengthen resource development across all three languages.
Scott Farquhar thinks Australia should let AI train for free on creative content. He overlooks one key point
Farquhar, the Tech Council of Australia CEO, told ABC's 7.30 program on Tuesday: "all AI usage of mining or searching or going across data is probably illegal under Australian law and I think that hurts a lot of investment of these companies in Australia". Farquhar's claim overlooks that this is not a settled issue in the US, and could have devastating effects on creative industries. Farquhar's argument is that it is not theft of people's work unless the AI is used to "copy an artist directly" such as creating a song in their style. "I do think people would say that, hey, if people are going to sit down with a digital companion, an AI song creator and they collaboratively work with an AI to create something new to the world, that's probably fair use." Farquhar said the benefits of large language models outweigh the issues raised by AI training its data on other people's work for free.
Towards Best Practices for Open Datasets for LLM Training
Baack, Stefan, Biderman, Stella, Odrozek, Kasia, Skowron, Aviya, Bdeir, Ayah, Bommarito, Jillian, Ding, Jennifer, Gahntz, Maximilian, Keller, Paul, Langlais, Pierre-Carl, Lindahl, Greg, Majstorovic, Sebastian, Marda, Nik, Penedo, Guilherme, Van Segbroeck, Maarten, Wang, Jennifer, von Werra, Leandro, Baker, Mitchell, Beliรฃo, Julie, Chmielinski, Kasia, Fadaee, Marzieh, Gutermuth, Lisa, Kydlรญฤek, Hynek, Leppert, Greg, Lewis-Jong, EM, Larsen, Solana, Longpre, Shayne, Lungati, Angela Oduor, Miller, Cullen, Miller, Victor, Ryabinin, Max, Siminyu, Kathleen, Strait, Andrew, Surman, Mark, Tumadรณttir, Anna, Weber, Maurice, Weiss, Rebecca, White, Lee, Wolf, Thomas
Many AI companies are training their large language models (LLMs) on data without the permission of the copyright owners. The permissibility of doing so varies by jurisdiction: in countries like the EU and Japan, this is allowed under certain restrictions, while in the United States, the legal landscape is more ambiguous. Regardless of the legal status, concerns from creative producers have led to several high-profile copyright lawsuits, and the threat of litigation is commonly cited as a reason for the recent trend towards minimizing the information shared about training datasets by both corporate and public interest actors. This trend in limiting data information causes harm by hindering transparency, accountability, and innovation in the broader ecosystem by denying researchers, auditors, and impacted individuals access to the information needed to understand AI models. While this could be mitigated by training language models on open access and public domain data, at the time of writing, there are no such models (trained at a meaningful scale) due to the substantial technical and sociological challenges in assembling the necessary corpus. These challenges include incomplete and unreliable metadata, the cost and complexity of digitizing physical records, and the diverse set of legal and technical skills required to ensure relevance and responsibility in a quickly changing landscape. Building towards a future where AI systems can be trained on openly licensed data that is responsibly curated and governed requires collaboration across legal, technical, and policy domains, along with investments in metadata standards, digitization, and fostering a culture of openness.
"They've Stolen My GPL-Licensed Model!": Toward Standardized and Transparent Model Licensing
Duan, Moming, Zhao, Rui, Jiang, Linshan, Shadbolt, Nigel, He, Bingsheng
As model parameter sizes reach the billion-level range and their training consumes zettaFLOPs of computation, components reuse and collaborative development are become increasingly prevalent in the Machine Learning (ML) community. These components, including models, software, and datasets, may originate from various sources and be published under different licenses, which govern the use and distribution of licensed works and their derivatives. However, commonly chosen licenses, such as GPL and Apache, are software-specific and are not clearly defined or bounded in the context of model publishing. Meanwhile, the reused components may also have free-content licenses and model licenses, which pose a potential risk of license noncompliance and rights infringement within the model production workflow. In this paper, we propose addressing the above challenges along two lines: 1) For license analysis, we have developed a new vocabulary for ML workflow management and encoded license rules to enable ontological reasoning for analyzing rights granting and compliance issues. 2) For standardized model publishing, we have drafted a set of model licenses that provide flexible options to meet the diverse needs of model publishing. Our analysis tool is built on Turtle language and Notation3 reasoning engine, envisioned as a first step toward Linked Open Model Production Data. We have also encoded our proposed model licenses into rules and demonstrated the effects of GPL and other commonly used licenses in model publishing, along with the flexibility advantages of our licenses, through comparisons and experiments.
This Startup Wants YouTube Creators to Get Paid for AI Training Data
So far, when AI companies have trained on YouTube's invaluable stash of videos, captions, and other content, they've done so without permission. An AI-focused content licensing startup called Calliope Networks is hoping to change that with its new "License to Scrape," a program aimed directly at YouTube stars. "There's obvious demand from AI companies to scrape YouTube content. We see that by their actions. So what we're trying to do is to create a tool that makes it legal and simple for them," says Calliope Networks CEO Dave Davis.
A New Group Is Trying to Make AI Data Licensing Ethical
The first wave of major generative AI tools largely were trained on "publicly available" data--basically, anything and everything that could be scraped from the internet. Now, sources of training data are increasingly restricting access and pushing for licensing agreements. With the hunt for additional data sources intensifying, new licensing startups have emerged to keep the source material flowing. The Dataset Providers Alliance, a trade group formed this summer, wants to make the AI industry more standardized and fair. To that end, it has just released a position paper outlining its stances on major AI-related issues.
Journalists Had 'No Idea' About OpenAI's Deal to Use Their Stories
Last week, OpenAI and the German media conglomerate Axel Springer signed a multi-year licensing agreement. It allows OpenAI to incorporate articles from Axel Springerโowned outlets like Business Insider and Politico into its products, including ChatGPT. Although the deal centers on using journalistic work, reporters whose stories will be shared as part of the agreement were not consulted about the deal beforehand. Four Business Insider employees told WIRED that they found out about the AI deal at the same time it was announced publicly. PEN Guild, the US union which represents around 280 workers at Politico and E&E News, another Axel Springer publication, says it was "not consulted or informed about the decision to have robots summarize our work."
DP-TBART: A Transformer-based Autoregressive Model for Differentially Private Tabular Data Generation
Castellon, Rodrigo, Gopal, Achintya, Bloniarz, Brian, Rosenberg, David
The generation of synthetic tabular data that preserves differential privacy is a problem of growing importance. While traditional marginal-based methods have achieved impressive results, recent work has shown that deep learning-based approaches tend to lag behind. In this work, we present Differentially-Private TaBular AutoRegressive Transformer (DP-TBART), a transformer-based autoregressive model that maintains differential privacy and achieves performance competitive with marginal-based methods on a wide variety of datasets, capable of even outperforming state-of-the-art methods in certain settings. We also provide a theoretical framework for understanding the limitations of marginal-based approaches and where deep learning-based approaches stand to contribute most. These results suggest that deep learning-based techniques should be considered as a viable alternative to marginal-based methods in the generation of differentially private synthetic tabular data.
Whose Text Is It Anyway? Exploring BigCode, Intellectual Property, and Ethics
Choksi, Madiha Zahrah, Goedicke, David
Intelligent or generative writing tools rely on large language models that recognize, summarize, translate, and predict content. This position paper probes the copyright interests of open data sets used to train large language models (LLMs). Our paper asks, how do LLMs trained on open data sets circumvent the copyright interests of the used data? We start by defining software copyright and tracing its history. We rely on GitHub Copilot as a modern case study challenging software copyright. Our conclusion outlines obstacles that generative writing assistants create for copyright, and offers a practical road map for copyright analysis for developers, software law experts, and general users to consider in the context of intelligent LLM-powered writing tools.