AITopics | licensing



Professional PDF Solutions for Teams: Scale Without Breaking Your Budget

PCWorld | May-13-2026, 19:24:59 GMT

Cut complexity, control costs, and boost productivity with powerful PDF and eSign solutions. Discover how AI-powered PDF solutions help teams reduce costs, automate workflows, and scale document management with predictable pricing and flexible licensing. Why do so many PDF editing and eSignature tools fail to scale across teams? The good news is that even teams that process a high volume of documents can reduce their costs and get more value from their PDF solutions by selecting solutions with built-in AI, predictable pricing, flexible licensing, and scalable, automated document workflows. Many PDF tools were designed for individuals or small teams rather than scaling teams.

artificial intelligence, buyer, natural language, (12 more...)

PCWorld

Industry:

Information Technology > Security & Privacy (1.00)
Leisure & Entertainment > Games > Computer Games (0.54)

Technology: Information Technology > Artificial Intelligence > Natural Language (0.30)


No Language Data Left Behind: A Comparative Study of CJK Language Datasets in the Hugging Face Ecosystem

Choi, Dasol, Park, Woomyoung, Song, Youngsook

arXiv.org Artificial Intelligence | Oct-16-2025

Recent advances in Natural Language Processing (NLP) have underscored the crucial role of high-quality datasets in building large language models (LLMs). However, while extensive resources and analyses exist for English, the landscape for East Asian languages - particularly Chinese, Japanese, and Korean (CJK) - remains fragmented and underexplored, despite these languages together serving over 1.6 billion speakers. To address this gap, we investigate the Hugging Face ecosystem from a cross-linguistic perspective, focusing on how cultural norms, research environments, and institutional practices shape dataset availability and quality. Drawing on more than 3,300 datasets, we employ quantitative and qualitative methods to examine how these factors drive distinct creation and curation patterns across Chinese, Japanese, and Korean NLP communities. Our findings highlight the large-scale and often institution-driven nature of Chinese datasets, grassroots community-led development in Korean NLP, and an entertainment- and subculture-focused emphasis in Japanese collections. By uncovering these patterns, we reveal practical strategies for enhancing dataset documentation, licensing clarity, and cross-lingual resource sharing - ultimately guiding more effective and culturally attuned LLM development in East Asia. We conclude by discussing best practices for future dataset curation and collaboration, aiming to strengthen resource development across all three languages.

artificial intelligence, large language model, natural language, (16 more...)

arXiv.org Artificial Intelligence

2507.04329

Country:

North America > United States > Minnesota (0.28)
Asia > East Asia (0.24)

Genre: Research Report > New Finding (0.34)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.89)


Scott Farquhar thinks Australia should let AI train for free on creative content. He overlooks one key point

The Guardian | Aug-14-2025

Farquhar, the Tech Council of Australia CEO, told ABC's 7.30 program on Tuesday: "all AI usage of mining or searching or going across data is probably illegal under Australian law and I think that hurts a lot of investment of these companies in Australia". Farquhar's claim overlooks that this is not a settled issue in the US, and could have devastating effects on creative industries. Farquhar's argument is that it is not theft of people's work unless the AI is used to "copy an artist directly" such as creating a song in their style. "I do think people would say that, hey, if people are going to sit down with a digital companion, an AI song creator and they collaboratively work with an AI to create something new to the world, that's probably fair use." Farquhar said the benefits of large language models outweigh the issues raised by AI training its data on other people's work for free.

australia, fair use, farquhar, (16 more...)

The Guardian

Country:

Oceania > Australia (1.00)
North America > United States (0.75)

Industry:

Law > Intellectual Property & Technology Law (0.90)
Government > Regional Government > North America Government > United States Government (0.34)

Technology: Information Technology > Artificial Intelligence > Natural Language (1.00)


Towards Best Practices for Open Datasets for LLM Training

Baack, Stefan, Biderman, Stella, Odrozek, Kasia, Skowron, Aviya, Bdeir, Ayah, Bommarito, Jillian, Ding, Jennifer, Gahntz, Maximilian, Keller, Paul, Langlais, Pierre-Carl, Lindahl, Greg, Majstorovic, Sebastian, Marda, Nik, Penedo, Guilherme, Van Segbroeck, Maarten, Wang, Jennifer, von Werra, Leandro, Baker, Mitchell, Belião, Julie, Chmielinski, Kasia, Fadaee, Marzieh, Gutermuth, Lisa, Kydlíček, Hynek, Leppert, Greg, Lewis-Jong, EM, Larsen, Solana, Longpre, Shayne, Lungati, Angela Oduor, Miller, Cullen, Miller, Victor, Ryabinin, Max, Siminyu, Kathleen, Strait, Andrew, Surman, Mark, Tumadóttir, Anna, Weber, Maurice, Weiss, Rebecca, White, Lee, Wolf, Thomas

arXiv.org Artificial Intelligence | Jan-14-2025

Many AI companies are training their large language models (LLMs) on data without the permission of the copyright owners. The permissibility of doing so varies by jurisdiction: in jurisdictions such as the EU and Japan, this is allowed under certain restrictions, while in the United States, the legal landscape is more ambiguous. Regardless of the legal status, concerns from creative producers have led to several high-profile copyright lawsuits, and the threat of litigation is commonly cited as a reason for the recent trend towards minimizing the information shared about training datasets by both corporate and public interest actors. This trend in limiting data information causes harm by hindering transparency, accountability, and innovation in the broader ecosystem by denying researchers, auditors, and impacted individuals access to the information needed to understand AI models. While this could be mitigated by training language models on open access and public domain data, at the time of writing, there are no such models (trained at a meaningful scale) due to the substantial technical and sociological challenges in assembling the necessary corpus. These challenges include incomplete and unreliable metadata, the cost and complexity of digitizing physical records, and the diverse set of legal and technical skills required to ensure relevance and responsibility in a quickly changing landscape. Building towards a future where AI systems can be trained on openly licensed data that is responsibly curated and governed requires collaboration across legal, technical, and policy domains, along with investments in metadata standards, digitization, and fostering a culture of openness.

dataset, license, public domain, (17 more...)

arXiv.org Artificial Intelligence

2501.08365

Country:

Asia > Japan (0.24)
North America > United States > New York (0.04)
Europe > France (0.04)

Genre: Research Report (0.81)

Industry:

Law > Intellectual Property & Technology Law (1.00)
Government > Regional Government > North America Government > United States Government (1.00)
Law > Litigation (0.88)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)


"They've Stolen My GPL-Licensed Model!": Toward Standardized and Transparent Model Licensing

Duan, Moming, Zhao, Rui, Jiang, Linshan, Shadbolt, Nigel, He, Bingsheng

arXiv.org Artificial Intelligence | Dec-16-2024

As model parameter sizes reach the billion-level range and their training consumes zettaFLOPs of computation, component reuse and collaborative development are becoming increasingly prevalent in the Machine Learning (ML) community. These components, including models, software, and datasets, may originate from various sources and be published under different licenses, which govern the use and distribution of licensed works and their derivatives. However, commonly chosen licenses, such as GPL and Apache, are software-specific and are not clearly defined or bounded in the context of model publishing. Meanwhile, the reused components may also have free-content licenses and model licenses, which pose a potential risk of license noncompliance and rights infringement within the model production workflow. In this paper, we propose addressing the above challenges along two lines: 1) For license analysis, we have developed a new vocabulary for ML workflow management and encoded license rules to enable ontological reasoning for analyzing rights granting and compliance issues. 2) For standardized model publishing, we have drafted a set of model licenses that provide flexible options to meet the diverse needs of model publishing. Our analysis tool is built on the Turtle language and a Notation3 reasoning engine, envisioned as a first step toward Linked Open Model Production Data. We have also encoded our proposed model licenses into rules and demonstrated the effects of GPL and other commonly used licenses in model publishing, along with the flexibility advantages of our licenses, through comparisons and experiments.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2412.11483

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
Europe > Finland (0.04)
North America > United States > New Jersey (0.04)

Genre: Research Report (0.64)

Industry: Law > Intellectual Property & Technology Law (1.00)

Technology:

Information Technology > Communications > Web (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Ontologies (0.90)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)


This Startup Wants YouTube Creators to Get Paid for AI Training Data

WIRED | Sep-30-2024, 15:15:00 GMT

So far, when AI companies have trained on YouTube's invaluable stash of videos, captions, and other content, they've done so without permission. An AI-focused content licensing startup called Calliope Networks is hoping to change that with its new "License to Scrape," a program aimed directly at YouTube stars. "There's obvious demand from AI companies to scrape YouTube content. We see that by their actions. So what we're trying to do is to create a tool that makes it legal and simple for them," says Calliope Networks CEO Dave Davis.

ai training data, calliope, youtube content, (11 more...)

WIRED

Industry: Media > News (0.33)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence (1.00)


A New Group Is Trying to Make AI Data Licensing Ethical

WIRED | Sep-4-2024, 11:00:00 GMT

The first wave of major generative AI tools were largely trained on "publicly available" data: basically, anything and everything that could be scraped from the internet. Now, sources of training data are increasingly restricting access and pushing for licensing agreements. With the hunt for additional data sources intensifying, new licensing startups have emerged to keep the source material flowing. The Dataset Providers Alliance, a trade group formed this summer, wants to make the AI industry more standardized and fair. To that end, it has just released a position paper outlining its stances on major AI-related issues.

artificial intelligence, machine learning, natural language, (8 more...)

WIRED

Technology:

Information Technology > Artificial Intelligence > Natural Language > Generation (0.64)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.64)


Journalists Had 'No Idea' About OpenAI's Deal to Use Their Stories

WIRED | Dec-21-2023, 14:00:00 GMT

Last week, OpenAI and the German media conglomerate Axel Springer signed a multi-year licensing agreement. It allows OpenAI to incorporate articles from Axel Springer–owned outlets like Business Insider and Politico into its products, including ChatGPT. Although the deal centers on using journalistic work, reporters whose stories will be shared as part of the agreement were not consulted about the deal beforehand. Four Business Insider employees told WIRED that they found out about the AI deal at the same time it was announced publicly. PEN Guild, the US union which represents around 280 workers at Politico and E&E News, another Axel Springer publication, says it was "not consulted or informed about the decision to have robots summarize our work."

agreement, journalist, openai, (6 more...)

WIRED

Industry: Media > News (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.99)


DP-TBART: A Transformer-based Autoregressive Model for Differentially Private Tabular Data Generation

Castellon, Rodrigo, Gopal, Achintya, Bloniarz, Brian, Rosenberg, David

arXiv.org Artificial Intelligence | Jul-19-2023

The generation of synthetic tabular data that preserves differential privacy is a problem of growing importance. While traditional marginal-based methods have achieved impressive results, recent work has shown that deep learning-based approaches tend to lag behind. In this work, we present Differentially-Private TaBular AutoRegressive Transformer (DP-TBART), a transformer-based autoregressive model that maintains differential privacy and achieves performance competitive with marginal-based methods on a wide variety of datasets, capable of even outperforming state-of-the-art methods in certain settings. We also provide a theoretical framework for understanding the limitations of marginal-based approaches and where deep learning-based approaches stand to contribute most. These results suggest that deep learning-based techniques should be considered as a viable alternative to marginal-based methods in the generation of differentially private synthetic tabular data.

arxiv preprint arxiv, dataset, marginal-based method, (12 more...)

arXiv.org Artificial Intelligence

2307.10430

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > Washington > King County (0.04)
North America > United States > New York > New York County > New York City (0.04)

Genre: Research Report > New Finding (0.88)

Industry:

Information Technology > Security & Privacy (1.00)
Banking & Finance (0.93)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


Whose Text Is It Anyway? Exploring BigCode, Intellectual Property, and Ethics

Choksi, Madiha Zahrah, Goedicke, David

arXiv.org Artificial Intelligence | Apr-5-2023

Intelligent or generative writing tools rely on large language models that recognize, summarize, translate, and predict content. This position paper probes the copyright interests of open data sets used to train large language models (LLMs). Our paper asks, how do LLMs trained on open data sets circumvent the copyright interests of the used data? We start by defining software copyright and tracing its history. We rely on GitHub Copilot as a modern case study challenging software copyright. Our conclusion outlines obstacles that generative writing assistants create for copyright, and offers a practical road map for copyright analysis for developers, software law experts, and general users to consider in the context of intelligent LLM-powered writing tools.

artificial intelligence, large language model, natural language, (17 more...)

arXiv.org Artificial Intelligence

2304.02839

Country: North America > United States > New York > New York County > New York City (0.06)

Genre: Research Report (0.70)

Industry: Law > Intellectual Property & Technology Law (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
