Public Domain 12M: A Highly Aesthetic Image-Text Dataset with Novel Governance Mechanisms
Meyer, Jordan; Padgett, Nick; Miller, Cullen; Exline, Laura
arXiv.org Artificial Intelligence
We present Public Domain 12M (PD12M), a dataset of 12.4 million high-quality public domain and CC0-licensed images with synthetic captions, designed for training text-to-image models. PD12M is the largest public domain image-text dataset to date, with sufficient size to train foundation models while minimizing copyright concerns. Through the Source.Plus platform, we also introduce novel, community-driven dataset governance mechanisms that reduce harm and support reproducibility over time.
Oct-30-2024
- Country:
  - Europe (0.28)
- Genre:
  - Research Report (0.40)
- Industry:
  - Information Technology (0.69)
  - Law (1.00)
- Technology:
  - Information Technology > Artificial Intelligence
    - Machine Learning (1.00)
    - Natural Language (1.00)
    - Vision (1.00)