The Download: how your data is being used to train AI, and why chatbots aren't doctors

Jul-21-2025, 12:10:00 GMT–MIT Technology Review

Millions of images of passports, credit cards, birth certificates, and other documents containing personally identifiable information are likely included in one of the biggest open-source AI training sets, new research has found. Thousands of images, including identifiable faces, were found in a small subset of DataComp CommonPool, a major AI training set for image generation scraped from the web. Because the researchers audited just 0.1% of CommonPool's data, they estimate that the real number of images containing personally identifiable information, including faces and identity documents, is in the hundreds of millions. Anything you put online can be and probably has been scraped.

AI companies have stopped warning you that their chatbots aren't doctors

AI companies have now mostly abandoned the once-standard practice of including medical disclaimers and warnings in response to health questions, new research has found.

News Web Page

Industry:
- Health & Medicine (0.41)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning (0.88)
  - Natural Language > Chatbot (0.65)