

Fresh concerns raised over sources of training material for AI systems

The Guardian

Fresh fears have been raised about the training material used for some of the largest and most powerful artificial intelligence models, after several investigations exposed the fascist, pirated and malicious sources from which the data is harvested.

One such dataset is the Colossal Clean Crawled Corpus, or C4, assembled by Google from more than 15m websites and used to train both Google's LaMDA AI and Meta's GPT competitor, LLaMA. The dataset is public, but its scale has made it difficult to examine its contents: it is supposedly a "clean" version of a more expansive dataset, Common Crawl, with "noisy" content, offensive language and racist slurs removed from the material.

But an investigation by the Washington Post reveals that C4's "cleanliness" is only skin deep. While it draws on websites such as the Guardian (which makes up 0.05% of the entire dataset) and Wikipedia, as well as large databases such as Google Patents and the scientific journal hub PLOS, it also contains less reputable sites: the white nationalist site VDARE is among the 1,000 largest sites in the dataset, as is the far-right news site Breitbart.


Fresh concerns about AI bias in the age of COVID-19

#artificialintelligence

Businesses facing unprecedented demands during the coronavirus pandemic have boosted their use of artificial intelligence in some of society's most sensitive areas.

Why it matters: Algorithms and the data they rely on are prone to automating preexisting biases, and are more likely to do so when they're rushed into the field without careful testing and review.

The big picture: Beyond these examples, experts worry that the economy's sudden halt has driven resource-strapped companies and institutions to rely increasingly on algorithms to make decisions in housing, credit, employment and other areas.

Between the lines: If you are going to use AI to make meaningful decisions, experts recommend making sure a diverse group of people is involved in reviewing everything from the algorithm design to the training data to the way the system will be deployed and evaluated.

Yes, but: Hollister notes that adding humans to the mix isn't a cure-all, either, given that humans have plenty of bias as well.