Baidu Baike
How Censorship Can Influence Artificial Intelligence
Artificial intelligence is hardly confined by international borders, as businesses, universities, and governments tap a global pool of ideas, algorithms, and talent. Yet the AI programs that result from this global gold rush can still reflect deep cultural divides. New research shows how government censorship affects AI algorithms--and can influence the applications built with those algorithms. Margaret Roberts, a political science professor at UC San Diego, and Eddie Yang, a PhD student there, examined AI language algorithms trained on two sources: the Chinese-language version of Wikipedia, which is blocked within China; and Baidu Baike, a similar site operated by China's dominant search engine, Baidu, that is subject to government censorship. Baidu did not respond to a request for comment.
Censorship of Online Encyclopedias: Implications for NLP Models
Yang, Eddie, Roberts, Margaret E.
While artificial intelligence provides the backbone for many tools people use around the world, recent work has brought to attention that the algorithms powering AI are not free of politics, stereotypes, and bias. While most work in this area has focused on the ways in which AI can exacerbate existing inequalities and discrimination, very little work has studied how governments actively shape training data. We describe how censorship has affected the development of Wikipedia corpuses, text data which are regularly used for pre-trained inputs into NLP algorithms. We show that word embeddings ...
NLP impacts how firms provide products to users, content individuals receive through search and social media, and how individuals interact with news and emails. Despite the growing importance of NLP algorithms in shaping our lives, recently scholars, policymakers, and the business community have raised the alarm of how gender and racial biases may be baked into these algorithms. Because they are trained on human data, the algorithms themselves can replicate implicit and explicit human biases and aggravate discrimination [6, 8, 39]. Additionally, training data that over-represents a subset of the population may do a worse job at predicting outcomes for other groups in the population [13].