Navigating Privacy and Copyright Challenges Across the Data Lifecycle of Generative AI

Zhang, Dawen, Xia, Boming, Liu, Yue, Xu, Xiwei, Hoang, Thong, Xing, Zhenchang, Staples, Mark, Lu, Qinghua, Zhu, Liming

arXiv.org Artificial Intelligence 

The internet has enabled an unprecedented free flow and wide distribution of information on a global scale, which has greatly accelerated the democratization of information and fueled platforms like Wikipedia, YouTube, and StackOverflow. At the same time, it has lowered the barriers to unauthorized data use and piracy. The success of Deep Learning (DL) owes much to the availability of large-scale datasets for training DL models [3], predominantly sourced from the internet [4].