Kohima
Learning to Plan for Language Modeling from Unlabeled Data
Cornille, Nathan, Moens, Marie-Francine, Mai, Florian
By training to predict the next token in an unlabeled corpus, large language models learn to perform many tasks without any labeled data. However, their next-token-prediction objective arguably limits their performance in scenarios that require planning, such as writing a coherent article. In this paper, we train a module for planning the future writing process via a self-supervised learning objective. By conditioning on generated latent plans, our model extends the successful language model formula to more abstract planning in an unsupervised way. Empirically, we demonstrate that our method improves language modeling performance in general, particularly with respect to the text structure. Because our framework uses a planner module that is unsupervised and external to the language model, new planner modules can be trained at large scale and easily be shared with the community.
Detecting Pretraining Data from Large Language Models
Shi, Weijia, Ajith, Anirudh, Xia, Mengzhou, Huang, Yangsibo, Liu, Daogao, Blevins, Terra, Chen, Danqi, Zettlemoyer, Luke
Although large language models (LLMs) are widely deployed, the data used to train them is rarely disclosed. Given the incredible scale of this data, up to trillions of tokens, it is all but certain that it includes potentially problematic text such as copyrighted materials, personally identifiable information, and test data for widely reported reference benchmarks. However, we currently have no way to know which data of these types is included or in what proportions. In this paper, we study the pretraining data detection problem: given a piece of text and black-box access to an LLM without knowing the pretraining data, can we determine if the model was trained on the provided text? To facilitate this study, we introduce a dynamic benchmark WIKIMIA that uses data created before and after model training to support gold truth detection. We also introduce a new detection method Min-K% Prob based on a simple hypothesis: an unseen example is likely to contain a few outlier words with low probabilities under the LLM, while a seen example is less likely to have words with such low probabilities. Min-K% Prob can be applied without any knowledge about the pretraining corpus or any additional training, departing from previous detection methods that require training a reference model on data that is similar to the pretraining data. Moreover, our experiments demonstrate that Min-K% Prob achieves a 7.4% improvement on WIKIMIA over these previous methods. We apply Min-K% Prob to three real-world scenarios, copyrighted book detection, contaminated downstream example detection and privacy auditing of machine unlearning, and find it a consistently effective solution.
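The hypothesis behind Min-K% Prob can be illustrated with a small scoring function. This is a minimal sketch, not the authors' implementation: it assumes per-token log-probabilities have already been obtained from the LLM, and it assumes the score is the average log-probability of the lowest k% of tokens, an aggregation the abstract does not spell out. The function name `min_k_percent_score` and the parameter `k` are hypothetical.

```python
import math
from typing import Sequence


def min_k_percent_score(token_logprobs: Sequence[float], k: float = 0.2) -> float:
    """Average log-probability of the k fraction of tokens the model finds least likely.

    Assumption: averaging the lowest-k% log-probs is one plausible reading of the
    method; the abstract only states the outlier-token hypothesis.
    """
    if not token_logprobs:
        raise ValueError("token_logprobs must be non-empty")
    if not 0.0 < k <= 1.0:
        raise ValueError("k must be in (0, 1]")
    # Keep at least one token so short texts still receive a score.
    n_lowest = max(1, math.ceil(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n_lowest]
    return sum(lowest) / n_lowest


# Toy inputs (made-up numbers, not real model outputs).
seen_like = [-0.1] * 10                    # no low-probability outliers
unseen_like = [-0.1] * 8 + [-9.0, -7.0]    # two outlier tokens

seen_score = min_k_percent_score(seen_like, k=0.2)
unseen_score = min_k_percent_score(unseen_like, k=0.2)
```

Under this sketch, a higher score suggests the text is more likely to have been seen during pretraining; any decision threshold would have to be chosen empirically, for example on a benchmark such as WIKIMIA.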
Best of the web: Artificial Intelligence news for October 22, 2016
With Stephen Hawking opening an AI lab, it's only a matter of time before smart robots take over for humans in the factory, on the battlefield, in the supermarket, and behind the counter. There's an old Chinese saying: "If you want to do anything good, easy and fast, you need connections," said Nancy Yang, a spokesperson for the fourth annual Seattle Biz-Tech Summit, meeting today in Bellevue, Wash., outside Seattle. "Here in the U.S., we use email and messaging, but the Chinese way, and really for many Asians, is to meet face to face." Tanvi Lad shook off her first-game loss, pulled off a rare victory over Rituparna Das in a three-set match, and entered the women's singles final of the Manorama-Indian Open National-ranking badminton tournament here on Saturday. Stephen Hawking, the famous scientist who once said intelligent machines could be mankind's biggest threat, opened an artificial intelligence lab in Britain this week to help develop robot surgeons and Terminator-style military droids.