Sohl-Dickstein, Jascha
Scaling Exponents Across Parameterizations and Optimizers
Everett, Katie, Xiao, Lechao, Wortsman, Mitchell, Alemi, Alexander A., Novak, Roman, Liu, Peter J., Gur, Izzeddin, Sohl-Dickstein, Jascha, Kaelbling, Leslie Pack, Lee, Jaehoon, Pennington, Jeffrey
Robust and effective scaling of models from small to large width typically requires the precise adjustment of many algorithmic and architectural details, such as parameterization and optimizer choices. In this work, we propose a new perspective on parameterization by investigating a key assumption in prior work about the alignment between parameters and data and derive new theoretical results under weaker assumptions and a broader set of optimizers. Our extensive empirical investigation includes tens of thousands of models trained with all combinations of three optimizers, four parameterizations, several alignment assumptions, more than a dozen learning rates, and fourteen model sizes up to 26.8B parameters. We find that the best learning rate scaling prescription would often have been excluded by the assumptions in prior work. Our results show that all parameterizations, not just maximal update parameterization (muP), can achieve hyperparameter transfer; moreover, our novel per-layer learning rate prescription for standard parameterization outperforms muP. Finally, we demonstrate that an overlooked aspect of parameterization, the epsilon parameter in Adam, must be scaled correctly to avoid gradient underflow and propose Adam-atan2, a new numerically stable, scale-invariant version of Adam that eliminates the epsilon hyperparameter entirely.
The boundary of neural network trainability is fractal
Sohl-Dickstein, Jascha
Some fractals -- for instance those associated with the Mandelbrot and quadratic Julia sets -- are computed by iterating a function, and identifying the boundary between hyperparameters for which the resulting series diverges or remains bounded. Neural network training similarly involves iterating an update function (e.g. repeated steps of gradient descent), can result in convergent or divergent behavior, and can be extremely sensitive to small changes in hyperparameters. Motivated by these similarities, we experimentally examine the boundary between neural network hyperparameters that lead to stable and divergent training. We find that this boundary is fractal over more than ten decades of scale in all tested configurations.
Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models
Singh, Avi, Co-Reyes, John D., Agarwal, Rishabh, Anand, Ankesh, Patil, Piyush, Garcia, Xavier, Liu, Peter J., Harrison, James, Lee, Jaehoon, Xu, Kelvin, Parisi, Aaron, Kumar, Abhishek, Alemi, Alex, Rizkowsky, Alex, Nova, Azade, Adlam, Ben, Bohnet, Bernd, Elsayed, Gamaleldin, Sedghi, Hanie, Mordatch, Igor, Simpson, Isabelle, Gur, Izzeddin, Snoek, Jasper, Pennington, Jeffrey, Hron, Jiri, Kenealy, Kathleen, Swersky, Kevin, Mahajan, Kshiteej, Culp, Laura, Xiao, Lechao, Bileschi, Maxwell L., Constant, Noah, Novak, Roman, Liu, Rosanne, Warkentin, Tris, Qian, Yundi, Bansal, Yamini, Dyer, Ethan, Neyshabur, Behnam, Sohl-Dickstein, Jascha, Fiedel, Noah
Fine-tuning language models~(LMs) on human-generated data remains a prevalent practice. However, the performance of such models is often limited by the quantity and diversity of high-quality human data. In this paper, we explore whether we can go beyond human data on tasks where we have access to scalar feedback, for example, on math problems where one can verify correctness. To do so, we investigate a simple self-training method based on expectation-maximization, which we call ReST$^{EM}$, where we (1) generate samples from the model and filter them using binary feedback, (2) fine-tune the model on these samples, and (3) repeat this process a few times. Testing on advanced MATH reasoning and APPS coding benchmarks using PaLM-2 models, we find that ReST$^{EM}$ scales favorably with model size and significantly surpasses fine-tuning only on human data. Overall, our findings suggest self-training with feedback can substantially reduce dependence on human-generated data.
Variance-Reduced Gradient Estimation via Noise-Reuse in Online Evolution Strategies
Li, Oscar, Harrison, James, Sohl-Dickstein, Jascha, Smith, Virginia, Metz, Luke
Unrolled computation graphs are prevalent throughout machine learning but present challenges to automatic differentiation (AD) gradient estimation methods when their loss functions exhibit extreme local sensitivtiy, discontinuity, or blackbox characteristics. In such scenarios, online evolution strategies methods are a more capable alternative, while being more parallelizable than vanilla evolution strategies (ES) by interleaving partial unrolls and gradient updates. In this work, we propose a general class of unbiased online evolution strategies methods. We analytically and empirically characterize the variance of this class of gradient estimators and identify the one with the least variance, which we term Noise-Reuse Evolution Strategies (NRES). Experimentally, we show NRES results in faster convergence than existing AD and ES methods in terms of wall-clock time and number of unroll steps across a variety of applications, including learning dynamical systems, meta-training learned optimizers, and reinforcement learning.
Reduce, Reuse, Recycle: Compositional Generation with Energy-Based Diffusion Models and MCMC
Du, Yilun, Durkan, Conor, Strudel, Robin, Tenenbaum, Joshua B., Dieleman, Sander, Fergus, Rob, Sohl-Dickstein, Jascha, Doucet, Arnaud, Grathwohl, Will
Since their introduction, diffusion models have quickly become the prevailing approach to generative modeling in many domains. They can be interpreted as learning the gradients of a time-varying sequence of log-probability density functions. This interpretation has motivated classifier-based and classifier-free guidance as methods for post-hoc control of diffusion models. In this work, we build upon these ideas using the score-based interpretation of diffusion models, and explore alternative ways to condition, modify, and reuse diffusion models for tasks involving compositional generation and guidance. In particular, we investigate why certain types of composition fail using current techniques and present a number of solutions. We conclude that the sampler (not the model) is responsible for this failure and propose new samplers, inspired by MCMC, which enable successful compositional generation. Further, we propose an energy-based parameterization of diffusion models which enables the use of new compositional operators and more sophisticated, Metropolis-corrected samplers. Intriguingly we find these samplers lead to notable improvements in compositional generation across a wide set of problems such as classifier-guided ImageNet modeling and compositional text-to-image generation.
Frontier Language Models are not Robust to Adversarial Arithmetic, or "What do I need to say so you agree 2+2=5?
Freeman, C. Daniel, Culp, Laura, Parisi, Aaron, Bileschi, Maxwell L, Elsayed, Gamaleldin F, Rizkowsky, Alex, Simpson, Isabelle, Alemi, Alex, Nova, Azade, Adlam, Ben, Bohnet, Bernd, Mishra, Gaurav, Sedghi, Hanie, Mordatch, Igor, Gur, Izzeddin, Lee, Jaehoon, Co-Reyes, JD, Pennington, Jeffrey, Xu, Kelvin, Swersky, Kevin, Mahajan, Kshiteej, Xiao, Lechao, Liu, Rosanne, Kornblith, Simon, Constant, Noah, Liu, Peter J., Novak, Roman, Qian, Yundi, Fiedel, Noah, Sohl-Dickstein, Jascha
We introduce and study the problem of adversarial arithmetic, which provides a simple yet challenging testbed for language model alignment. This problem is comprised of arithmetic questions posed in natural language, with an arbitrary adversarial string inserted before the question is complete. Even in the simple setting of 1-digit addition problems, it is easy to find adversarial prompts that make all tested models (including PaLM2, GPT4, Claude2) misbehave, and even to steer models to a particular wrong answer. We additionally provide a simple algorithm for finding successful attacks by querying those same models, which we name "prompt inversion rejection sampling" (PIRS). We finally show that models can be partially hardened against these attacks via reinforcement learning and via agentic constitutional loops. However, we were not able to make a language model fully robust against adversarial arithmetic attacks.
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
Srivastava, Aarohi, Rastogi, Abhinav, Rao, Abhishek, Shoeb, Abu Awal Md, Abid, Abubakar, Fisch, Adam, Brown, Adam R., Santoro, Adam, Gupta, Aditya, Garriga-Alonso, Adriร , Kluska, Agnieszka, Lewkowycz, Aitor, Agarwal, Akshat, Power, Alethea, Ray, Alex, Warstadt, Alex, Kocurek, Alexander W., Safaya, Ali, Tazarv, Ali, Xiang, Alice, Parrish, Alicia, Nie, Allen, Hussain, Aman, Askell, Amanda, Dsouza, Amanda, Slone, Ambrose, Rahane, Ameet, Iyer, Anantharaman S., Andreassen, Anders, Madotto, Andrea, Santilli, Andrea, Stuhlmรผller, Andreas, Dai, Andrew, La, Andrew, Lampinen, Andrew, Zou, Andy, Jiang, Angela, Chen, Angelica, Vuong, Anh, Gupta, Animesh, Gottardi, Anna, Norelli, Antonio, Venkatesh, Anu, Gholamidavoodi, Arash, Tabassum, Arfa, Menezes, Arul, Kirubarajan, Arun, Mullokandov, Asher, Sabharwal, Ashish, Herrick, Austin, Efrat, Avia, Erdem, Aykut, Karakaล, Ayla, Roberts, B. Ryan, Loe, Bao Sheng, Zoph, Barret, Bojanowski, Bartลomiej, รzyurt, Batuhan, Hedayatnia, Behnam, Neyshabur, Behnam, Inden, Benjamin, Stein, Benno, Ekmekci, Berk, Lin, Bill Yuchen, Howald, Blake, Orinion, Bryan, Diao, Cameron, Dour, Cameron, Stinson, Catherine, Argueta, Cedrick, Ramรญrez, Cรฉsar Ferri, Singh, Chandan, Rathkopf, Charles, Meng, Chenlin, Baral, Chitta, Wu, Chiyu, Callison-Burch, Chris, Waites, Chris, Voigt, Christian, Manning, Christopher D., Potts, Christopher, Ramirez, Cindy, Rivera, Clara E., Siro, Clemencia, Raffel, Colin, Ashcraft, Courtney, Garbacea, Cristina, Sileo, Damien, Garrette, Dan, Hendrycks, Dan, Kilman, Dan, Roth, Dan, Freeman, Daniel, Khashabi, Daniel, Levy, Daniel, Gonzรกlez, Daniel Moseguรญ, Perszyk, Danielle, Hernandez, Danny, Chen, Danqi, Ippolito, Daphne, Gilboa, Dar, Dohan, David, Drakard, David, Jurgens, David, Datta, Debajyoti, Ganguli, Deep, Emelin, Denis, Kleyko, Denis, Yuret, Deniz, Chen, Derek, Tam, Derek, Hupkes, Dieuwke, Misra, Diganta, Buzan, Dilyar, Mollo, Dimitri Coelho, Yang, Diyi, Lee, Dong-Ho, Schrader, Dylan, Shutova, Ekaterina, Cubuk, Ekin Dogus, Segal, Elad, Hagerman, Eleanor, Barnes, Elizabeth, Donoway, Elizabeth, Pavlick, Ellie, Rodola, Emanuele, Lam, Emma, Chu, Eric, Tang, Eric, Erdem, Erkut, Chang, Ernie, Chi, Ethan A., Dyer, Ethan, Jerzak, Ethan, Kim, Ethan, Manyasi, Eunice Engefu, Zheltonozhskii, Evgenii, Xia, Fanyue, Siar, Fatemeh, Martรญnez-Plumed, Fernando, Happรฉ, Francesca, Chollet, Francois, Rong, Frieda, Mishra, Gaurav, Winata, Genta Indra, de Melo, Gerard, Kruszewski, Germรกn, Parascandolo, Giambattista, Mariani, Giorgio, Wang, Gloria, Jaimovitch-Lรณpez, Gonzalo, Betz, Gregor, Gur-Ari, Guy, Galijasevic, Hana, Kim, Hannah, Rashkin, Hannah, Hajishirzi, Hannaneh, Mehta, Harsh, Bogar, Hayden, Shevlin, Henry, Schรผtze, Hinrich, Yakura, Hiromu, Zhang, Hongming, Wong, Hugh Mee, Ng, Ian, Noble, Isaac, Jumelet, Jaap, Geissinger, Jack, Kernion, Jackson, Hilton, Jacob, Lee, Jaehoon, Fisac, Jaime Fernรกndez, Simon, James B., Koppel, James, Zheng, James, Zou, James, Kocoล, Jan, Thompson, Jana, Wingfield, Janelle, Kaplan, Jared, Radom, Jarema, Sohl-Dickstein, Jascha, Phang, Jason, Wei, Jason, Yosinski, Jason, Novikova, Jekaterina, Bosscher, Jelle, Marsh, Jennifer, Kim, Jeremy, Taal, Jeroen, Engel, Jesse, Alabi, Jesujoba, Xu, Jiacheng, Song, Jiaming, Tang, Jillian, Waweru, Joan, Burden, John, Miller, John, Balis, John U., Batchelder, Jonathan, Berant, Jonathan, Frohberg, Jรถrg, Rozen, Jos, Hernandez-Orallo, Jose, Boudeman, Joseph, Guerr, Joseph, Jones, Joseph, Tenenbaum, Joshua B., Rule, Joshua S., Chua, Joyce, Kanclerz, Kamil, Livescu, Karen, Krauth, Karl, Gopalakrishnan, Karthik, Ignatyeva, Katerina, Markert, Katja, Dhole, Kaustubh D., Gimpel, Kevin, Omondi, Kevin, Mathewson, Kory, Chiafullo, Kristen, Shkaruta, Ksenia, Shridhar, Kumar, McDonell, Kyle, Richardson, Kyle, Reynolds, Laria, Gao, Leo, Zhang, Li, Dugan, Liam, Qin, Lianhui, Contreras-Ochando, Lidia, Morency, Louis-Philippe, Moschella, Luca, Lam, Lucas, Noble, Lucy, Schmidt, Ludwig, He, Luheng, Colรณn, Luis Oliveros, Metz, Luke, ลenel, Lรผtfi Kerem, Bosma, Maarten, Sap, Maarten, ter Hoeve, Maartje, Farooqi, Maheen, Faruqui, Manaal, Mazeika, Mantas, Baturan, Marco, Marelli, Marco, Maru, Marco, Quintana, Maria Jose Ramรญrez, Tolkiehn, Marie, Giulianelli, Mario, Lewis, Martha, Potthast, Martin, Leavitt, Matthew L., Hagen, Matthias, Schubert, Mรกtyรกs, Baitemirova, Medina Orduna, Arnaud, Melody, McElrath, Melvin, Yee, Michael A., Cohen, Michael, Gu, Michael, Ivanitskiy, Michael, Starritt, Michael, Strube, Michael, Swฤdrowski, Michaล, Bevilacqua, Michele, Yasunaga, Michihiro, Kale, Mihir, Cain, Mike, Xu, Mimee, Suzgun, Mirac, Walker, Mitch, Tiwari, Mo, Bansal, Mohit, Aminnaseri, Moin, Geva, Mor, Gheini, Mozhdeh, T, Mukund Varma, Peng, Nanyun, Chi, Nathan A., Lee, Nayeon, Krakover, Neta Gur-Ari, Cameron, Nicholas, Roberts, Nicholas, Doiron, Nick, Martinez, Nicole, Nangia, Nikita, Deckers, Niklas, Muennighoff, Niklas, Keskar, Nitish Shirish, Iyer, Niveditha S., Constant, Noah, Fiedel, Noah, Wen, Nuan, Zhang, Oliver, Agha, Omar, Elbaghdadi, Omar, Levy, Omer, Evans, Owain, Casares, Pablo Antonio Moreno, Doshi, Parth, Fung, Pascale, Liang, Paul Pu, Vicol, Paul, Alipoormolabashi, Pegah, Liao, Peiyuan, Liang, Percy, Chang, Peter, Eckersley, Peter, Htut, Phu Mon, Hwang, Pinyu, Miลkowski, Piotr, Patil, Piyush, Pezeshkpour, Pouya, Oli, Priti, Mei, Qiaozhu, Lyu, Qing, Chen, Qinlang, Banjade, Rabin, Rudolph, Rachel Etta, Gabriel, Raefer, Habacker, Rahel, Risco, Ramon, Milliรจre, Raphaรซl, Garg, Rhythm, Barnes, Richard, Saurous, Rif A., Arakawa, Riku, Raymaekers, Robbe, Frank, Robert, Sikand, Rohan, Novak, Roman, Sitelew, Roman, LeBras, Ronan, Liu, Rosanne, Jacobs, Rowan, Zhang, Rui, Salakhutdinov, Ruslan, Chi, Ryan, Lee, Ryan, Stovall, Ryan, Teehan, Ryan, Yang, Rylan, Singh, Sahib, Mohammad, Saif M., Anand, Sajant, Dillavou, Sam, Shleifer, Sam, Wiseman, Sam, Gruetter, Samuel, Bowman, Samuel R., Schoenholz, Samuel S., Han, Sanghyun, Kwatra, Sanjeev, Rous, Sarah A., Ghazarian, Sarik, Ghosh, Sayan, Casey, Sean, Bischoff, Sebastian, Gehrmann, Sebastian, Schuster, Sebastian, Sadeghi, Sepideh, Hamdan, Shadi, Zhou, Sharon, Srivastava, Shashank, Shi, Sherry, Singh, Shikhar, Asaadi, Shima, Gu, Shixiang Shane, Pachchigar, Shubh, Toshniwal, Shubham, Upadhyay, Shyam, Shyamolima, null, Debnath, null, Shakeri, Siamak, Thormeyer, Simon, Melzi, Simone, Reddy, Siva, Makini, Sneha Priscilla, Lee, Soo-Hwan, Torene, Spencer, Hatwar, Sriharsha, Dehaene, Stanislas, Divic, Stefan, Ermon, Stefano, Biderman, Stella, Lin, Stephanie, Prasad, Stephen, Piantadosi, Steven T., Shieber, Stuart M., Misherghi, Summer, Kiritchenko, Svetlana, Mishra, Swaroop, Linzen, Tal, Schuster, Tal, Li, Tao, Yu, Tao, Ali, Tariq, Hashimoto, Tatsu, Wu, Te-Lin, Desbordes, Thรฉo, Rothschild, Theodore, Phan, Thomas, Wang, Tianle, Nkinyili, Tiberius, Schick, Timo, Kornev, Timofei, Tunduny, Titus, Gerstenberg, Tobias, Chang, Trenton, Neeraj, Trishala, Khot, Tushar, Shultz, Tyler, Shaham, Uri, Misra, Vedant, Demberg, Vera, Nyamai, Victoria, Raunak, Vikas, Ramasesh, Vinay, Prabhu, Vinay Uday, Padmakumar, Vishakh, Srikumar, Vivek, Fedus, William, Saunders, William, Zhang, William, Vossen, Wout, Ren, Xiang, Tong, Xiaoyu, Zhao, Xinran, Wu, Xinyi, Shen, Xudong, Yaghoobzadeh, Yadollah, Lakretz, Yair, Song, Yangqiu, Bahri, Yasaman, Choi, Yejin, Yang, Yichi, Hao, Yiding, Chen, Yifu, Belinkov, Yonatan, Hou, Yu, Hou, Yufang, Bai, Yuntao, Seid, Zachary, Zhao, Zhuoye, Wang, Zijian, Wang, Zijie J., Wang, Zirui, Wu, Ziyi
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.
General-Purpose In-Context Learning by Meta-Learning Transformers
Kirsch, Louis, Harrison, James, Sohl-Dickstein, Jascha, Metz, Luke
Modern machine learning requires system designers to specify aspects of the learning pipeline, such as losses, architectures, and optimizers. Meta-learning, or learning-to-learn, instead aims to learn those aspects, and promises to unlock greater capabilities with less manual effort. One particularly ambitious goal of meta-learning is to train general-purpose in-context learning algorithms from scratch, using only black-box models with minimal inductive bias. Such a model takes in training data, and produces test-set predictions across a wide range of problems, without any explicit definition of an inference model, training loss, or optimization algorithm. In this paper we show that Transformers and other blackbox models can be meta-trained to act as general-purpose in-context learners. We characterize transitions between algorithms that generalize, algorithms that memorize, and algorithms that fail to meta-train at all, induced by changes in model size, number of tasks, and meta-optimization. We further show that the capabilities of meta-trained algorithms are bottlenecked by the accessible state size (memory) determining the next prediction, unlike standard models which are thought to be bottlenecked by parameter count. Finally, we propose practical interventions such as biasing the training distribution that improve the meta-training and metageneralization of general-purpose in-context learning algorithms. Meta-learning is the process of automatically discovering new learning algorithms instead of designing them manually (Schmidhuber, 1987). An important quality of human-engineered learning algorithms, such as backpropagation and gradient descent, is their applicability to a wide range of tasks or environments. For learning-to-learn to exceed those capabilities, the meta-learned learning algorithms must be similarily general-purpose. Recently, there has been significant progress toward this goal (Kirsch et al., 2019; Oh et al., 2020). The improved generality of the discovered learning algorithms has been achieved by introducing inductive bias, such as by bottlenecking the architecture or by hiding information, which encourage learning over memorization. Methods include restricting learning rules to use gradients (Metz et al., 2019; Kirsch et al., 2019; Oh et al., 2020), symbolic graphs (Real et al., 2020; Co-Reyes et al., 2021), or parameter sharing (Kirsch & Schmidhuber, 2020; Kirsch et al., 2021). While enabling generalization, these inductive biases come at the cost of increasing the effort to design these systems and potentially restrict the space of discoverable learning algorithms. Instead, we seek to explore general-purpose meta-learning systems with minimal inductive bias.
VeLO: Training Versatile Learned Optimizers by Scaling Up
Metz, Luke, Harrison, James, Freeman, C. Daniel, Merchant, Amil, Beyer, Lucas, Bradbury, James, Agrawal, Naman, Poole, Ben, Mordatch, Igor, Roberts, Adam, Sohl-Dickstein, Jascha
While deep learning models have replaced hand-designed features across many domains, these models are still trained with hand-designed optimizers. In this work, we leverage the same scaling approach behind the success of deep learning to learn versatile optimizers. We train an optimizer for deep learning which is itself a small neural network that ingests gradients and outputs parameter updates. Meta-trained with approximately four thousand TPU-months of compute on a wide variety of optimization tasks, our optimizer not only exhibits compelling performance, but optimizes in interesting and unexpected ways. It requires no hyperparameter tuning, instead automatically adapting to the specifics of the problem being optimized. We open source our learned optimizer, meta-training code, the associated train and test data, and an extensive optimizer benchmark suite with baselines at velo-code.github.io.
Fast Finite Width Neural Tangent Kernel
Novak, Roman, Sohl-Dickstein, Jascha, Schoenholz, Samuel S.
The Neural Tangent Kernel (NTK), defined as $\Theta_\theta^f(x_1, x_2) = \left[\partial f(\theta, x_1)\big/\partial \theta\right] \left[\partial f(\theta, x_2)\big/\partial \theta\right]^T$ where $\left[\partial f(\theta, \cdot)\big/\partial \theta\right]$ is a neural network (NN) Jacobian, has emerged as a central object of study in deep learning. In the infinite width limit, the NTK can sometimes be computed analytically and is useful for understanding training and generalization of NN architectures. At finite widths, the NTK is also used to better initialize NNs, compare the conditioning across models, perform architecture search, and do meta-learning. Unfortunately, the finite width NTK is notoriously expensive to compute, which severely limits its practical utility. We perform the first in-depth analysis of the compute and memory requirements for NTK computation in finite width networks. Leveraging the structure of neural networks, we further propose two novel algorithms that change the exponent of the compute and memory requirements of the finite width NTK, dramatically improving efficiency. Our algorithms can be applied in a black box fashion to any differentiable function, including those implementing neural networks. We open-source our implementations within the Neural Tangents package (arXiv:1912.02803) at https://github.com/google/neural-tangents.