A Additional experimental details

Neural Information Processing Systems 

For each function generated, we sample 228 data points, which we split into 100 context points and 128 target points, and train the model using the loss function in (2). Each input x is a 32-dimensional vector, and each dimension is sampled from a uniform distribution U[-3, 3]. We then randomly select 100 samples from the data points with function values lower than the 20th percentile as the few-shot data. We normalize the score to [0, 1] using the worst and the best value in the large dataset.

A.2 ExPT pretraining details

Architectural details. In all experiments, we use the same ExPT architecture.
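As an illustration of the data preparation described earlier in this appendix (sampling 228 points per function, splitting them into context and target sets, drawing few-shot data from below the 20th percentile, and normalizing scores to [0, 1]), the following is a minimal NumPy sketch. It is not the authors' implementation: the function names `sample_function_data`, `select_few_shot`, and `normalize_score` are hypothetical, and the objective `f` stands in for whatever function is generated or queried.

```python
import numpy as np


def sample_function_data(f, rng, n_context=100, n_target=128, dim=32,
                         low=-3.0, high=3.0):
    """Sample inputs from U[low, high]^dim and split into context/target sets.

    `f` is an assumed callable mapping an (n, dim) array to n function values.
    """
    n_total = n_context + n_target  # 228 points with the defaults
    x = rng.uniform(low, high, size=(n_total, dim))
    y = f(x)
    context = (x[:n_context], y[:n_context])
    target = (x[n_context:], y[n_context:])
    return context, target


def select_few_shot(x, y, rng, n_few=100, percentile=20.0):
    """Randomly pick n_few points whose values lie below the given percentile."""
    threshold = np.percentile(y, percentile)
    candidates = np.flatnonzero(y < threshold)
    # Assumption: if fewer candidates than n_few exist, take all of them.
    n_pick = min(n_few, candidates.size)
    idx = rng.choice(candidates, size=n_pick, replace=False)
    return x[idx], y[idx]


def normalize_score(y, y_worst, y_best):
    """Map scores to [0, 1] using the worst and best values of the large dataset."""
    return (y - y_worst) / (y_best - y_worst)
```

In this sketch the few-shot threshold and the normalization bounds are both computed from the same large dataset, which matches the description above; how ties at the percentile boundary are handled is an assumption.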