A Experimental Details

We dynamically batch model calls onto the GPU to increase inference speed (a sketch of this batching scheme is given below). For OBL, there are dependencies between policy and belief training: each belief model is trained on trajectories generated by a policy, and each OBL policy is in turn trained against a learned belief model. The entire inference and training infrastructure for a single policy or belief model runs on a machine with 30 CPU cores and 2 GPUs, one GPU for training and one for simulation. We use the public-LSTM architecture design from prior work, in which a 3-layer feedforward neural network encodes the entire private observation (see the second sketch below).
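The following is a minimal sketch of such a dynamic batcher, assuming a PyTorch model served to many Python-level simulation threads. The `DynamicBatcher` class, its queue-and-timeout flushing policy, and the `max_batch` / `timeout_s` values are illustrative assumptions, not the implementation used in our experiments.

```python
# Hypothetical sketch of dynamic batching for GPU inference; names and
# parameters are illustrative, not the paper's actual infrastructure.
import queue
import threading
import torch

class DynamicBatcher:
    """Collects inference requests from many simulation threads and runs
    them through the model in a single batched forward pass on the GPU."""

    def __init__(self, model, device, max_batch=64, timeout_s=0.001):
        self.model = model.to(device).eval()
        self.device = device
        self.max_batch = max_batch
        self.timeout_s = timeout_s
        self.requests = queue.Queue()

    def infer(self, obs):
        """Called by a simulation thread; blocks until its result is ready."""
        done = threading.Event()
        slot = {}
        self.requests.put((obs, done, slot))
        done.wait()
        return slot["out"]

    def serve_forever(self):
        """Run on a dedicated thread: drain the queue, batch, forward, scatter."""
        while True:
            batch = [self.requests.get()]  # block until at least one request
            # Keep collecting until the batch is full or the queue goes quiet.
            while len(batch) < self.max_batch:
                try:
                    batch.append(self.requests.get(timeout=self.timeout_s))
                except queue.Empty:
                    break
            obs = torch.stack([b[0] for b in batch]).to(self.device)
            with torch.no_grad():
                out = self.model(obs).cpu()
            # Scatter each row of the batched output back to its caller.
            for i, (_, done, slot) in enumerate(batch):
                slot["out"] = out[i]
                done.set()
```

The timeout trades latency for batch size: waiting slightly longer lets more simulation threads join the same forward pass, which is what makes running many games against one shared GPU model efficient.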
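Below is a minimal PyTorch sketch of a public-LSTM-style network under the constraint the design implies: the recurrent state is computed from the public observation only, while the entire private observation passes through the 3-layer feedforward encoder. The hidden sizes, the two-layer LSTM, and the elementwise combination of the two streams are illustrative assumptions rather than the exact configuration.

```python
# Hypothetical sketch of a public-LSTM-style network; layer sizes and the
# elementwise stream combination are assumptions, not the exact architecture.
import torch
import torch.nn as nn

class PublicLSTMNet(nn.Module):
    def __init__(self, priv_dim, publ_dim, hid_dim, num_actions):
        super().__init__()
        # 3-layer feedforward encoder for the entire private observation.
        self.priv_net = nn.Sequential(
            nn.Linear(priv_dim, hid_dim), nn.ReLU(),
            nn.Linear(hid_dim, hid_dim), nn.ReLU(),
            nn.Linear(hid_dim, hid_dim), nn.ReLU(),
        )
        # Recurrent encoder over the public observation stream only, so the
        # LSTM hidden state never depends on private information.
        self.publ_net = nn.Linear(publ_dim, hid_dim)
        self.lstm = nn.LSTM(hid_dim, hid_dim, num_layers=2)
        self.head = nn.Linear(hid_dim, num_actions)

    def forward(self, priv_obs, publ_obs, hid=None):
        # priv_obs, publ_obs: [seq_len, batch, feature_dim]
        p = self.priv_net(priv_obs)
        x, hid = self.lstm(self.publ_net(publ_obs), hid)
        # Combine the public (recurrent) and private (feedforward) streams;
        # the elementwise product here is one plausible choice.
        return self.head(x * p), hid
```

Keeping the recurrent state a function of public information alone is the point of the public-LSTM design; the private stream then modulates the public features at each step without contaminating the carried-over hidden state.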