Evaluating Probabilistic Inference in Deep Learning: Beyond Marginal Predictions