Post-mortem on a deep learning contest: a Simpson's paradox and the complementary roles of scale metrics versus shape metrics