How should we evaluate progress in AI?