How to build a better AI benchmark