Measuring AI Ability to Complete Long Tasks