Evaluating Language-Model Agents on Realistic Autonomous Tasks