Automatically Benchmarking LLM Code Agents through Agent-Driven Annotation and Evaluation