Breakpoint: Scalable evaluation of system-level reasoning in LLM code agents