Breakpoint: Scalable evaluation of system-level reasoning in LLM code agents
