Towards a Realistic Long-Term Benchmark for Open-Web Research Agents
