Towards a Realistic Long-Term Benchmark for Open-Web Research Agents