Benchmarking Large Language Models As AI Research Agents
