Benchmarking Large Language Models As AI Research Agents