ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery