A Critical Review of Causal Reasoning Benchmarks for Large Language Models