Generating Benchmarks for Factuality Evaluation of Language Models