Measuring short-form factuality in large language models