Prompting is not a substitute for probability measurements in large language models