OpenAI's o3 and o4-mini hallucinate far more than previous models

Mashable 

By OpenAI's own testing, its newest reasoning models, o3 and o4-mini, hallucinate at significantly higher rates than o1. As first reported by TechCrunch, OpenAI's system card detailed results from PersonQA, an evaluation designed to measure hallucinations. On this evaluation, o3's hallucination rate is 33 percent and o4-mini's is 48 percent, meaning it hallucinated almost half the time. By comparison, o1's hallucination rate is 16 percent, so o3 hallucinated about twice as often. The system card noted that o3 "tends to make more claims overall, leading to more accurate claims as well as more inaccurate/hallucinated claims." But OpenAI doesn't know the underlying cause, saying simply, "More research is needed to understand the cause of this result."