The Jumping Reasoning Curve? Tracking the Evolution of Reasoning Performance in GPT-[n] and o-[n] Models on Multimodal Puzzles
Toh, Vernon Y. H., Chia, Yew Ken, Ghosal, Deepanway, Poria, Soujanya
In our evaluation, we assess the performance of GPT-[n] and o-[n] models on abstract multimodal puzzles from PuzzleVQA, which primarily test abstract reasoning. Additionally, we evaluate the models on AlgoPuzzleVQA, whose puzzles require an algorithmic approach rather than brute-force solving. To ensure a comprehensive evaluation, we present the puzzles in both multiple-choice and open-ended question answering formats. Our findings indicate that, despite their sophisticated capabilities on standard benchmarks, current models still struggle with seemingly simple multimodal puzzles (Figure 3). In contrast to previous benchmarks such as ARC-AGI, we observe a less dramatic reasoning curve, without extreme jumps in performance. This limitation highlights the substantial gap between current artificial intelligence and human-like reasoning abilities. As models continue to rapidly advance and scale, as shown in Figure 1, this benchmark will serve as a critical indicator of progress toward more robust and generalized artificial intelligence. Overall, here are the key findings of our study:
1. Performance steadily improves from GPT-4-Turbo to GPT-4o to o1. While the jump from GPT-4-Turbo to GPT-4o is moderate, the transition from GPT-4o to o1 marks a significant advancement, but it comes at roughly 750x the inference cost.
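To make the two question formats concrete, here is a minimal sketch of how one might pose a PuzzleVQA-style image puzzle to an OpenAI model in both the multiple-choice and open-ended settings. This is not the authors' released evaluation code: the prompt wording, the four-option layout, the image path, and the model names are illustrative assumptions; only the OpenAI chat-completions call with an inline base64 image is real API usage.

```python
# Hypothetical sketch of the two evaluation formats described in the abstract.
# Prompt phrasing, option list, and file names are assumptions, not the paper's code.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_puzzle(image_path: str, question: str, options: list[str] | None = None) -> str:
    """Send one puzzle image plus a question; options=None means open-ended."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    prompt = question
    if options:  # multiple-choice format: append lettered options
        letters = "ABCD"
        prompt += "\n" + "\n".join(f"({letters[i]}) {o}" for i, o in enumerate(options))
        prompt += "\nAnswer with the letter of the correct option."
    response = client.chat.completions.create(
        model="gpt-4o",  # swap in gpt-4-turbo or o1 to compare model generations
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# The same puzzle, framed both ways:
# ask_puzzle("puzzle.png", "What is the missing shape?",
#            ["circle", "square", "triangle", "star"])   # multiple-choice
# ask_puzzle("puzzle.png", "What is the missing shape?")  # open-ended
```

The open-ended variant simply omits the option list, which removes the chance of guessing correctly among four choices and, as the abstract notes, gives a stricter probe of the models' reasoning.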
arXiv.org Artificial Intelligence
Feb-3-2025