Do GFlowNets Transfer? Case Study on the Game of 24/42

Adesh Gupta, Abhinav Kumar, Mansi Gupta, Paras Chopra

arXiv.org Artificial Intelligence 

Generating diverse solutions is key to human-like reasoning, yet autoregressive language models are trained to produce a single accurate response, which limits creativity. Our case study demonstrates the limited zero-shot transferability of GFlowNet fine-tuning: we fine-tune small and medium-sized language models on the Game of 24 and evaluate them on Game of 42 datasets. The results reveal that GFlowNet-fine-tuned models struggle to maintain both solution diversity and accuracy on the transfer task, highlighting key limitations in their cross-task generalization and the need for future research on improved transfer learning capabilities.

Recent advances have introduced approaches that substantially improve LLM reasoning capabilities (Touvron et al., 2023a), including supervised fine-tuning on synthetic datasets (Yu et al.; Yue et al.), modified decoding mechanisms (Holtzman et al.; Nguyen et al., 2024), and enhanced pretraining data quality (Akter et al., 2024; Trinh et al., 2024). While these approaches demonstrate improved accuracy, they rarely account for the diversity of correct solutions, an essential aspect of human-like reasoning and creativity (Yu et al., 2024a; Hu et al.).
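For readers unfamiliar with the benchmark, the following is a minimal sketch of the task, assuming the standard Game of 24 formulation (combine four given integers with +, -, *, / so that the expression evaluates exactly to 24; the Game of 42 variant targets 42 instead). The brute-force verifier and the function name `solvable` below are illustrative only and are not part of the paper's GFlowNet method.

```python
from fractions import Fraction
from itertools import combinations


def solvable(numbers, target):
    """Check whether the given numbers can be combined with +, -, *, /
    (using each number exactly once) to reach `target` exactly."""
    target = Fraction(target)

    def search(vals):
        if len(vals) == 1:
            return vals[0] == target
        # Pick any two remaining values, combine them with every operation,
        # and recurse on the reduced multiset. This covers all expression
        # trees and parenthesizations.
        for i, j in combinations(range(len(vals)), 2):
            a, b = vals[i], vals[j]
            rest = [vals[k] for k in range(len(vals)) if k not in (i, j)]
            candidates = [a + b, a - b, b - a, a * b]
            if b != 0:
                candidates.append(a / b)
            if a != 0:
                candidates.append(b / a)
            if any(search(rest + [c]) for c in candidates):
                return True
        return False

    return search([Fraction(n) for n in numbers])


# Example: 4, 7, 8, 8 is solvable for 24 via (7 - 8/8) * 4 = 24.
print(solvable([4, 7, 8, 8], 24))  # True
print(solvable([1, 1, 1, 1], 42))  # False
```

A puzzle typically admits several distinct correct expressions, which is why the paper measures solution diversity in addition to accuracy when evaluating GFlowNet-fine-tuned models.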