Common Benchmarks Undervalue the Generalization Power of Programmatic Policies