A Picture is Worth a Thousand Words: Language Models Plan from Pixels