Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers