Multi-modal Auto-regressive Modeling via Visual Words