Pandora: Towards General World Model with Natural Language Actions and Video States