VLIPP: Towards Physically Plausible Video Generation with Vision and Language Informed Physical Prior