Advancing General Multimodal Capability of Vision-language Models with Pyramid-descent Visual Position Encoding