Towards Efficient Visual-Language Alignment of the Q-Former for Visual Reasoning Tasks