Browse and Concentrate: Comprehending Multimodal Content via Prior-LLM Context Fusion