BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning