SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding