Composition Vision-Language Understanding via Segment and Depth Anything Model