Towards Label-free Scene Understanding by Vision Foundation Models