Leveraging Large (Visual) Language Models for Robot 3D Scene Understanding