"Is This It?": Towards Ecologically Valid Benchmarks for Situated Collaboration