Comparing Apples to Oranges: LLM-powered Multimodal Intention Prediction in an Object Categorization Task