Towards Reliable Evaluation of Large Language Models for Multilingual and Multimodal E-Commerce Applications