Learning to Tune Like an Expert: Interpretable and Scene-Aware Navigation via MLLM Reasoning and CVAE-Based Adaptation

Yanbo Wang, Zipeng Fang, Lei Zhao, Weidong Chen

arXiv.org Artificial Intelligence 

--Service robots are increasingly deployed in diverse and dynamic environments, where both physical layouts and social contexts change over time and across locations. In these unstructured settings, conventional navigation systems that rely on fixed parameters often fail to generalize across scenarios, resulting in degraded performance and reduced social acceptance. Although recent approaches have leveraged reinforcement learning to enhance traditional planners, these methods often fail in real-world deployments due to poor generalization and limited simulation diversity, which hampers effective sim-to-real transfer. To tackle these issues, we present LE-Nav, an interpretable and scene-aware navigation framework that leverages multi-modal large language model (MLLM) reasoning and conditional variational autoencoders (CVAEs) to adaptively tune planner hyperparameters. To achieve zero-shot scene understanding, we use one-shot exemplars and chain-of-thought prompting strategies. Experiments show that LE-Nav generates hyperparameters that match human-level tuning across diverse planners and scenarios. Real-world navigation trials and a user study on a smart wheelchair platform demonstrate that it outperforms state-of-the-art methods on quantitative metrics such as success rate, efficiency, safety, and comfort, while receiving higher subjective scores for perceived safety and social acceptance.

Note to Practitioners--Service robots often suffer degraded performance from traditional local planners as environmental conditions change during navigation. This work investigates automatic hyperparameter tuning for planners such as DWA and TEB, and our framework LE-Nav can be used to adjust the hyperparameters of any optimization-based planner. Existing navigation frameworks are typically either end-to-end, lacking safety guarantees, or rely on reinforcement-learning-based tuning with limited generalization. By designing two prompting strategies, we enable the MLLM to generate stable and accurate scene descriptions. We use a conditional variational autoencoder to learn human expert tuning strategies, enhanced with data augmentation and attention masking to handle the MLLM packet loss that is inevitable in real applications. Decoupling the MLLM from the action module improves decision transparency, making it clear how scene analysis informs navigation behavior. Experiments demonstrate that our method adaptively generates hyperparameters comparable to those of human experts, while remaining robust to packet loss and compatible with various MLLMs. Future work includes enhancing real-time scene understanding with advanced MLLMs, expanding support to more planners with personalized tuning, and extending the framework to collaborative multi-robot systems.
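The core inference step described above (a CVAE decoder that maps a scene description to planner hyperparameters, with masking of any MLLM fields lost in transit) can be sketched as follows. This is a minimal illustration, not the paper's actual architecture: the dimensions, hyperparameter ranges, and randomly initialized decoder weights are all hypothetical stand-ins for a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: an 8-dim scene descriptor from the MLLM,
# a 4-dim CVAE latent, and 5 planner hyperparameters
# (e.g. max velocity, inflation radius, goal tolerance, ...).
SCENE_DIM, LATENT_DIM, N_PARAMS = 8, 4, 5

# Random weights stand in for a trained CVAE decoder.
W = rng.normal(size=(N_PARAMS, SCENE_DIM + LATENT_DIM))
b = np.zeros(N_PARAMS)

# Illustrative valid range for each hyperparameter.
lo = np.array([0.2, 0.1, 0.3, 0.05, 0.5])
hi = np.array([1.2, 0.6, 1.0, 0.30, 2.0])

def decode(scene, mask, z):
    """Decode planner hyperparameters from a possibly incomplete scene vector.

    `mask` is 1 where an MLLM field arrived and 0 where it was lost;
    zeroing lost fields mimics the masking-based handling of packet loss.
    """
    cond = scene * mask                       # drop lost fields
    h = np.tanh(W @ np.concatenate([cond, z]) + b)
    u = 0.5 * (h + 1.0)                       # squash into [0, 1]
    return lo + u * (hi - lo)                 # scale into valid ranges

# Usage: two scene fields were lost, yet decoding still yields
# a full, in-range hyperparameter vector.
scene = rng.normal(size=SCENE_DIM)
mask = np.array([1, 1, 1, 0, 1, 1, 0, 1], dtype=float)
params = decode(scene, mask, rng.normal(size=LATENT_DIM))
```

Because the latent `z` is sampled, repeated decoding yields a distribution of expert-like tunings rather than a single point estimate, which is what lets the CVAE capture the variability of human tuning strategies.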