Collaborating Authors: Quenum, Jerome


LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning

arXiv.org Artificial Intelligence

Recently, instruction-tuned Large Multimodal Models (LMMs), such as InstructBLIP [1], InstructGPT [2], LLaVA [3, 4], PaLM [5] and others, have demonstrated state-of-the-art performance on a variety of vision-and-language tasks. However, existing LMMs for robotics [6, 7, 8, 9] do not always demonstrate the same success and consistency across embodied settings. This may result from the unique challenges encountered in robotics, such as the variability of real-world environments, the differences between robots, and the need to control actions reliably. Since the success of LMMs stems in part from multimodal instruction tuning, it is natural to leverage this technique in a robotics setting as well. Here, we propose a vision-action instruction tuning method that bridges the gap between a language model's fundamental pre-training objective, next-word prediction, and the goal of enabling the model to handle diverse robotics settings. In this work, we introduce our Large LAnguage model for Robotic Vision and Action (LLARVA), an open-source instruction-tuned LMM for robotic applications that can generalize efficiently across various environments and robotic configurations. Our key idea is the formulation of a novel instruction prompt that encapsulates robot type, task, scene configuration, and control regime in a natural-language prefix amenable to contemporary LMMs.
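To make the idea of such a structured prefix concrete, a minimal sketch is shown below. The field names, template wording, and example values are illustrative assumptions, not LLARVA's actual prompt format:

```python
# Hypothetical sketch of an instruction prefix that encodes robot type, task,
# scene configuration, and control regime as natural language. The template
# and field names are illustrative only, not the paper's actual format.

def build_instruction_prompt(robot: str, task: str, scene: str, control: str) -> str:
    """Compose a natural-language prefix from the four properties
    the abstract says the prompt encapsulates."""
    return (
        f"You are controlling a {robot}. "
        f"Task: {task}. "
        f"Scene: {scene}. "
        f"Control regime: {control}. "
        "Predict the next action."
    )

prompt = build_instruction_prompt(
    robot="7-DoF robot arm",
    task="pick up the red block",
    scene="tabletop with three blocks",
    control="end-effector delta positions",
)
print(prompt)
```

Expressing these properties in a plain-text prefix, rather than as separate conditioning inputs, is what lets an off-the-shelf LMM consume them through its ordinary next-word-prediction interface.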


Lithium Metal Battery Quality Control via Transformer-CNN Segmentation

arXiv.org Artificial Intelligence

Lithium metal batteries (LMBs) have the potential to be the next-generation battery system because of their high theoretical energy density. However, defects known as dendrites are formed by heterogeneous lithium (Li) plating, which hinders the development and utilization of LMBs. Non-destructive techniques to observe the dendrite morphology often use X-ray computed tomography (XCT) to provide cross-sectional views. To retrieve three-dimensional structures inside a battery, image segmentation becomes essential to quantitatively analyze XCT images. This work proposes a new semantic segmentation approach using a transformer-based neural network called TransforCNN that is capable of segmenting out dendrites from XCT data. In addition, we compare the performance of the proposed TransforCNN with three other algorithms, U-Net, Y-Net, and E-Net, an Ensemble Network model for XCT analysis. Our results show the advantages of using TransforCNN when evaluated on segmentation metrics, such as mean Intersection over Union (mIoU) and mean Dice Similarity Coefficient (mDSC), as well as through several qualitative comparative visualizations.
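The two evaluation metrics named above are standard overlap measures for segmentation masks; the sketch below computes them for flat binary masks using toy data (the mask values are illustrative, and the mean variants simply average these scores over classes):

```python
# Minimal sketch of the segmentation metrics mentioned in the abstract:
# Intersection over Union (IoU) and the Dice Similarity Coefficient (DSC),
# here for flat binary masks. mIoU/mDSC average these per-class scores.

def iou(pred, target):
    """IoU = |A ∩ B| / |A ∪ B| for two binary masks of equal length."""
    inter = sum(p & t for p, t in zip(pred, target))
    union = sum(p | t for p, t in zip(pred, target))
    return inter / union if union else 1.0

def dice(pred, target):
    """DSC = 2|A ∩ B| / (|A| + |B|) for two binary masks of equal length."""
    inter = sum(p & t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 2 * inter / total if total else 1.0

pred   = [1, 1, 0, 0, 1]   # toy predicted dendrite mask
target = [1, 0, 0, 1, 1]   # toy ground-truth mask
print(iou(pred, target))   # 2/4 = 0.5
print(dice(pred, target))  # 4/6 ≈ 0.667
```

Dice weights the intersection more heavily than IoU, so it is gentler on small structures such as thin dendrites; reporting both, as the paper does, gives a fuller picture of segmentation quality.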