Finally, Section E illustrates qualitative results. We present the encoder-decoder variant of HAMT used in fine-tuning on the right of Figure 1. Compared to the original cross-modal transformer on the left, this variant removes the text-to-vision cross-modal attention. The encoder encodes the instruction texts to obtain textual embeddings, which remain fixed in the decoder. The original target location is viewed as an intermediate stop point.
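To make the architectural difference concrete, the following is a minimal sketch of one decoder layer in the encoder-decoder variant: vision/history tokens attend to pre-encoded, fixed textual embeddings, and there is no symmetric text-to-vision attention path. All module and parameter names here are hypothetical illustrations, not the paper's implementation.

```python
import torch
import torch.nn as nn


class DecoderLayerSketch(nn.Module):
    """Hypothetical sketch of the encoder-decoder variant's decoder layer.

    Only vision-to-text cross-attention is applied; the textual embeddings
    produced by the encoder are never updated (no text-to-vision branch).
    """

    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim)
        )
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, vis: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        # Self-attention over vision/history tokens.
        v, _ = self.self_attn(vis, vis, vis)
        vis = self.norm1(vis + v)
        # Vision queries attend to the fixed textual embeddings (keys/values);
        # `txt` is read-only, so text-to-vision attention is absent by design.
        v, _ = self.cross_attn(vis, txt, txt)
        vis = self.norm2(vis + v)
        # Position-wise feed-forward network.
        return self.norm3(vis + self.ffn(vis))


# Toy shapes: batch of 2, 10 text tokens, 5 vision/history tokens, dim 64.
txt = torch.randn(2, 10, 64)  # textual embeddings from the encoder
vis = torch.randn(2, 5, 64)   # vision/history tokens
out = DecoderLayerSketch()(vis, txt)
print(tuple(out.shape))
```

The key design point this illustrates is asymmetry: because text is encoded once and reused as keys/values only, the decoder is cheaper than a fully symmetric cross-modal transformer, at the cost of text representations that cannot adapt to the visual context.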