LongComp: Long-Tail Compositional Zero-Shot Generalization for Robust Trajectory Prediction

Benjamin Stoler, Jonathan Francis, Jean Oh

arXiv.org Artificial Intelligence 

Next, we train autoencoders for the ego and social vectors separately. We further split by object type and train independent models for each type, allowing distinct latent spaces to be learned for, e.g., pedestrian focal agents versus vehicle focal agents. Each autoencoder consists of a simple encoder and decoder multi-layer perceptron (MLP), with layer normalization and dropout on hidden layers; the encoder maps inputs down to a low-dimensional latent space and the decoder maps back to the original feature space. That is, we compute z = Enc(v) and v̂ = Dec(z). We train the models primarily with a mean squared error (MSE) reconstruction loss between v and v̂, along with a deep embedding clustering (DEC) [43] loss as regularization on the latent z values. We then obtain discrete ego and social contexts by clustering within the latent spaces captured by these autoencoders, using k-means with k = 11. We use the Waymo Open Motion Dataset (WOMD) [15] as a representative source of AD scenarios, sampling approximately 20% of the total data. To quantitatively assess cluster and latent-space coherence, we compute silhouette scores [44] on held-out sets, observing values ranging from 0.31 to 0.50, which indicates a reasonably well-structured space. We also visualize UMAP [41] projections of the resulting spaces in Figure 2, showing clear separation and evidence of potential sub-clusters.
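The autoencoder and its training objective can be sketched as follows in PyTorch. The layer sizes, latent dimension, dropout rate, and DEC loss weight below are illustrative assumptions, not values from the paper; the DEC regularizer follows the standard Student's-t soft-assignment formulation of Xie et al.

```python
import torch
import torch.nn as nn

class ContextAutoencoder(nn.Module):
    """One per-object-type autoencoder: MLP encoder/decoder with layer
    normalization and dropout on hidden layers (sizes are assumptions)."""

    def __init__(self, in_dim: int, latent_dim: int = 8,
                 hidden: int = 64, p_drop: float = 0.1):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.LayerNorm(hidden),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(hidden, latent_dim),   # z = Enc(v)
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden),
            nn.LayerNorm(hidden),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(hidden, in_dim),       # v_hat = Dec(z)
        )

    def forward(self, v: torch.Tensor):
        z = self.encoder(v)
        return self.decoder(z), z


def dec_loss(z: torch.Tensor, centroids: torch.Tensor,
             alpha: float = 1.0) -> torch.Tensor:
    """DEC-style KL(P || Q) regularizer on latent codes: Q is a Student's-t
    soft assignment of z to centroids, P is the sharpened target."""
    d2 = torch.cdist(z, centroids).pow(2)
    q = (1.0 + d2 / alpha).pow(-(alpha + 1.0) / 2.0)
    q = q / q.sum(dim=1, keepdim=True)
    f = q.sum(dim=0)                          # soft cluster frequencies
    p = q.pow(2) / f
    p = p / p.sum(dim=1, keepdim=True)
    return (p * (p.log() - q.log())).sum(dim=1).mean()


# One training step: MSE reconstruction plus a small DEC term
# (the 0.1 weight is an assumed value for illustration).
model = ContextAutoencoder(in_dim=32)
centroids = torch.randn(11, 8)                # k = 11 clusters, latent dim 8
v = torch.randn(256, 32)
v_hat, z = model(v)
loss = nn.functional.mse_loss(v_hat, v) + 0.1 * dec_loss(z, centroids)
loss.backward()
```

In practice the centroids would be initialized from a k-means pass over the latents and updated jointly with the network, as in the original DEC procedure.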
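The subsequent clustering and evaluation step amounts to running k-means over the latent codes (k = 11 in our setting) and scoring the partition with silhouette coefficients. The following self-contained NumPy sketch illustrates this on a toy two-blob latent space; the deterministic spread initialization is a simplification standing in for a library implementation such as scikit-learn's.

```python
import numpy as np

def kmeans(X: np.ndarray, k: int, iters: int = 50) -> np.ndarray:
    """Plain Lloyd's k-means; returns one cluster label per row of X.
    Deterministic spread init for illustration (real code: k-means++)."""
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(iters):
        # assign each point to its nearest center, then recompute centers
        labels = np.argmin(
            np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def silhouette(X: np.ndarray, labels: np.ndarray) -> float:
    """Mean silhouette coefficient: (b - a) / max(a, b) per point, where a is
    the mean intra-cluster distance and b the mean distance to the nearest
    other cluster. Values near 1 indicate well-separated clusters."""
    D = np.linalg.norm(X[:, None] - X[None], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False
        if not same.any():
            continue  # singleton cluster: silhouette undefined, skip
        a = D[i, same].mean()
        b = min(D[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Toy latent space: two tight, well-separated blobs.
rng = np.random.default_rng(1)
Z = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
labels = kmeans(Z, k=2)
score = silhouette(Z, labels)  # close to 1 for well-separated clusters
```

On the real latent spaces, scores in the observed 0.31 to 0.50 range are expected, since AD scenario contexts overlap far more than this toy example.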