From ViT Features to Training-free Video Object Segmentation via Streaming-data Mixture Models

Neural Information Processing Systems 

In the task of semi-supervised video object segmentation, the input is the binary mask of an object in the first frame, and the desired output consists of the corresponding masks of that object in the subsequent frames.