TOIST: TaskOrientedInstanceSegmentation TransformerwithNoun-PronounDistillation SupplementaryMaterial
–Neural Information Processing Systems
As mentioned in Section 3(formulation) of the main paper, in an input image, it is possible that no objects or multiple objects afford a specific task. As areminder,we use the whole verb-pronoun (or verb-noun) description as token span. With probability 0.5, an image is cropped to a random size, where each side is between384and1333pixels. Both of the student and teacher TOIST models are initialized with the model pre-trained by [4]. In an image, the most suitable objects (one or more) for solving the task are selected and their bounding boxes are taken as ground truth labels for detection.
Neural Information Processing Systems
Feb-9-2026, 17:35:38 GMT
- Technology: