Supplementary Materials for M3ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-task Learning with Model-Accelerator Co-design
Neural Information Processing Systems
The final ViT block's output feature will be fed into decoders for multi-task predictions. Each decoder contains five conv layers (the first four of dimension 256 and the final one of a dimension corresponding to the task prediction) and four upsampling layers. Compared to the SoTA encoder-focused work Cross-Stitch, although M3ViT performs slightly lower on NYUD-v2 with two tasks, it achieves better performance on all the other settings. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs.
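The decoder layout described above can be sketched as a plain layer plan. This is a minimal illustration, not the authors' implementation: the text only gives the layer counts and the 256-channel width, so the interleaving of conv and upsampling layers, the 2x upsampling factor, and the names `Layer`, `build_decoder_plan`, and `output_shape` are all assumptions made for the example. The embedding width of 768, the 30x40 token grid, and the 40 output channels in the usage lines are likewise hypothetical values.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Layer:
    kind: str       # "conv" or "upsample"
    in_ch: int      # input channels
    out_ch: int     # output channels
    scale: int = 1  # spatial scale factor (1 for conv layers)


def build_decoder_plan(embed_dim: int, task_channels: int,
                       hidden: int = 256, up_factor: int = 2) -> List[Layer]:
    """Five conv layers and four upsampling layers for one task decoder.

    Assumption: each of the first four convs is followed by one upsampling
    layer; the source does not state the ordering or the upsampling factor.
    """
    plan: List[Layer] = []
    in_ch = embed_dim
    for _ in range(4):
        plan.append(Layer("conv", in_ch, hidden))
        plan.append(Layer("upsample", hidden, hidden, up_factor))
        in_ch = hidden
    # Final conv maps to the channel count required by the task prediction.
    plan.append(Layer("conv", hidden, task_channels))
    return plan


def output_shape(plan: List[Layer],
                 grid_hw: Tuple[int, int]) -> Tuple[int, int, int]:
    """Propagate (channels, height, width) through the plan."""
    h, w = grid_hw
    ch = plan[0].in_ch
    for layer in plan:
        if layer.in_ch != ch:
            raise ValueError(f"channel mismatch at {layer}")
        ch = layer.out_ch
        h, w = h * layer.scale, w * layer.scale
    return ch, h, w


# Hypothetical setting: 768-dim tokens on a 30x40 grid, 40 output channels.
plan = build_decoder_plan(embed_dim=768, task_channels=40)
shape = output_shape(plan, (30, 40))
```

Under these assumptions, four 2x upsampling steps enlarge the token grid by a factor of 16 in each spatial dimension, which is why a per-task decoder of this shape can bring a patch-level feature map back toward pixel resolution.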
Feb-11-2026, 12:48:01 GMT