Supplementary Materials for M3ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-task Learning with Model-Accelerator Co-design

Neural Information Processing Systems

The final ViT block's output feature is fed into decoders for multi-task predictions. Each decoder contains five conv layers (the first four of dimension 256 and the final one with a dimension matching the task prediction) and four upsampling layers. Compared with Cross-Stitch, the SoTA encoder-focused method, M3ViT performs slightly worse on NYUD-v2 with two tasks but achieves better performance in all other settings. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs.
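The decoder layout described above (five conv layers, four upsampling layers) can be traced with a small shape calculation. This is only a sketch under stated assumptions, not the authors' implementation: the text gives the layer counts and widths, but not the order in which convs and upsampling layers are interleaved, the upsampling factor, or the encoder width. Here each of the first four convs is assumed to be followed by one 2x upsampling layer, and the names `Layer`, `build_decoder_plan`, and `trace_shape`, as well as the example sizes (encoder width 768, a 30 x 40 feature map, 40 output channels), are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Layer:
    kind: str          # "conv" or "upsample"
    in_channels: int
    out_channels: int


def build_decoder_plan(embed_dim: int, task_channels: int, hidden: int = 256) -> List[Layer]:
    """List the layers of one task decoder.

    From the text: five convs (four of width `hidden`, the last of width
    `task_channels`) and four upsampling layers.
    ASSUMPTION: every one of the first four convs is followed by one
    upsampling layer; the text does not state the interleaving.
    """
    plan: List[Layer] = []
    in_ch = embed_dim
    for _ in range(4):
        plan.append(Layer("conv", in_ch, hidden))
        plan.append(Layer("upsample", hidden, hidden))
        in_ch = hidden
    plan.append(Layer("conv", in_ch, task_channels))
    return plan


def trace_shape(plan: List[Layer], shape: Tuple[int, int, int], scale: int = 2) -> Tuple[int, int, int]:
    """Propagate a (channels, height, width) shape through the plan.

    ASSUMPTION: convs preserve spatial size and each upsampling layer
    multiplies height and width by `scale` (2 by default).
    """
    channels, height, width = shape
    for layer in plan:
        if layer.in_channels != channels:
            raise ValueError(f"channel mismatch at {layer}: got {channels}")
        if layer.kind == "upsample":
            height *= scale
            width *= scale
        channels = layer.out_channels
    return channels, height, width


# Hypothetical example: a 768-channel encoder feature map of size 30 x 40
# decoded into a 40-channel prediction map.
plan = build_decoder_plan(embed_dim=768, task_channels=40)
print(trace_shape(plan, (768, 30, 40)))
```

Under these assumptions the four upsampling layers together enlarge the feature map by a factor of 16 in each spatial dimension, while the final conv sets the channel count to whatever the task requires.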