Supplementary Materials for M³ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-task Learning with Model-Accelerator Co-design
–Neural Information Processing Systems
The output features of the final ViT block are fed into decoders for multi-task prediction. The router is a single-layer MLP that maps each token embedding to a selection probability for each expert. The batch size is 16. … LUTs, 461K registers, 11 Mbit of block RAM, and 27 Mbit of UltraRAM. It runs at a clock frequency of 1,395 MHz and consumes 295 W of power.
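The router description above can be illustrated with a minimal sketch. It assumes the single-layer MLP is one linear map followed by a softmax over experts, and that the highest-probability experts are then selected for each token. The function names (`softmax`, `route_token`), the `top_k` parameter, the toy sizes, and the random weights are all hypothetical and chosen only for illustration; they are not taken from the paper's implementation.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(token, weight, bias, top_k=2):
    """One linear layer plus softmax, then keep the top_k experts.

    token:  list of d floats (a single token embedding)
    weight: num_experts rows of d floats (hypothetical router weights)
    bias:   list of num_experts floats
    Returns (probs, chosen), where chosen is a list of
    (expert_index, probability) pairs sorted by descending probability.
    """
    # Linear layer: one logit per expert.
    logits = [sum(w * x for w, x in zip(row, token)) + b
              for row, b in zip(weight, bias)]
    # Softmax turns logits into selection probabilities.
    probs = softmax(logits)
    # Rank experts by probability and keep the top_k of them.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = [(i, probs[i]) for i in ranked[:top_k]]
    return probs, chosen


# Toy sizes and random weights, assumed only for this example.
rng = random.Random(0)
d_model, num_experts = 8, 4
weight = [[rng.gauss(0.0, 0.1) for _ in range(d_model)]
          for _ in range(num_experts)]
bias = [0.0] * num_experts
token = [rng.gauss(0.0, 1.0) for _ in range(d_model)]

probs, chosen = route_token(token, weight, bias, top_k=2)
print(probs)   # one probability per expert, summing to 1
print(chosen)  # the two experts this token would be sent to
```

In this sketch each token is routed independently, so a batch of tokens would simply call `route_token` once per token; a practical implementation would batch the linear map as a single matrix multiplication.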
Aug-18-2025, 02:29:15 GMT