Train vast neural networks together
Hivemind uses a Kademlia-based distributed hash table (DHT) that can scale to tens of thousands of peers with logarithmic search complexity. On each forward pass, a peer first determines which "speciality" of experts is needed to process the current inputs using a small "gating function" module. It then finds the k (e.g., 4) most suitable experts from other peers in the network using the DHT protocol. Finally, it sends forward-pass requests to the selected experts, collects their outputs, and averages them into the final prediction. Compared to traditional architectures, the Mixture-of-Experts needs much less bandwidth because every input is sent to only a small fraction of all experts. More importantly, decentralized Mixture-of-Experts layers are inherently fault-tolerant: if some of the chosen experts fail to respond, the model simply averages the outputs of those that did, treating the failures as a form of dropout.
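Here is a minimal, self-contained PyTorch sketch of that forward pass. Everything in it is illustrative rather than the real hivemind API: a plain dictionary stands in for the Kademlia DHT lookup, and a `FlakyRemoteExpert` that randomly raises a timeout simulates an unresponsive peer.

```python
import random
import torch
import torch.nn as nn

class FlakyRemoteExpert(nn.Module):
    """Hypothetical stand-in for an expert hosted on another peer.
    Randomly 'fails' to mimic a peer that does not respond in time."""

    def __init__(self, dim: int, failure_rate: float = 0.3):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.failure_rate = failure_rate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if random.random() < self.failure_rate:
            raise TimeoutError("peer did not respond")  # simulated network failure
        return self.ffn(x)


class DecentralizedMoE(nn.Module):
    """Sketch of the layer described above:
    gate -> find k experts -> query them -> average whatever arrives."""

    def __init__(self, dim: int, num_specialities: int, k: int = 4):
        super().__init__()
        self.gating = nn.Linear(dim, num_specialities)  # small gating function
        # In hivemind the experts live on other peers and are found through
        # the DHT; here a local dict plays the role of that lookup.
        self.dht = {i: FlakyRemoteExpert(dim) for i in range(num_specialities)}
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # 1. Score expert specialities for the current inputs.
        speciality_scores = self.gating(x).mean(dim=0)
        # 2. "Find" the k most suitable experts (a DHT search in the real system).
        top_k = speciality_scores.topk(self.k).indices.tolist()
        # 3. Send forward-pass requests; a silent peer is simply skipped.
        outputs = []
        for expert_id in top_k:
            try:
                outputs.append(self.dht[expert_id](x))
            except TimeoutError:
                continue  # a failed expert is just "dropped out"
        if not outputs:
            return x  # sketch-only fallback if every chosen expert failed
        # 4. Average the outputs that did arrive.
        return torch.stack(outputs).mean(dim=0)


x = torch.randn(8, 64)
moe = DecentralizedMoE(dim=64, num_specialities=32, k=4)
print(moe(x).shape)  # torch.Size([8, 64])
```

In the real system, step 2 is a logarithmic-time DHT search across peers and step 3 travels over the network, but the select-query-average structure and the skip-on-failure behavior are the same idea the paragraph above describes.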
Aug-27-2020, 14:00:17 GMT