Sparse Weight Activation Training
Raihan, Md Aamir, Aamodt, Tor M.
They have an indexing unit that enables sparse multiplication. A control and scheduling logic spatially maps and schedules the computations onto the processing elements (PEs). Each PE generates partial products, which are accumulated to compute the output values and finally stored in DRAM.

Mapping Computations: Consider a convolutional layer that maps input activations in ∈ ℝ^(N×C×H_I×W_I) to outputs out ∈ ℝ^(N×F×H_O×W_O). For each of the N samples in the mini-batch, the layer computes F channels of output feature maps, each of dimension ℝ^(H_O×W_O), from C channels of input feature maps of dimension ℝ^(H_I×W_I). The layer has parameters w ∈ ℝ^(F×C×H_K×W_K).

Algorithm 1: Dense forward-pass computation for a single input sample (assuming stride 1)
  Data: w, in
  Result: out
  for h_o ← 1 to H_O do
    for w_o ← 1 to W_O do
      for f ← 1 to F do
        for c ← 1 to C do
          for h_k ← 1 to H_K do
            for w_k ← 1 to W_K do
              h ← h_o + h_k; w ← w_o + w_k
              out[f][h_o][w_o] += w[f][c][h_k][w_k] · in[c][h][w]

Thus, as shown in Algorithm 1, each activation is reused F·H_K·W_K times, each weight is reused H_O·W_O times per sample (N·H_O·W_O times across the mini-batch), and the total per-sample computation is:

  Dense Convolution FLOP = F · H_O · W_O · C · H_K · W_K    (7)

The first three 'for' loops are independent and can be mapped independently to the PEs, whereas the inner three 'for' loops generate the partial products. Different sparse accelerators map the 'for' loops spatially over the PEs in different ways so as to maximize reuse and minimize data transfer to and from DRAM.
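The loop nest of Algorithm 1 and the cost in Equation (7) can be sketched directly in Python. This is a minimal illustration for one sample with stride 1 and no padding; the concrete dimensions and the random initialization are assumptions chosen for the example, not values from the paper.

```python
# Naive dense forward pass for one input sample (stride 1, no padding),
# mirroring Algorithm 1. The MAC counter verifies Equation (7).
import random

C, F = 2, 3                               # input / output channels
H_I, W_I = 5, 5                           # input feature-map size
H_K, W_K = 3, 3                           # kernel size
H_O, W_O = H_I - H_K + 1, W_I - W_K + 1   # output size for stride 1

random.seed(0)
inp = [[[random.random() for _ in range(W_I)] for _ in range(H_I)]
       for _ in range(C)]
w = [[[[random.random() for _ in range(W_K)] for _ in range(H_K)]
      for _ in range(C)] for _ in range(F)]
out = [[[0.0] * W_O for _ in range(H_O)] for _ in range(F)]

macs = 0
for h_o in range(H_O):
    for w_o in range(W_O):
        for f in range(F):               # first three loops: independent,
            for c in range(C):           # could be mapped across PEs
                for h_k in range(H_K):   # inner three loops: generate
                    for w_k in range(W_K):  # the partial products
                        out[f][h_o][w_o] += (w[f][c][h_k][w_k]
                                             * inp[c][h_o + h_k][w_o + w_k])
                        macs += 1

# Equation (7): per-sample dense convolution cost
assert macs == F * H_O * W_O * C * H_K * W_K
print(macs)  # 486 for these dimensions
```

Each activation `inp[c][h][w]` is touched once per (f, h_k, w_k) combination, i.e. F·H_K·W_K times, and each weight once per output position, i.e. H_O·W_O times, which is the reuse counted in the text.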
Jan-7-2020