[R] LSTM as a Dynamically Computed Element-wise Weighted Sum - reinterprets gating in LSTMs as self-attention over time (a weighted sum over the candidate states c _t); shows that dependence on h_t-1 for computing c _t is not necessary; thus gating does the heavy lifting in LSTMs, not h h mappings • r/MachineLearning