Scalable CP Decomposition for Tensor Learning using GPU Tensor Cores