Multi-modal Motion Prediction using Temporal Ensembling with Learning-based Aggregation