Does RoBERTa Perform Better than BERT in Continual Learning: An Attention Sink Perspective
