The authors of the manuscript consider the continuum learning setting, where the learner observes a stream of training data points ordered according to the tasks they belong to, i.e., the learner encounters data from the next task only after it has observed all the training data for the current one. The authors propose a set of three metrics for evaluating the performance of learning algorithms in this setting, which reflect their ability to transfer information to new tasks and to avoid forgetting information about earlier tasks. Could the authors please comment on the difference between continuum and lifelong learning (the corresponding sentence in line 254 seems incomplete)? The authors also propose a learning method, termed Gradient Episodic Memory (GEM). The idea of the method is to keep a set of examples from every observed task and to ensure that, at each update step, the loss on the observed tasks does not increase.
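For illustration only, the following sketch (an assumption-based simplification, not the authors' implementation, which solves a small quadratic program over all task constraints jointly) shows the kind of gradient projection that keeps the loss on the stored examples from increasing:

```python
import numpy as np

def project_gradient(g, memory_grads):
    """Illustrative sketch of GEM-style gradient projection, handling one
    constraint at a time; the paper instead solves a small QP over all
    constraints jointly.

    g            : flat gradient of the loss on the current task
    memory_grads : flat gradients of the loss on each stored episodic
                   memory (one per previously observed task)
    """
    g = g.copy()
    for g_k in memory_grads:
        dot = float(g @ g_k)
        if dot < 0.0:  # the update would increase the loss on task k's memory
            # Remove the offending component so that <g, g_k> >= 0.
            g -= (dot / (g_k @ g_k)) * g_k
    return g

# Toy usage: two past-task gradients, one of which conflicts with g.
g = np.array([1.0, -1.0])
mem = [np.array([0.0, 1.0]), np.array([1.0, 0.0])]
print(project_gradient(g, mem))  # the conflict with the first memory is removed
```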
One major obstacle towards AI is the poor ability of models to solve new problems more quickly, and without forgetting previously acquired knowledge. To better understand this issue, we study the problem of continual learning, where the model observes, once and one by one, examples concerning a sequence of tasks. First, we propose a set of metrics to evaluate models learning over a continuum of data. These metrics characterize models not only by their test accuracy, but also in terms of their ability to transfer knowledge across tasks. Second, we propose a model for continual learning, called Gradient Episodic Memory (GEM), that alleviates forgetting while allowing beneficial transfer of knowledge to previous tasks. Our experiments on variants of the MNIST and CIFAR-100 datasets demonstrate the strong performance of GEM when compared to the state-of-the-art.
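As a concrete illustration of such transfer-aware evaluation, a minimal sketch follows; the notation is assumed here (R[i, j] is test accuracy on task j after training through task i, and b[j] is the accuracy of a randomly initialized model on task j) rather than quoted from the paper:

```python
import numpy as np

def continual_metrics(R, b):
    """Sketch of accuracy, backward-transfer, and forward-transfer metrics
    for continual learning (assumed notation: R[i, j] = test accuracy on
    task j after training on tasks 0..i, b[j] = accuracy of a randomly
    initialized model on task j)."""
    T = R.shape[0]
    acc = R[-1].mean()                                          # final average accuracy
    bwt = np.mean([R[-1, j] - R[j, j] for j in range(T - 1)])   # backward transfer
    fwt = np.mean([R[j - 1, j] - b[j] for j in range(1, T)])    # forward transfer
    return acc, bwt, fwt

# Toy results matrix for three tasks and a random-initialization baseline.
R = np.array([[0.9, 0.1, 0.1],
              [0.8, 0.9, 0.2],
              [0.7, 0.8, 0.9]])
b = np.array([0.1, 0.1, 0.1])
print(continual_metrics(R, b))
```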
In this paper a new method for image manipulation is proposed. The proposed method incorporates a hierarchical framework and supports both interactive and automatic semantic object-level image manipulation. In the interactive manipulation setting, the user can select a bounding box in which image editing for adding or removing objects is applied. The proposed network architecture consists of a foreground output stream that predicts a binary object mask and a background output stream that produces per-pixel label maps. As a result, the proposed image manipulation method generates the output image by filling in pixel-level textures guided by the semantic layout.
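Purely as an illustration of the two-stream output structure described above (the names and layer sizes below are assumptions, not the authors' architecture), a minimal sketch might look like this:

```python
import torch
import torch.nn as nn

class TwoStreamHead(nn.Module):
    """Illustrative sketch only: a shared feature extractor feeding a
    foreground stream that predicts a binary object mask and a background
    stream that predicts per-pixel semantic labels."""

    def __init__(self, in_channels=3, num_classes=20, width=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(),
        )
        self.foreground = nn.Conv2d(width, 1, 1)            # binary object mask logits
        self.background = nn.Conv2d(width, num_classes, 1)  # per-pixel label logits

    def forward(self, x):
        h = self.backbone(x)
        return torch.sigmoid(self.foreground(h)), self.background(h)

model = TwoStreamHead()
mask, labels = model(torch.randn(1, 3, 128, 128))
print(mask.shape, labels.shape)  # (1, 1, 128, 128), (1, 20, 128, 128)
```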
The paper addresses the very important topic of lifelong learning, and it proposes to employ an episodic memory to avoid catastrophic forgetting. The memory is based on a key-value representation that exploits an encoder-decoder architecture built on BERT. Training is performed on the concatenation of different datasets, with no need to specify dataset identifiers. The work is highly significant and the novelty of the contribution is remarkable. One point that would have deserved more attention is the strategy for reading from and writing to the episodic memory (see also the comments below).
We introduce a lifelong language learning setup where a model needs to learn from a stream of text examples without any dataset identifier. We propose an episodic memory model that performs sparse experience replay and local adaptation to mitigate catastrophic forgetting in this setup. Experiments on text classification and question answering demonstrate the complementary benefits of sparse experience replay and local adaptation to allow the model to continuously learn from new datasets. We also show that the space complexity of the episodic memory module can be reduced significantly (50-90%) by randomly choosing which examples to store in memory, with a minimal decrease in performance. We consider an episodic memory component as a crucial building block of general linguistic intelligence and see our model as a first step in that direction.
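To make the memory design concrete, here is a minimal sketch, under assumed names and rates rather than the authors' code, of a randomly written episodic memory combined with sparse experience replay (local adaptation at inference time is omitted for brevity):

```python
import random

class EpisodicMemory:
    """Sketch of a write-limited episodic memory: examples are stored
    uniformly at random with rate `write_prob`, which is one way to realize
    the 50-90% reduction in stored examples mentioned above."""

    def __init__(self, write_prob=0.5):
        self.write_prob = write_prob
        self.buffer = []

    def maybe_write(self, example):
        if random.random() < self.write_prob:
            self.buffer.append(example)

    def sample(self, k):
        return random.sample(self.buffer, min(k, len(self.buffer)))

def train_stream(stream, memory, train_step, replay_every=100, replay_batch=32):
    """Sparse experience replay: only every `replay_every` steps is a batch
    drawn from memory and replayed alongside the incoming examples."""
    for step, example in enumerate(stream, 1):
        train_step([example])
        memory.maybe_write(example)
        if step % replay_every == 0 and memory.buffer:
            train_step(memory.sample(replay_batch))

# Toy usage with a dummy training step.
mem = EpisodicMemory(write_prob=0.3)
train_stream(range(1000), mem, train_step=lambda batch: None)
print(len(mem.buffer), "examples stored")
```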
This paper proposes the use of memory in lifelong learning to prevent catastrophic forgetting by means of experience replay and local adaptation. The idea is simple, yet it is an interesting new step in this line of work. The paper would be a good addition to the conference, and has support from the reviewers.
The authors do a good job of motivating their work, and they contribute a nice experimental section with good results. The ablation study was thorough. Well done! --- Many tasks that might be given to an RL agent are impossible without working memory. This paper presents a suite of tasks that require the use of such memory in order to succeed. These tasks are compiled from a variety of other sources, either directly or re-implemented for this suite.
We thank the reviewers for their thoughtful and constructive feedback on our manuscript. Reviewer 3 noted that the task descriptions in Section 2 could be better presented; we have reformatted this section so that the ordering of the descriptions is clearer, which should help both contextualize each task's difficulty and illustrate what it involves. We also changed our description of IMPALA to match Reviewer 5's suggestion. Regarding the task suite, Reviewer 4 raised a thoughtful consideration on whether most of the findings translate to settings that do not require 3D navigation. Some 3D tasks in the suite already have '2D-like' semi-counterparts that do not require navigation: '2D-like' because everything is fully observable and the agent has a first-person point of view from a fixed point, without needing to move. The Spot the Difference level was overall harder than Change Detection for our ablation models.
Weaknesses:
- The method section does not look self-contained and lacks descriptions of some key components. In particular:
  * What is Eq. (9) for? Why is it stated that "the SL is the negative logarithm of a polynomial in \theta" -- where is the "negative logarithm" in Eq. (9)? It looks like its practical implementation is discussed in the "Evaluating the Semantic Loss" part (L.140), which involves the Weighted Model Count (WMC) and knowledge compilation (KC). However, no details about KC are presented.
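For context on the quantity in question, the semantic loss is commonly written as (proportional to) the negative logarithm of the weighted model count of the constraint under the network's predicted probabilities; the brute-force sketch below, an illustration under that assumed definition rather than the paper's circuit-based implementation via knowledge compilation, makes the negative logarithm explicit:

```python
import itertools
import math

def semantic_loss(probs, constraint):
    """Sketch of the semantic loss as the negative logarithm of the weighted
    model count: -log sum over satisfying assignments x of
    prod_i p_i^{x_i} (1 - p_i)^{1 - x_i}. Brute-force enumeration is used
    here purely for illustration; in practice the constraint is compiled
    into a circuit (knowledge compilation) so the WMC stays tractable."""
    n = len(probs)
    wmc = 0.0
    for x in itertools.product([0, 1], repeat=n):
        if constraint(x):
            w = 1.0
            for p, xi in zip(probs, x):
                w *= p if xi else (1.0 - p)
            wmc += w
    return -math.log(wmc)

# Example: "exactly one of three variables is true" with predicted probabilities.
exactly_one = lambda x: sum(x) == 1
print(semantic_loss([0.8, 0.1, 0.1], exactly_one))
```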