Episodic Memory in Lifelong Language Learning
Cyprien de Masson d'Autume, Sebastian Ruder, Lingpeng Kong, Dani Yogatama
We introduce a lifelong language learning setup where a model needs to learn from a stream of text examples without any dataset identifier. We propose an episodic memory model that performs sparse experience replay and local adaptation to mitigate catastrophic forgetting in this setup. Experiments on text classification and question answering demonstrate the complementary benefits of sparse experience replay and local adaptation to allow the model to continuously learn from new datasets. We also show that the space complexity of the episodic memory module can be reduced significantly ( 50-90%) by randomly choosing which examples to store in memory with a minimal decrease in performance. We consider an episodic memory component as a crucial building block of general linguistic intelligence and see our model as a first step in that direction.
f8d2e80c1458ea2501f98a2cafadb397-AuthorFeedback.pdf
We thank the reviewers for their thoughtful comments. We will add them to an updated version of the paper and perform more analysis to improve it. Examples where local adaptation helps. Table 1: Two examples where local adaptation helps. Context: david niven (actor) - pics, videos, dating, & news david niven male born mar 1, 1910 james david graham niven, known professionally as david niven, was an english actor and novelist [... ] Query: in 1959, for which film did david niven win his only academy award?
Verified Uncertainty Calibration
Ananya Kumar, Percy S. Liang, Tengyu Ma
Applications such as weather forecasting and personalized medicine demand models that output calibrated probability estimates--those representative of the true likelihood of a prediction. Most models are not calibrated out of the box but are recalibrated by post-processing model outputs. We find in this work that popular recalibration methods like Platt scaling and temperature scaling are (i) less calibrated than reported, and (ii) current techniques cannot estimate how miscalibrated they are.
Multi-armed Bandits: Competing with Optimal Sequences
We consider sequential decision making problem in the adversarial setting, where regret is measured with respect to the optimal sequence of actions and the feedback adheres the bandit setting. It is well-known that obtaining sublinear regret in this setting is impossible in general, which arises the question of when can we do better than linear regret? Previous works show that when the environment is guaranteed to vary slowly and furthermore we are given prior knowledge regarding its variation (i.e., a limit on the amount of changes suffered by the environment), then this task is feasible. The caveat however is that such prior knowledge is not likely to be available in practice, which causes the obtained regret bounds to be somewhat irrelevant. Our main result is a regret guarantee that scales with the variation parameter of the environment, without requiring any prior knowledge about it whatsoever. By that, we also resolve an open problem posted by Gur, Zeevi and Besbes [8]. An important key component in our result is a statistical test for identifying non-stationarity in a sequence of independent random variables. This test either identifies nonstationarity or upper-bounds the absolute deviation of the corresponding sequence of mean values in terms of its total variation. This test is interesting on its own right and has the potential to be found useful in additional settings.
Focus On What Matters: Separated Models For Visual-Based RL Generalization
A primary challenge for visual-based Reinforcement Learning (RL) is to generalize effectively across unseen environments. Although previous studies have explored different auxiliary tasks to enhance generalization, few adopt image reconstruction due to concerns about exacerbating overfitting to task-irrelevant features during training. Perceiving the pre-eminence of image reconstruction in representation learning, we propose SMG (Separated Models for Generalization), a novel approach that exploits image reconstruction for generalization. SMG introduces two model branches to extract task-relevant and task-irrelevant representations separately from visual observations via cooperatively reconstruction. Built upon this architecture, we further emphasize the importance of task-relevant features for generalization. Specifically, SMG incorporates two additional consistency losses to guide the agent's focus toward task-relevant areas across different scenarios, thereby achieving free from overfitting. Extensive experiments in DMC demonstrate the SOTA performance of SMG in generalization, particularly excelling in video-background settings. Evaluations on robotic manipulation tasks further confirm the robustness of SMG in real-world applications. Source code is available at https://anonymous.4open.science/r/SMG/.
Soft Superpixel Neighborhood Attention
Images contain objects with deformable boundaries, such as the contours of a human face, yet attention operators act on square windows. This mixes features from perceptually unrelated regions, which can degrade the quality of a denoiser. One can exclude pixels using an estimate of perceptual groupings, such as superpixels, but the naive use of superpixels can be theoretically and empirically worse than standard attention. Using superpixel probabilities rather than superpixel assignments, this paper proposes soft superpixel neighborhood attention (SNA) which interpolates between the existing neighborhood attention and the naive superpixel neighborhood attention. This paper presents theoretical results showing SNA is the optimal denoiser under a latent superpixel model. SNA outperforms alternative local attention modules on image denoising, and we compare the superpixels learned from denoising with those learned with superpixel supervision.
Multi-relational Poincaré Graph Embeddings Carl Allen 1 Timothy Hospedales
Hyperbolic embeddings have recently gained attention in machine learning due to their ability to represent hierarchical data more accurately and succinctly than their Euclidean analogues. However, multi-relational knowledge graphs often exhibit multiple simultaneous hierarchies, which current hyperbolic models do not capture. To address this, we propose a model that embeds multi-relational graph data in the Poincaré ball model of hyperbolic space. Our Multi-Relational Poincaré model (MuRP) learns relation-specific parameters to transform entity embeddings by Möbius matrix-vector multiplication and Möbius addition. Experiments on the hierarchical WN18RR knowledge graph show that our Poincaré embeddings outperform their Euclidean counterpart and existing embedding methods on the link prediction task, particularly at lower dimensionality.
All reviewers Architecture ablation study: An ablation study over different model architectures (Table (a)) shows that the chosen
Whilst the architecture is in part motivated empirically, it is also based on a recent theoretical rationale (ICML 2019 Honorable Mention) for using a vector offset (translation) to represent a relation [1]. FB15k's lack of hierarchy offers no advantage to hyperbolic embeddings, but its large number MuRP does not also set out to include MTL, but we hope to address this in future work. We will include all recommendations, e.g. However, we agree that it is important to compare models across a range of dimensionalities. Note that for MuRP with biases replaced by (transformed) norms, performance reduces (e.g.