long-term memory
Memory Mosaics at scale
Memory Mosaics [Zhang et al., 2025], networks of associative memories, have demonstrated appealing compositional and in-context learning capabilities on medium-scale networks (GPT-2 scale) and synthetic small datasets. This work shows that these favorable properties remain when we scale memory mosaics to large language model sizes (llama-8B scale) and real-world datasets. To this end, we scale memory mosaics to 10B size, we train them on one trillion tokens, we introduce a couple architectural modifications ("memory mosaics v2"), we assess their capabilities across three evaluation dimensions: training-knowledge storage, new-knowledge storage, and in-context learning. Throughout the evaluation, memory mosaics v2 match transformers on the learning of training knowledge (first dimension) and significantly outperforms transformers on carrying out new tasks at inference time (second and third dimensions). These improvements cannot be easily replicated by simply increasing the training data for transformers. A memory mosaics v2 trained on one trillion tokens still perform better on these tasks than a transformer trained on eight trillion tokens.
Smooth and Flexible Camera Movement Synthesis via Temporal Masked Generative Modeling
In dance performances, choreographers define the visual expression of movement, while cinematographers shape its final presentation through camera work. Consequently, the synthesis of camera movements informed by both music and dance has garnered increasing research interest. While recent advancements have led to notable progress in this area, existing methods predominantly operate in an offline manner--that is, they require access to the entire dance sequence before generating corresponding camera motions. This constraint renders them impractical for real-time applications, particularly in live stage performances, where immediate responsiveness is essential. To address this limitation, we introduce a more practical yet challenging task: online camera movement synthesis, in which camera trajectories must be generated using only the current and preceding segments of dance and music. In this paper, we propose TemMEGA (Temporal Masked Generative Modeling), a unified framework capable of handling both online and offline camera movement generation. TemMEGA consists of three key components.
Decision Mamba: Reinforcement Learning via Hybrid Selective Sequence Modeling
Recent works have shown the remarkable superiority of transformer models in reinforcement learning (RL), where the decision-making problem is formulated as sequential generation. Transformer-based agents could emerge with self-improvement in online environments by providing task contexts, such as multiple trajectories, called in-context RL. However, due to the quadratic computation complexity of attention in transformers, current in-context RL methods suffer from huge computational costs as the task horizon increases. In contrast, the Mamba model is renowned for its efficient ability to process long-term dependencies, which provides an opportunity for in-context RL to solve tasks that require long-term memory. To this end, we first implement Decision Mamba (DM) by replacing the backbone of Decision Transformer (DT).
WISE: Rethinking the Knowledge Memory for Lifelong Model Editing of Large Language Models
Large language models (LLMs) need knowledge updates to meet the ever-growing world facts and correct the hallucinated responses, facilitating the methods of lifelong model editing. Where the updated knowledge resides in memories is a fundamental question for model editing. In this paper, we find that editing either long-term memory (direct model parameters) or working memory (non-parametric knowledge of neural network activations/representations by retrieval) will result in an impossible triangle---reliability, generalization, and locality can not be realized together in the lifelong editing settings. For long-term memory, directly editing the parameters will cause conflicts with irrelevant pretrained knowledge or previous edits (poor reliability and locality). For working memory, retrieval-based activations can hardly make the model understand the edits and generalize (poor generalization). Therefore, we propose WISE to bridge the gap between memories.
MemVLT: Vision-LanguageTrackingwithAdaptive Memory-basedPrompts
As an extension of traditional visual single object tracking (SOT) task [2, 3, 4], VLT can harness the complementary advantages of multiple modalities. Therefore, vision-language trackers (VLTs) have the potential to achieve more promising tracking performance, which has recently attracted widespreadattention[5,6,7,8].