Continuing Education
It's not too late to stop Trump and the Silicon Valley broligarchy from controlling our lives, but we must act now
Carole Cadwalladr
To walk into the lion's den once might be considered foolhardy. To do so again after being mauled by the lion? Six years ago I gave a talk at Ted, the world's leading technology and ideas conference. It led to a gruelling lawsuit and a series of consequences that reverberate through my life to this day. And last week I returned. To give another talk that would incorporate some of my experience: a Ted Talk about being sued for giving a Ted Talk, and how the lessons I'd learned from surviving all that were a model for surviving the "broligarchy", a concept I first wrote about in the Observer in July last year: the alignment of Silicon Valley and autocracy, and a kind of power the world has never seen before.
HOUDINI: Lifelong Learning as Program Synthesis
Lazar Valkov, Dipak Chaudhari, Akash Srivastava, Charles Sutton, Swarat Chaudhuri
We present a neurosymbolic framework for the lifelong learning of algorithmic tasks that mix perception and procedural reasoning. Reusing high-level concepts across domains and learning complex procedures are key challenges in lifelong learning. We show that a program synthesis approach that combines gradient descent with combinatorial search over programs can be a more effective response to these challenges than purely neural methods.
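A purely illustrative reading of that combination, not the authors' code: enumerate small typed compositions of neural modules drawn from a shared library, fit each candidate program's parameters by gradient descent, and keep the best one (the trained modules could then be reused on later tasks). The module names, the toy regression task, and the tiny search space below are assumptions made for the sketch.

# Sketch: combinatorial search over module compositions + gradient descent.
# Everything here (module names, task, search space) is illustrative.
import torch
import torch.nn as nn

# Library of reusable neural modules (in a lifelong setting it grows over time).
library = {
    "perceive": lambda: nn.Sequential(nn.Linear(4, 8), nn.ReLU()),
    "reason":   lambda: nn.Sequential(nn.Linear(8, 8), nn.ReLU()),
    "decide":   lambda: nn.Linear(8, 1),
}

def candidate_programs(max_len=3):
    # Enumerate typed pipelines: perceive -> reason* -> decide.
    for k in range(max_len):
        yield ["perceive"] + ["reason"] * k + ["decide"]

def train_program(names, xs, ys, steps=200):
    # Instantiate the program and fit its parameters by gradient descent.
    program = nn.Sequential(*[library[n]() for n in names])
    opt = torch.optim.Adam(program.parameters(), lr=1e-2)
    loss_fn = nn.MSELoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(program(xs), ys)
        loss.backward()
        opt.step()
    return loss.item(), program

# Toy task standing in for a mixed perception/procedure problem.
xs = torch.randn(256, 4)
ys = xs.sum(dim=1, keepdim=True)

best = min((train_program(p, xs, ys) + (p,) for p in candidate_programs()),
           key=lambda t: t[0])
print("best program:", best[2], "loss:", best[0])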
LLaVA-CMoE: Towards Continual Mixture of Experts for Large Vision-Language Models
Hengyuan Zhao, Ziqin Wang, Qixin Sun, Kaiyou Song, Yilin Li, Xiaolin Hu, Qingpei Guo, Si Liu
Although applying Mixture of Experts (MoE) to large language models is widely regarded as an effective strategy for continual learning of new tasks, two major challenges remain: (1) as the number of tasks grows, simple parameter-expansion strategies can lead to excessively large models; and (2) modifying the parameters of the existing router erodes previously acquired knowledge. In this paper, we present LLaVA-CMoE, a continual MoE architecture that requires no replay data. Specifically, we develop Probe-Guided Knowledge Extension (PGKE), which employs probe experts to assess whether additional knowledge is required at a specific layer. This allows the model to expand its parameters adaptively based on the task distribution, significantly improving the efficiency of parameter expansion. Additionally, we introduce a hierarchical routing algorithm called Probabilistic Task Locator (PTL), in which high-level routing captures inter-task information and low-level routing focuses on intra-task details, ensuring that new task experts do not interfere with existing ones. Our experiments show that this efficient architecture substantially improves model performance on the CoIN benchmark while maintaining a reasonable parameter count.
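The abstract does not give the PGKE or PTL details, so the following is only a toy sketch of the general shape of two-level routing it describes: a high-level router picks a task group, and a low-level router mixes the experts inside that group. Class names, dimensions, and the hard argmax task selection are assumptions, not the paper's design.

# Sketch of hierarchical (task -> expert) routing; illustrative only.
import torch
import torch.nn as nn

class HierarchicalMoE(nn.Module):
    def __init__(self, dim, n_tasks, experts_per_task):
        super().__init__()
        self.task_router = nn.Linear(dim, n_tasks)   # high-level: which task group
        self.expert_routers = nn.ModuleList(
            [nn.Linear(dim, experts_per_task) for _ in range(n_tasks)]
        )                                            # low-level: which expert in the group
        self.experts = nn.ModuleList(
            [nn.ModuleList([nn.Linear(dim, dim) for _ in range(experts_per_task)])
             for _ in range(n_tasks)]
        )

    def forward(self, x):
        t = self.task_router(x).argmax(dim=-1)       # pick a task group per token
        out = torch.zeros_like(x)
        for i in range(len(self.experts)):           # only that group's experts fire,
            mask = (t == i)                          # so other groups are left untouched
            if mask.any():
                w = torch.softmax(self.expert_routers[i](x[mask]), dim=-1)
                y = torch.stack([e(x[mask]) for e in self.experts[i]], dim=-1)
                out[mask] = (y * w.unsqueeze(1)).sum(dim=-1)
        return out

moe = HierarchicalMoE(dim=16, n_tasks=3, experts_per_task=2)
print(moe(torch.randn(8, 16)).shape)  # torch.Size([8, 16])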
Parental Guidance: Efficient Lifelong Learning through Evolutionary Distillation
Octi Zhang, Quanquan Peng, Rosario Scalise, Byron Boots
Developing robotic agents that can generalize across diverse environments while continually evolving their behaviors is a core challenge in AI and robotics. The difficulties lie in solving increasingly complex tasks and ensuring agents can continue learning without converging on narrow, specialized solutions. Quality Diversity (QD) methods [1, 2] effectively foster diversity but often rely on trial and error, where the path to a final solution can be convoluted, leading to inefficiency and uncertainty. Our approach draws inspiration from nature's inheritance process, in which offspring not only receive but also build upon the knowledge of their predecessors. Similarly, our agents inherit distilled behaviors from previous generations, allowing them to adapt and continue learning efficiently, eventually surpassing their predecessors. This natural knowledge transfer reduces randomness, guiding exploration toward more meaningful learning without manual intervention such as reward shaping or task descriptors. What sets our method apart is that it offers a straightforward, evolution-inspired way to consolidate and progress, avoiding the need for manually defined styles or gradient editing [3, 4] to prevent forgetting. The agent's ability to retain and refine skills is driven by a blend of imitation learning (IL) and reinforcement learning (RL), naturally passing down essential behaviors while implicitly discarding inferior ones. We introduce Parental Guidance (PG-1), which makes the following contributions: 1. Distributed Evolution Framework: We propose a framework that distributes the evolution process across multiple compute instances, efficiently scheduling and analyzing evolution.
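A hypothetical sketch of the generational loop the abstract describes, with the environment, reward, and RL phase replaced by stand-ins: each child first distills the parent's behaviour by imitation, then continues learning, and replaces the parent only if it scores at least as well.

# Sketch of "inherit (distill), then improve (RL)" across generations.
# The RL phase and fitness function are placeholders, not the paper's setup.
import torch
import torch.nn as nn

def make_policy(obs_dim=8, act_dim=2):
    return nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, act_dim))

def distill(child, parent, n=512, steps=200):
    # Imitation learning: regress the child's actions onto the parent's.
    obs = torch.randn(n, 8)
    with torch.no_grad():
        target = parent(obs)
    opt = torch.optim.Adam(child.parameters(), lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.mse_loss(child(obs), target).backward()
        opt.step()

def rl_improve(policy):
    # Placeholder for an RL phase (e.g. policy-gradient training on the tasks).
    return policy

def fitness(policy):
    # Placeholder for episodic return on the evaluation tasks.
    return -policy(torch.zeros(1, 8)).abs().sum().item()

parent = make_policy()
for generation in range(3):
    child = make_policy()
    distill(child, parent)        # inherit the parent's distilled behaviour
    child = rl_improve(child)     # then learn beyond it
    if fitness(child) >= fitness(parent):
        parent = child            # the next generation builds on the best so far
print("done")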
No Representation, No Trust: Connecting Representation, Collapse, and Trust Issues in PPO
Reinforcement learning (RL) is inherently rife with non-stationarity since the states and rewards the agent observes during training depend on its changing policy. Therefore, networks in deep RL must be capable of adapting to new observations and fitting new targets. However, previous works have observed that networks trained under non-stationarity exhibit an inability to continue learning, termed loss of plasticity, and eventually a collapse in performance. For off-policy deep value-based RL methods, this phenomenon has been correlated with a decrease in representation rank and the ability to fit random targets, termed capacity loss. Although this correlation has generally been attributed to neural network learning under non-stationarity, the connection to representation dynamics has not been carefully studied in on-policy policy optimization methods. In this work, we empirically study representation dynamics in Proximal Policy Optimization (PPO) on the Atari and MuJoCo environments, revealing that PPO agents are also affected by feature rank deterioration and capacity loss. We show that this is aggravated by stronger non-stationarity, ultimately driving the actor's performance to collapse, regardless of the performance of the critic. We ask why the trust region, specific to methods like PPO, cannot alleviate or prevent the collapse and find a connection between representation collapse and the degradation of the trust region, one exacerbating the other. Finally, we present Proximal Feature Optimization (PFO), a novel auxiliary loss that, along with other interventions, shows that regularizing the representation dynamics mitigates the performance collapse of PPO agents.
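The abstract does not spell out the form of the PFO auxiliary loss, so the snippet below is a hedged sketch of one plausible regularizer of representation dynamics: alongside the usual clipped PPO objective (here a stand-in), penalize how far the actor's intermediate features drift from the features recorded when the rollout was collected. The monitored layer and the coefficient are assumptions, not the paper's values.

# Sketch of a feature-drift auxiliary loss added to a PPO-style update.
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.head = nn.Linear(hidden, act_dim)

    def forward(self, obs):
        feats = self.body(obs)          # representation being monitored
        return self.head(feats), feats

actor = Actor(obs_dim=8, act_dim=4)
obs = torch.randn(32, 8)

with torch.no_grad():
    _, old_feats = actor(obs)           # features frozen at rollout time

logits, feats = actor(obs)
ppo_loss = logits.mean() * 0.0          # stand-in for the clipped PPO objective
feature_reg = (feats - old_feats).pow(2).mean()
loss = ppo_loss + 0.1 * feature_reg     # 0.1 is an illustrative coefficient
loss.backward()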
Remember
As all readers of this essay know, I am not in any way an expert in machine learning (ML) and large language models (LLMs), so my descriptions and observations are, at best, lightweight cartoons of what is actually going on. Please keep this in mind as you read. Some of you may remember Spock's death in Star Trek II (The Wrath of Khan) and the brief scene where Spock mind-melds with Dr. McCoy: Spock says "remember" while depositing his katra in McCoy's brain in anticipation of his self-sacrifice to save the starship Enterprise. As I read about yet another new breakthrough in artificial intelligence (AI) from Google Research, I thought of that scene. The new idea, christened "TITAN", is for an ML system to continue learning while in use, after training.
Appendix
This is the appendix of our work 'Structure-free Graph Condensation: From Large-scale Graphs to Condensed Graph-free Data'. In this appendix, we provide more details of the proposed SFGC in terms of related works, potential application scenarios, dataset statistics, method analysis, and experimental settings, along with some additional results. Dataset Distillation (Condensation) aims to synthesize a small, representative dataset that distills the most important knowledge from a given large target dataset, such that the synthesized small dataset can serve as an effective substitute for the large target dataset in various scenarios [30, 49], e.g., model training and inference, architecture search, and continual learning. Typically, DD [59] and DC-KRR [39] adopted the meta-learning framework to solve bi-level distillation objectives by calculating meta-gradients. In contrast, DC [77], DM [76], and MTT [4] designed surrogate functions that avoid unrolled optimization through gradient matching, feature-distribution matching, and training-trajectory matching, respectively, where the core idea is to effectively mimic the large target dataset with the synthesized small dataset.
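As a concrete illustration of the gradient-matching idea behind DC (simplified, and not the SFGC method itself), the sketch below learns a handful of synthetic examples whose training gradients on a toy model mimic those of the real data; the model, sizes, and step counts are arbitrary.

# Sketch of gradient-matching dataset condensation on a toy model.
import torch
import torch.nn as nn

model = nn.Linear(16, 4)                           # toy model; real methods use CNNs/GNNs
real_x, real_y = torch.randn(256, 16), torch.randint(0, 4, (256,))
syn_x = torch.randn(8, 16, requires_grad=True)     # condensed inputs (learnable)
syn_y = torch.randint(0, 4, (8,))                  # fixed synthetic labels
opt = torch.optim.Adam([syn_x], lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

def grads(x, y, create_graph=False):
    # Gradient of the training loss w.r.t. the model parameters.
    loss = loss_fn(model(x), y)
    return torch.autograd.grad(loss, list(model.parameters()), create_graph=create_graph)

for _ in range(100):
    opt.zero_grad()
    g_real = grads(real_x, real_y)                     # target gradients (constants)
    g_syn = grads(syn_x, syn_y, create_graph=True)     # differentiable w.r.t. syn_x
    match = sum(((a - b) ** 2).sum() for a, b in zip(g_syn, g_real))
    match.backward()                                   # update the synthetic data
    opt.step()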
Continual Deep Learning by Functional Regularisation of Memorable Past
Continually learning new skills is important for intelligent systems, yet standard deep learning methods suffer from catastrophic forgetting of the past. Recent works address this with weight regularisation. Functional regularisation, although computationally expensive, is expected to perform better, but rarely does so in practice. In this paper, we fix this issue with a new functional-regularisation approach that utilises a few memorable past examples that are crucial to avoid forgetting. By using a Gaussian Process formulation of deep networks, our approach enables training in weight-space while identifying both the memorable past and a functional prior. Our method achieves state-of-the-art performance on standard benchmarks and opens a new direction for life-long learning where regularisation and memory-based methods are naturally combined.
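A simplified sketch of the functional-regularisation idea on a few stored "memorable" examples: while fitting the new task, penalize deviation of the network's outputs on those examples from the outputs of the model snapshot taken after the previous task. The paper's Gaussian Process formulation is replaced here by a plain output-matching penalty, so treat this only as the general shape of the idea.

# Sketch: regularize in function space on memorable past examples.
import copy
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))
old_net = copy.deepcopy(net).eval()              # snapshot after the previous task
mem_x = torch.randn(16, 8)                       # a few memorable past inputs

new_x, new_y = torch.randn(128, 8), torch.randint(0, 2, (128,))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()

for _ in range(200):
    opt.zero_grad()
    task_loss = ce(net(new_x), new_y)
    with torch.no_grad():
        old_out = old_net(mem_x)
    func_reg = (net(mem_x) - old_out).pow(2).mean()   # stay close in function space
    (task_loss + 1.0 * func_reg).backward()           # 1.0 is an illustrative weight
    opt.step()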
Improved Schemes for Episodic Memory-based Lifelong Learning
Tianbao Yang
Current deep neural networks can achieve remarkable performance on a single task. However, when a deep neural network is continually trained on a sequence of tasks, it tends to gradually forget previously learned knowledge. This phenomenon is referred to as catastrophic forgetting and motivates the field of lifelong learning. Recently, episodic-memory-based approaches such as GEM [1] and A-GEM [2] have shown remarkable performance. In this paper, we provide the first unified view of episodic-memory-based approaches from an optimization perspective.
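For context, the A-GEM-style gradient correction that such episodic-memory methods build on can be written in a few lines: if the gradient on the current batch conflicts with the gradient on a batch drawn from memory, project away the conflicting component. A minimal sketch, with toy gradients:

# Sketch of the A-GEM gradient projection rule.
import torch

def agem_project(grad, grad_ref):
    # grad, grad_ref: flattened gradients on the current batch / memory batch.
    dot = torch.dot(grad, grad_ref)
    if dot < 0:                                   # update would increase memory loss
        grad = grad - (dot / grad_ref.pow(2).sum()) * grad_ref
    return grad

g = torch.tensor([1.0, -2.0])
g_mem = torch.tensor([1.0, 1.0])
print(agem_project(g, g_mem))   # tensor([1.5000, -1.5000])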