A Theoretical Analysis
This section contains the theoretical analysis of the loss functions of offline experience replay (Proposition 2), augmented experience replay (Proposition 3), and online experience replay with reservoir sampling (Proposition 1). At each iteration t, t = 1, ..., T, a batch of data B is sampled from the incoming task.

Note 3: Consider a balanced continual learning dataset (e.g., Split-CIFAR100, Split-Mini-ImageNet) where |D

Note 4: Consider general continual learning datasets.

Table 3 lists the image size, the number of classes, the number of tasks, and data size per task of the four CL benchmarks.

C.1 Continual Learning Implementation
The hyperparameter settings are summarized in Table 4. All models are optimized using vanilla SGD.
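For concreteness, the reservoir sampling update used to maintain the replay memory in online experience replay can be sketched as follows (a standard textbook implementation; the function and variable names are ours, not from the paper):

```python
import random

def reservoir_update(memory, capacity, item, n_seen):
    """Reservoir sampling: after seeing n_seen stream items, the buffer
    holds a uniform random sample of size min(capacity, n_seen).

    memory   -- current buffer (list), mutated in place
    capacity -- maximum buffer size
    item     -- new stream element
    n_seen   -- number of items seen so far, including this one
    """
    if len(memory) < capacity:
        memory.append(item)
    else:
        j = random.randrange(n_seen)  # uniform in [0, n_seen)
        if j < capacity:
            memory[j] = item          # replace with probability capacity/n_seen
    return memory
```

Each stream element ends up in the buffer with equal probability capacity/n_seen, which is what makes the memory an (approximately) unbiased sample of the whole stream.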
Repeated Augmented Rehearsal: A Simple but Strong Baseline for Online Continual Learning
Online continual learning (OCL) aims to train neural networks incrementally from a non-stationary data stream with a single pass through data. Rehearsal-based methods attempt to approximate the observed input distributions over time with a small memory and revisit them later to avoid forgetting. Despite their strong empirical performance, rehearsal methods still suffer from a poor approximation of past data's loss landscape with memory samples. This paper revisits the rehearsal dynamics in online settings. We provide theoretical insights on the inherent memory overfitting risk from the viewpoint of biased and dynamic empirical risk minimization, and examine the merits and limits of repeated rehearsal. Inspired by our analysis, a simple and intuitive baseline, repeated augmented rehearsal (RAR), is designed to address the underfitting-overfitting dilemma of online rehearsal. Surprisingly, across four rather different OCL benchmarks, this simple baseline outperforms vanilla rehearsal by 9%-17% and also significantly improves the state-of-the-art rehearsal-based methods MIR, ASER, and SCR. We also demonstrate that RAR successfully achieves an accurate approximation of the loss landscape of past data and high-loss ridge aversion in its learning trajectory. Extensive ablation studies are conducted to study the interplay between repeated and augmented rehearsal, and reinforcement learning (RL) is applied to dynamically adjust the hyperparameters of RAR to balance the stability-plasticity trade-off online.
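The core of RAR as described above is simple: each incoming batch is replayed several times, with a freshly sampled and freshly augmented memory batch on every repeat. A minimal, framework-agnostic sketch of that loop (the function names and the `model_update` callback are our illustration, not the paper's code):

```python
import random

def rar_step(model_update, incoming_batch, memory, mem_batch_size, augment, n_repeat):
    """One Repeated Augmented Rehearsal (RAR) iteration, sketched:
    perform n_repeat gradient updates on the incoming batch joined with
    an independently resampled and re-augmented memory batch each time.

    model_update -- callable performing one SGD step on a joint batch
    augment      -- random data augmentation applied per memory sample
    """
    for _ in range(n_repeat):
        mem_batch = random.sample(memory, min(mem_batch_size, len(memory)))
        aug_mem = [augment(x) for x in mem_batch]  # fresh augmentation per repeat
        model_update(incoming_batch + aug_mem)     # one step on the joint batch
```

Resampling and re-augmenting on every repeat is the point of the design: repetition alone fights underfitting on the incoming data, while the per-repeat augmentation counteracts overfitting to the small memory.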
OptEx: Expediting First-Order Optimization with Approximately Parallelized Iterations
Ying Tiffany He
First-order optimization (FOO) algorithms are pivotal in numerous computational domains, such as reinforcement learning and deep learning. However, their application to complex tasks often entails significant optimization inefficiency due to their need for many sequential iterations to converge. In response, we introduce first-order optimization expedited with approximately parallelized iterations (OptEx), the first general framework that enhances the optimization efficiency of FOO by leveraging parallel computing to directly mitigate its requirement of many sequential iterations for convergence. To achieve this, OptEx utilizes a kernelized gradient estimation that is based on the history of evaluated gradients to predict the gradients required by the next few sequential iterations in FOO, which helps to break the inherent iterative dependency and hence enables the approximate parallelization of iterations in FOO. We further establish theoretical guarantees for the estimation error of our kernelized gradient estimation and the iteration complexity of SGD-based OptEx, confirming that the estimation error diminishes to zero as the history of gradients accumulates and that our SGD-based OptEx enjoys an effective acceleration rate of Θ(√N) over standard SGD given parallelism of N, in terms of the sequential iterations required for convergence. Finally, we provide extensive empirical studies, including synthetic functions, reinforcement learning tasks, and neural network training on various datasets, to underscore the substantial efficiency improvements achieved by OptEx in practice. Our implementation is available at https://github.com/youyve/OptEx.
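The kernelized gradient estimation at the heart of OptEx can be illustrated with plain kernel (ridge) regression over the gradient history. The sketch below is our simplified illustration under an RBF-kernel assumption, not the paper's exact estimator:

```python
import numpy as np

def rbf_kernel(X, Y, lengthscale=1.0):
    """RBF kernel matrix between row-stacked point sets X and Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def estimate_gradient(x, X_hist, G_hist, lengthscale=1.0, noise=1e-6):
    """Predict the gradient at parameter vector x by kernel regression
    on previously evaluated (parameter, gradient) pairs.

    X_hist -- (n, dim) history of parameter iterates
    G_hist -- (n, dim) gradients evaluated at those iterates
    """
    K = rbf_kernel(X_hist, X_hist, lengthscale) + noise * np.eye(len(X_hist))
    k = rbf_kernel(x[None, :], X_hist, lengthscale)   # (1, n) cross-kernel
    return (k @ np.linalg.solve(K, G_hist))[0]        # (dim,) predicted gradient
```

The estimator interpolates the gradient history, so predictions near previously visited iterates become accurate as the history grows, which is what allows the next few iterations to be launched in parallel before their true gradients are available.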
6 Appendix
We also need "strides" as input to indicate how many new blocks will be kept in each step. BM25 is a well-known TF-IDF-like information retrieval method: each block is scored based on the words it shares with the query or the textual label. However, semantic relevance is neglected; for example, BM25 fails to find the relevance between the label name "sports" and "baseball player". GloVe is a set of pretrained word representations.
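For reference, the block scoring described above can be sketched with the standard BM25 formula (usual k1 and b defaults; the function name is ours):

```python
import math
from collections import Counter

def bm25_scores(query_tokens, blocks, k1=1.5, b=0.75):
    """Score each block (a list of tokens) against the query with BM25."""
    N = len(blocks)
    avgdl = sum(len(blk) for blk in blocks) / N
    df = Counter()                      # document frequency per term
    for blk in blocks:
        df.update(set(blk))
    scores = []
    for blk in blocks:
        tf = Counter(blk)               # term frequency within this block
        s = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(blk) / avgdl)
            s += idf * tf[term] * (k1 + 1) / norm
        scores.append(s)
    return scores
```

Note how the purely lexical scoring exhibits exactly the failure mode mentioned above: a block containing only "baseball player" scores zero against the query "sports", since no token overlaps.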
CogLTX: Applying BERT to Long Texts
Chang Zhou, Tsinghua University
BERT is incapable of processing long texts due to its quadratically increasing memory and time consumption. The most natural ways to address this problem, such as slicing the text with a sliding window or simplifying transformers, suffer from insufficient long-range attention or need customized CUDA kernels. The maximum length limit in BERT reminds us of the limited capacity (5∼9 chunks) of the working memory of humans --- then how do human beings Cognize Long TeXts?
UDA: A Benchmark Suite for Retrieval Augmented Generation in Real-world Document Analysis
The use of Retrieval-Augmented Generation (RAG) has improved Large Language Models (LLMs) in collaborating with external data, yet significant challenges remain in real-world scenarios. In areas such as academic literature and finance question answering, data are often found as raw text and tables in HTML or PDF formats, which can be lengthy and highly unstructured. In this paper, we introduce a benchmark suite, namely Unstructured Document Analysis (UDA), that involves 2,965 real-world documents and 29,590 expert-annotated Q&A pairs. We revisit popular LLM- and RAG-based solutions for document analysis and evaluate the design choices and answer qualities across multiple document domains and diverse query types. Our evaluation yields interesting findings and highlights the importance of data parsing and retrieval. We hope our benchmark can shed light on and better serve real-world document analysis applications.
VaiPhy: a Variational Inference Based Algorithm for Phylogeny Appendix
A.1 Update Equation Details
The update equations of VaiPhy follow the standard mean-field VI updates. Here, \i denotes the set of nodes except node i, and C is a constant. During the training of VaiPhy, we used a maximum-likelihood heuristic to update the branch lengths given a tree topology. After training, we used the tree topologies sampled from SLANTIS and the corresponding branch lengths sampled from the JC sampler to compute the IWELBO.

A.2 Neighbor-Joining Initialization
We utilize the neighbor-joining (NJ) algorithm to initialize VaiPhy with a reasonable state. The sequence data is fed into BIONJ, an NJ variant, to create an initial reference phylogenetic tree using the PhyML software, version 3.3.20200621.
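For intuition, the core NJ loop can be sketched as follows. This is a minimal illustration of the classic Saitou-Nei procedure that recovers only the topology (branch-length bookkeeping omitted), not the BIONJ variant actually used above:

```python
import numpy as np

def neighbor_joining(D, names):
    """Minimal Neighbor-Joining: repeatedly join the pair of nodes
    minimizing the Q-criterion, collapsing them into one internal node.
    Returns the unrooted topology as nested tuples of leaf names."""
    D = np.asarray(D, dtype=float)
    nodes = list(names)
    while len(nodes) > 2:
        n = len(nodes)
        row = D.sum(axis=1)
        Q = (n - 2) * D - row[:, None] - row[None, :]  # Q-criterion matrix
        np.fill_diagonal(Q, np.inf)
        i, j = divmod(int(np.argmin(Q)), n)            # pair with minimal Q
        # distances from the new internal node to all remaining nodes
        d_new = 0.5 * (D[i] + D[j] - D[i, j])
        keep = [k for k in range(n) if k not in (i, j)]
        D = np.vstack([np.hstack([D[np.ix_(keep, keep)], d_new[keep][:, None]]),
                       np.hstack([d_new[keep], [0.0]])])
        nodes = [nodes[k] for k in keep] + [(nodes[i], nodes[j])]
    return (nodes[0], nodes[1])
```

On an additive distance matrix, the procedure recovers the tree that generated the distances, which is what makes NJ a reasonable initializer for the variational state.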