memory requirement
VeLoRA: MemoryEfficientTrainingusing Rank-1Sub-TokenProjections
Using a single projection vector, we then project these individual sub-tokens onto a one-dimensional subspace. Importantly, we notice that we can initialize this projection vector cheaply using first-order batch statistics andthen keepitfixedthroughout training. Wethen reconstruct the original tokens using the same vector during the backward pass.
A Appendix
Perplexity vs. FLOP count of MIM compared to left-to-right baselines across model sizes. To evaluate the effectiveness of "Meet in the Middle" (MIM) pre-training compared to left-to-right Perplexity vs. training time of MIM compared to left-to-right baselines across model sizes. Our largest models of size 2.7B parameters are trained using 128 A100 GPU with 80GB See Table 10 for the details of all the training runs. This paper presents "Meet in the Middle", a novel pretraining paradigm for language models that The proposed method's secondary benefits in the infilling task could also improve several NLP tasks, such as text summarization and question answering, leading to better usability and overall
Streaming Algorithms and Lower Bounds for Estimating Correlation Clustering Cost
Correlation clustering is a fundamental optimization problem at the intersection of machine learning and theoretical computer science. Motivated by applications to big data processing, recent years have witnessed a flurry of results on this problem in the streaming model. In this model, the algorithm needs to process the input $n$-vertex graph by making one or few passes over the stream of its edges and using a limited memory, much smaller than the input size. All previous work on streaming correlation clustering have focused on semi-streaming algorithms with $\Omega(n)$ memory, whereas in this work, we study streaming algorithms with much smaller memory requirement of only $\text{polylog}{(n)}$ bits. This stringent memory requirement is in the same spirit of classical streaming algorithms that instead of recovering a full solution to the problem---which can be prohibitively large with such small memory as is the case in our problem---, aimed to learn certain statistical properties of their inputs.