GREATS: Online Selection of High-Quality Data for LLM Training in Every Iteration

Neural Information Processing Systems

Online batch selection methods offer an adaptive alternative to static training-data selection by dynamically choosing data batches during training. However, existing methods either rely on impractical reference models or on simple heuristics that may not capture true data informativeness. To address these limitations, we propose GREedy Approximation Taylor Selection (GREATS), a principled and efficient online batch selection method that applies a greedy algorithm to optimize data batch quality as approximated by a Taylor expansion. We develop a series of techniques to scale GREATS to large-scale model training. Extensive experiments with large language models (LLMs) demonstrate that GREATS significantly improves training convergence speed and generalization performance.
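As a concrete illustration of the core idea, the toy sketch below greedily builds a batch by scoring each candidate with its gradient alignment to a validation gradient (the first-order Taylor estimate of how much including it would reduce validation loss) and penalizing redundancy with already-selected examples. The function name and the `interaction` weight are illustrative assumptions, and the sketch omits the scaling techniques that make GREATS practical for LLM training.

```python
import numpy as np

def greedy_taylor_select(train_grads, val_grad, batch_size, interaction=0.1):
    """Toy greedy batch selection driven by a first-order Taylor score.

    train_grads: (n, d) per-example training gradients.
    val_grad:    (d,)   gradient of the validation loss.
    Each example's utility is its gradient alignment with the validation
    gradient; a (hypothetical) interaction penalty, standing in for the
    pairwise terms of the Taylor expansion, discourages picking
    near-duplicate gradients.
    """
    n = train_grads.shape[0]
    alignment = train_grads @ val_grad          # <g_i, g_val> for every i
    selected, available = [], set(range(n))
    for _ in range(batch_size):
        best_i, best_gain = None, -np.inf
        for i in available:
            redundancy = sum(train_grads[i] @ train_grads[j] for j in selected)
            gain = alignment[i] - interaction * redundancy
            if gain > best_gain:
                best_i, best_gain = i, gain
        selected.append(best_i)
        available.remove(best_i)
    return selected

# Usage: pick 8 of 64 examples whose gradients best align with validation.
rng = np.random.default_rng(0)
G, g_val = rng.normal(size=(64, 16)), rng.normal(size=16)
print(greedy_taylor_select(G, g_val, batch_size=8))
```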





Appendix A Proteomics Terminology and Acronyms

Neural Information Processing Systems

Table 9 highlights interesting patterns observed in Figure 3. First, the same modification occurring at different residues can have varying effects on peptide properties, implying that residue-level PTM information is essential for better predictions. Second, some modifications share the same Unimod ID and the same molecular composition, differing only in their stereochemistry (the spatial arrangement of atoms), yet they impact peptide properties differently. Such cases occur in modified sequences and require a proper representation of PTMs (via encoding and domain-specific features) to predict peptide properties accurately. Table 11 in Appendix Section D shows the impact of PTMs on retention time for the special cases from Table 9.
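As a rough illustration (not the PROSPECT schema), one such residue-level encoding pairs each amino acid with an optional Unimod ID, so the same modification at different positions remains distinguishable to a model; the class and function names below are assumptions for the sketch.

```python
from dataclasses import dataclass
from typing import Optional, List

@dataclass
class Residue:
    amino_acid: str           # one-letter code, e.g. "S"
    unimod_id: Optional[int]  # e.g. 21 for Phospho, None if unmodified

def encode(sequence: str, mods: dict) -> List[Residue]:
    """mods maps 0-based positions to Unimod IDs."""
    return [Residue(aa, mods.get(i)) for i, aa in enumerate(sequence)]

# "AS[Phospho]DK": phosphorylation (Unimod 21) on the serine at index 1.
peptide = encode("ASDK", {1: 21})
print(peptide)
```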


PROSPECT PTMs: Rich Labeled Tandem Mass Spectrometry Dataset of Modified Peptides for Machine Learning in Proteomics

Wassim Gabriel, Omar Shouman, Ayla Schroeder

Neural Information Processing Systems

Post-Translational Modifications (PTMs) are changes that occur in proteins after synthesis, influencing their structure, function, and cellular behavior. PTMs are essential in cell biology; they regulate protein function and stability, are involved in various cellular processes, and are linked to numerous diseases. A particularly interesting class of PTMs comprises chemical modifications, such as phosphorylation, introduced on amino acid side chains, because they can drastically alter the physicochemical properties of the peptides that carry them. One or more PTMs can be attached to each amino acid of a peptide sequence. The most commonly applied technique to detect PTMs on proteins is bottom-up Mass Spectrometry (MS)-based proteomics, where proteins are digested into peptides that are subsequently analyzed using Tandem Mass Spectrometry (MS/MS).
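To make the "drastically alter" claim concrete, the sketch below computes the monoisotopic mass of a short peptide with and without phosphorylation. The residue masses and the +79.96633 Da phospho shift are standard monoisotopic values; the function and dictionary names are illustrative.

```python
# Each modification adds a fixed monoisotopic mass shift, so a single
# phosphorylation measurably changes the peptide mass observed in MS.
RESIDUE_MASS = {"A": 71.03711, "S": 87.03203, "D": 115.02694, "K": 128.09496}
WATER = 18.010565
MOD_SHIFT = {"Phospho": 79.96633, "Oxidation": 15.994915}

def peptide_mass(sequence: str, mods: list) -> float:
    """Monoisotopic mass of a peptide carrying the named modifications."""
    return (sum(RESIDUE_MASS[aa] for aa in sequence) + WATER
            + sum(MOD_SHIFT[m] for m in mods))

print(peptide_mass("ASDK", []))           # unmodified: ~419.2016 Da
print(peptide_mass("ASDK", ["Phospho"]))  # +79.96633 Da
```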



Bayesian Online Natural Gradient (BONG)

Neural Information Processing Systems

We propose a novel approach to sequential Bayesian inference based on variational Bayes (VB). The key insight is that, in the online setting, we do not need to add the KL term to regularize to the prior (which comes from the posterior at the previous timestep); instead we can optimize just the expected log-likelihood, performing a single step of natural gradient descent starting at the prior predictive. We prove this method recovers exact Bayesian inference if the model is conjugate. We also show how to compute an efficient deterministic approximation to the VB objective, as well as our simplified objective, when the variational distribution is Gaussian or a sub-family, including the case of a diagonal plus low-rank precision matrix. We show empirically that our method outperforms other online VB methods in the non-conjugate setting, such as online learning for neural networks, especially when controlling for computational costs.
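A minimal numeric check of the conjugate-recovery claim, in the simplest possible setting (a scalar Gaussian mean with known observation noise): one unit-step natural-gradient update on the expected log-likelihood, started at the prior, coincides with the exact Bayes posterior. The sketch uses the standard exponential-family duality that a natural-gradient step in natural parameters equals a plain gradient step in mean parameters; all variable names are illustrative, and this is a sanity check rather than the paper's general implementation.

```python
mu0, var0 = 0.0, 2.0      # prior N(mu0, var0) over the unknown mean
r = 0.5                   # known observation noise variance
y = 1.3                   # observed data point

# Gradients of E_q[log N(y | theta, r)] w.r.t. the mean params
# m1 = mu, m2 = mu^2 + var:  dE/dm1 = y / r,  dE/dm2 = -1 / (2 r).
lam1, lam2 = mu0 / var0, -1.0 / (2 * var0)   # prior natural parameters
lam1 += y / r                                 # one unit-step BONG update
lam2 += -1.0 / (2 * r)

var_bong = -1.0 / (2 * lam2)
mu_bong = lam1 * var_bong

# Exact conjugate posterior for comparison.
var_exact = 1.0 / (1.0 / var0 + 1.0 / r)
mu_exact = var_exact * (mu0 / var0 + y / r)
print(mu_bong, mu_exact)    # 1.04, 1.04
print(var_bong, var_exact)  # 0.4, 0.4
```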