All models were trained on single GPUs, except for SchNet when trained on OC20-2M, which required 3 GPUs. Tables 9-12 present the extended results on OC20 across the 4 separate S2EF validation sets. Table 9: Evaluation results on the OC20 S2EF in-distribution validation set. In Table 13, we present the performance and inference throughput of the baseline models on COLL. Table 13: Evaluation of the four baseline models on the COLL test set (columns: Model, Inference Throughput in Samples / GPU sec., Energy MAE, Force MAE, Force cos, EFwT).
Table R1
ISO can perform model adaptation with a batch of instances, if available. A comparison based on a faster version of ISO will be added in the revision. During inference, the model is updated using Eqn. We will add more details in the revision. ISO yields a marginal improvement over Joint (from 42.6 to 41.8 in MPJPE) since the training and testing distributions are similar. However, even though there is no significant distribution shift, ISO still has a positive effect.
Synthetic Eggs in Many Baskets: The Impact of Synthetic Data Diversity on LLM Fine-Tuning
Schaffelder, Max, Gatt, Albert
As synthetic data becomes widely used in language model development, understanding its impact on model behavior is crucial. This paper investigates the impact of the diversity of sources of synthetic data on fine-tuned large language models. We focus on three key dimensions: distribution collapse, adversarial robustness, and self-preference bias. Our findings reveal that fine-tuning models on synthetic data from diverse sources can mitigate distribution collapse, preserving the breadth of the output distribution and the diversity of the output text. Furthermore, while both human and synthetic fine-tuning data can remove safeguards, the latter preserves higher output quality, thus making outputs potentially more usable and dangerous. Finally, fine-tuning reduces self-preference bias, with human data being the most effective, followed by multi-source synthetic data.
High-Power Training Data Identification with Provable Statistical Guarantees
Liu, Zhenlong, Zeng, Hao, Huang, Weiran, Wei, Hongxin
Identifying a specific, well-defined set of data allegedly used in model training is increasingly important. To resolve such high-stakes disputes, claims must be supported by credible evidence that strictly controls the risk of false positives, which underscores the need for methods with rigorous statistical guarantees. Conventional approaches treat this identification as a simple binary classification task without statistical guarantees. A recent approach is designed to control the false discovery rate (FDR), but its guarantees rely on strong, easily violated assumptions. In this paper, we introduce Provable Training Data Identification (PTDI), a rigorous method that identifies a set of training data with strict FDR control. Specifically, our method computes p-values for each data point using a set of known unseen data, then constructs a conservative estimator of the data usage proportion of the test set, which allows us to scale these p-values. Our approach then selects the final set of training data by identifying all points whose scaled p-values fall below a data-dependent threshold. This entire procedure enables the discovery of training data with provable, strict FDR control and significantly boosted power. Extensive experiments across a wide range of models (LLMs and VLMs) and datasets demonstrate that PTDI strictly controls the FDR and achieves higher power.
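The p-value-then-threshold pipeline described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `conformal_pvalues` and `select_training_data` are hypothetical names, the conformal p-value construction and the Benjamini-Hochberg-style data-dependent threshold are standard stand-ins for the paper's procedure, and the conservative data-usage-proportion estimator is reduced to a fixed scaling factor `pi_hat`.

```python
import numpy as np

def conformal_pvalues(test_scores, unseen_scores):
    """Conformal p-values from a calibration set of known unseen data.

    Assumes higher scores indicate more 'member-like' points; the p-value
    for a test point is the (smoothed) fraction of calibration scores at
    least as large as its score.
    """
    unseen = np.sort(np.asarray(unseen_scores))
    n = len(unseen)
    # count of calibration scores >= each test score
    ge = n - np.searchsorted(unseen, np.asarray(test_scores), side="left")
    return (1.0 + ge) / (n + 1.0)

def select_training_data(pvals, alpha=0.1, pi_hat=1.0):
    """BH-style selection on scaled p-values.

    pi_hat stands in for the paper's conservative estimator of the data
    usage proportion; the threshold is data-dependent as in BH.
    Returns indices declared to be training data.
    """
    scaled = np.asarray(pvals) * pi_hat
    m = len(scaled)
    order = np.argsort(scaled)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = scaled[order] <= thresholds
    if not below.any():
        return np.array([], dtype=int)
    k = np.max(np.nonzero(below)[0])  # largest index passing its threshold
    return order[: k + 1]
```

Usage: with calibration scores from known unseen data, a clearly member-like point gets a small p-value and is selected, while a non-member gets p-value 1 and is rejected.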