data selection
MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models
Experiments on pretraining 410M and 1B models on the C4 dataset demonstrate that MATES significantly outperforms random data selection across an extensive set of downstream tasks. It doubles the gains achieved by the state-of-the-art data selection approach that leverages larger reference models, and it halves the total FLOPs required to reach a given level of performance. Further analyses validate the effectiveness of the locally probed oracle data influence and of its approximation with data influence models. Our code is open-sourced at https://github.com/cxcscmu/MA
Country:
- Asia > Middle East > Jordan (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- North America > United States > Michigan (0.04)
- Europe (0.04)