Limitations
While our study identifies clear separations between model hypothesis classes, our best models still have not reached the consistency ceiling of the neural and behavioral benchmarks we compared against. All models were trained simultaneously across all eight scenarios of the Physion Dynamics Training Set, constituting around 16,000 total training scenarios (2,000 scenes per scenario) [Bear et al., 2021]. For each stimulus, we compute the proportion of "hit" responses. The Correlation to Average Human Response is the Pearson's correlation between the model probability-hit vector and the human proportion-hit vector, across stimuli per scenario. OCP Accuracy of humans and models is the average accuracy across stimuli per scenario. To give the final values of the two quantities, we then compute the weighted mean and s.e.m. of the above per-scenario values. Note that these values are therefore different for each condition, but always the same across all models. All neural predictivities are reported on held-out conditions and their timepoints.
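As a rough illustration of the two metrics described above (a minimal sketch, not the authors' implementation; the function names are hypothetical, and the s.e.m. formula assumes a simple weighted-variance estimate):

```python
import numpy as np

def correlation_to_avg_human(model_hit_prob, human_hit_prop):
    # Pearson's r between the model probability-hit vector and the
    # human proportion-hit vector, across stimuli within one scenario.
    return float(np.corrcoef(model_hit_prob, human_hit_prop)[0, 1])

def weighted_mean_sem(per_scenario_values, n_stimuli_per_scenario):
    # Weighted (by stimulus count) mean and approximate s.e.m.
    # across scenarios.
    v = np.asarray(per_scenario_values, dtype=float)
    w = np.asarray(n_stimuli_per_scenario, dtype=float)
    mean = float(np.average(v, weights=w))
    var = float(np.average((v - mean) ** 2, weights=w))
    sem = float(np.sqrt(var / len(v)))
    return mean, sem
```

For example, two scenarios with accuracies 0.6 and 0.8 on 100 and 300 stimuli respectively give a weighted mean of 0.75.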
Not All Splits Are Equal: Rethinking Attribute Generalization Across Unrelated Categories
Fircă, Liviu Nicolae, Bărbălau, Antonio, Oneata, Dan, Burceanu, Elena
Can models generalize attribute knowledge across semantically and perceptually dissimilar categories? While prior work has addressed attribute prediction within narrow taxonomic or visually similar domains, it remains unclear whether current models can abstract attributes and apply them to conceptually distant categories. This work presents the first explicit evaluation of the robustness of the attribute prediction task under such conditions, testing whether models can correctly infer shared attributes between unrelated object types: e.g., identifying that the attribute "has four legs" is common to both "dogs" and "chairs". To enable this evaluation, we introduce train-test split strategies that progressively reduce the correlation between training and test sets, based on LLM-driven semantic grouping, embedding-similarity thresholding, embedding-based clustering, and supercategory-based partitioning using ground-truth labels. Results show a sharp drop in performance as the correlation between training and test categories decreases, indicating strong sensitivity to split design. Among the evaluated methods, clustering yields the most effective trade-off, reducing hidden correlations while preserving learnability. These findings offer new insights into the limitations of current representations and inform future benchmark construction for attribute reasoning.
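The embedding-based clustering split strategy can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's code: `kmeans` and `clustering_split` are hypothetical names, the k-means here is a minimal numpy version, and a real pipeline would use pretrained category embeddings. Whole clusters are assigned to the test side, so that semantically similar categories never straddle the split:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    # Minimal Lloyd's k-means over category embeddings.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def clustering_split(cat_embeddings, test_frac=0.3, k=10, seed=0):
    # Assign whole clusters to the test set until roughly
    # test_frac of the categories are held out.
    labels = kmeans(cat_embeddings, k, seed=seed)
    rng = np.random.default_rng(seed)
    test_clusters, n_test = set(), 0
    for c in rng.permutation(k):
        if n_test >= test_frac * len(cat_embeddings):
            break
        test_clusters.add(int(c))
        n_test += int((labels == c).sum())
    test_mask = np.isin(labels, list(test_clusters))
    return ~test_mask, test_mask
```

Because test membership is granted cluster-by-cluster, lowering the number of clusters k makes the held-out groups coarser and the split harder, which is the knob the splits above vary.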
- Europe > Romania > București - Ilfov Development Region > Municipality of Bucharest > Bucharest (0.05)
- North America > United States > California (0.04)
Supplementary: Characterizing Generalization under Out-Of-Distribution Shifts in Deep Metric Learning A Analyzing the model bias for selecting train-test splits
These settings are used throughout our study. In Tab. 1 we show the measured FID scores; for each dataset we show examples of an easy, medium and hard train-test split. Tab. 2 illustrates the FID scores for all pairwise combinations. However, the fact that FID scores are relatively close to one another despite large semantic differences between the datasets may indicate a limitation of our utilised FID estimator. This section provides additional results for the experiments presented in Sec. 4 of the main paper. To this end, we provide the exact performance values used to visualize Figure 1 of the main paper.
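For reference, the FID between two feature sets is the Fréchet distance between Gaussians fit to each set, FID = ||mu_a - mu_b||^2 + Tr(S_a + S_b - 2(S_a S_b)^{1/2}). A minimal numpy sketch (not the paper's estimator; it uses the eigenvalue identity Tr((S_a S_b)^{1/2}) = sum_i sqrt(lambda_i(S_a S_b)) instead of an explicit matrix square root):

```python
import numpy as np

def fid(feats_a, feats_b):
    # Frechet distance between Gaussians fit to two feature sets.
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    diff = mu_a - mu_b
    # Tr((Sa Sb)^{1/2}) = sum of sqrt of eigenvalues of Sa @ Sb;
    # clip tiny negative numerical eigenvalues to zero.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)
```

Identical feature sets give an FID of zero, while a pure mean shift of 5 in each of d dimensions gives exactly 25d, which is a useful sanity check when comparing splits.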
- Oceania > Australia > New South Wales > Sydney (0.04)
- North America > United States > California (0.04)
- North America > Canada > Ontario > Toronto (0.14)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
- (3 more...)
- North America > United States > North Carolina > Durham County > Durham (0.04)
- North America > Canada > British Columbia > Vancouver (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Europe > Netherlands > South Holland > Leiden (0.04)
- Health & Medicine > Therapeutic Area (1.00)
- Government > Regional Government > North America Government > United States Government (0.67)
- Information Technology (0.67)
- Information Technology > Data Science (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.67)
- Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.46)
- North America > United States > California (0.04)
- Europe > France (0.04)
- Asia > Middle East > Jordan (0.04)
- Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
- Government > Regional Government > North America Government > United States Government (0.93)
- Materials > Chemicals (0.93)
- Information Technology (0.67)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.95)
- Information Technology > Biomedical Informatics > Translational Bioinformatics (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Zero-Shot Performance Prediction for Probabilistic Scaling Laws
Schram, Viktoria, Hiller, Markus, Beck, Daniel, Cohn, Trevor
The prediction of learning curves for Natural Language Processing (NLP) models enables informed decision-making to meet specific performance objectives, while reducing computational overhead and lowering the costs associated with dataset acquisition and curation. In this work, we formulate the prediction task as a multitask learning problem, where each task's data is modelled as being organized within a two-layer hierarchy. To model the shared information and dependencies across tasks and hierarchical levels, we employ latent-variable multi-output Gaussian Processes, enabling us to account for task correlations and supporting zero-shot prediction of learning curves (LCs). We demonstrate that this approach facilitates the development of probabilistic scaling laws at lower cost. By applying an active learning strategy, LCs can be queried to reduce predictive uncertainty, yielding predictions close to ground-truth scaling laws. We validate our framework on three small-scale NLP datasets with up to $30$ LCs. These are obtained from nanoGPT models, from bilingual translation using mBART and Transformer models, and from multilingual translation using M2M100 models of varying sizes.
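To make the probabilistic ingredient concrete: a GP posterior over a learning curve yields both a predicted score and an uncertainty at unobserved dataset sizes, and active learning queries where that uncertainty is largest. The sketch below is a deliberately simplified single-output GP with an RBF kernel (the paper uses latent-variable multi-output GPs; `gp_predict` and its hyperparameters are illustrative assumptions):

```python
import numpy as np

def rbf(a, b, ls=1.0, var=1.0):
    # Squared-exponential kernel over (e.g. log) dataset sizes.
    d = a[:, None] - b[None, :]
    return var * np.exp(-0.5 * (d / ls) ** 2)

def gp_predict(x_obs, y_obs, x_new, ls=1.0, var=1.0, noise=1e-4):
    # Standard GP regression posterior: mean and per-point std.
    K = rbf(x_obs, x_obs, ls, var) + noise * np.eye(len(x_obs))
    Ks = rbf(x_new, x_obs, ls, var)
    Kss = rbf(x_new, x_new, ls, var)
    alpha = np.linalg.solve(K, y_obs)
    mean = Ks @ alpha
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
    return mean, np.sqrt(np.clip(np.diag(cov), 0.0, None))
```

Near observed curve points the posterior std collapses toward the noise level, while far from the data it reverts to the prior; an active learner would query the LC at the size with the largest posterior std.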
- Oceania > Australia (0.04)
- North America > United States > Wisconsin > Dane County > Madison (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- (3 more...)
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- (2 more...)