AITopics

2504.13101

Country:

North America > United States (0.67)
Europe (0.46)

Genre: Research Report > New Finding (0.46)

Industry: Health & Medicine (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
(2 more...)

arXiv.org Artificial IntelligenceFeb-17-2025

Pretraining Frequency Predicts Compositional Generalization of CLIP on Real-World Tasks

Wiedemer, Thaddäus, Sharma, Yash, Prabhu, Ameya, Bethge, Matthias, Brendel, Wieland

We investigate the success conditions for compositional generalization of CLIP models on real-world data through performance prediction. Prior work shows that CLIP requires exponentially more pretraining data for linear performance gains on individual concepts. This sample-inefficient scaling could be mitigated if CLIP systematically understood new inputs as compositions of learned components, allowing rare observation to be mapped to common concepts. To explore CLIP's compositional generalization ability, we filter retrieval corpora for samples with object combinations not present in the pretraining corpus. We show that CLIP's performance on these samples can be accurately predicted from the pretraining frequencies of individual objects. Our findings demonstrate that CLIP learns to disentangle objects observed in its pretraining data and can recompose them straightforwardly. Additionally, we are the first to show how this ability scales with pretraining data. For data curation in practice, our results suggest that balancing object occurrences improves generalization, which should benefit CLIP's efficiency and accuracy without scaling data volume.

artificial intelligence, machine learning, natural language, (14 more...)

2502.18326

Country: Europe > Germany (0.15)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Vision (0.70)

arXiv.org Artificial IntelligenceFeb-17-2025

LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws

Mayilvahanan, Prasanna, Wiedemer, Thaddäus, Mallick, Sayak, Bethge, Matthias, Brendel, Wieland

Scaling laws guide the development of large language models (LLMs) by offering estimates for the optimal balance of model size, tokens, and compute. More recently, loss-to-loss scaling laws that relate losses across pretraining datasets and downstream tasks have emerged as a powerful tool for understanding and improving LLM performance. In this work, we investigate which factors most strongly influence loss-to-loss scaling. Our experiments reveal that the pretraining data and tokenizer determine the scaling trend. In contrast, model size, optimization hyperparameters, and even significant architectural differences, such as between transformer-based models like Llama and state-space models like Mamba, have limited impact. Consequently, practitioners should carefully curate suitable pretraining datasets for optimal downstream performance, while architectures and other settings can be freely optimized for training efficiency.

large language model, machine learning, natural language, (17 more...)

2502.1212

Country: Europe (0.28)

Genre: Research Report > New Finding (0.69)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial IntelligenceNov-12-2024

Interaction Asymmetry: A General Principle for Learning Composable Abstractions

Brady, Jack, von Kügelgen, Julius, Lachapelle, Sébastien, Buchholz, Simon, Kipf, Thomas, Brendel, Wieland

Learning disentangled representations of concepts and re-composing them in unseen ways is crucial for generalizing to out-of-domain situations. However, the underlying properties of concepts that enable such disentanglement and compositional generalization remain poorly understood. In this work, we propose the principle of interaction asymmetry which states: "Parts of the same concept have more complex interactions than parts of different concepts". We formalize this via block diagonality conditions on the $(n+1)$th order derivatives of the generator mapping concepts to observed data, where different orders of "complexity" correspond to different $n$. Using this formalism, we prove that interaction asymmetry enables both disentanglement and compositional generalization. Our results unify recent theoretical results for learning concepts of objects, which we show are recovered as special cases with $n\!=\!0$ or $1$. We provide results for up to $n\!=\!2$, thus extending these prior works to more flexible generator functions, and conjecture that the same proof strategies generalize to larger $n$. Practically, our theory suggests that, to disentangle concepts, an autoencoder should penalize its latent capacity and the interactions between concepts during decoding. We propose an implementation of these criteria using a flexible Transformer-based VAE, with a novel regularizer on the attention weights of the decoder. On synthetic image datasets consisting of objects, we provide evidence that this model can achieve comparable object disentanglement to existing models that use more explicit object-centric priors.

artificial intelligence, machine learning, natural language, (17 more...)

2411.07784

Country: Europe > Germany (0.28)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science (1.00)

arXiv.org Machine LearningNov-4-2024

Cross-Entropy Is All You Need To Invert the Data Generating Process

Reizinger, Patrik, Bizeul, Alice, Juhos, Attila, Vogt, Julia E., Balestriero, Randall, Brendel, Wieland, Klindt, David

Supervised learning has become a cornerstone of modern machine learning, yet a comprehensive theory explaining its effectiveness remains elusive. Empirical phenomena, such as neural analogy-making and the linear representation hypothesis, suggest that supervised models can learn interpretable factors of variation in a linear fashion. Recent advances in self-supervised learning, particularly nonlinear Independent Component Analysis, have shown that these methods can recover latent structures by inverting the data generating process. We extend these identifiability results to parametric instance discrimination, then show how insights transfer to the ubiquitous setting of supervised learning with cross-entropy minimization. We prove that even in standard classification tasks, models learn representations of ground-truth factors of variation up to a linear transformation. We corroborate our theoretical contribution with a series of empirical studies. First, using simulated data matching our theoretical assumptions, we demonstrate successful disentanglement of latent factors. Second, we show that on DisLib, a widely-used disentanglement benchmark, simple classification tasks recover latent structures up to linear transformations. Finally, we reveal that models trained on ImageNet encode representations that permit linear decoding of proxy factors of variation. Together, our theoretical findings and experiments offer a compelling explanation for recent observations of linear representations, such as superposition in neural networks. This work takes a significant step toward a cohesive theory that accounts for the unreasonable effectiveness of supervised deep learning.

artificial intelligence, latent variable, machine learning, (14 more...)

2410.21869

Country:

North America > United States (0.46)
Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.14)
Europe > Switzerland > Zürich > Zürich (0.14)

Genre: Research Report (0.83)

Industry: Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)

arXiv.org Machine LearningOct-24-2024

Rule Extrapolation in Language Models: A Study of Compositional Generalization on OOD Prompts

Mészáros, Anna, Ujváry, Szilvia, Brendel, Wieland, Reizinger, Patrik, Huszár, Ferenc

LLMs show remarkable emergent abilities, such as inferring concepts from presumably out-of-distribution prompts, known as in-context learning. Though this success is often attributed to the Transformer architecture, our systematic understanding is limited. In complex real-world data sets, even defining what is out-of-distribution is not obvious. To better understand the OOD behaviour of autoregressive LLMs, we focus on formal languages, which are defined by the intersection of rules. We define a new scenario of OOD compositional generalization, termed rule extrapolation. Rule extrapolation describes OOD scenarios, where the prompt violates at least one rule. We evaluate rule extrapolation in formal languages with varying complexity in linear and recurrent architectures, the Transformer, and state space models to understand the architectures' influence on rule extrapolation. We also lay the first stones of a normative theory of rule extrapolation, inspired by the Solomonoff prior in algorithmic information theory.

large language model, machine learning, natural language, (20 more...)

2409.13728

Country:

Europe > Germany (0.28)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)

Genre: Research Report > Experimental Study (0.93)

Industry: Information Technology (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.93)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.88)

arXiv.org Machine LearningJun-28-2024

InfoNCE: Identifying the Gap Between Theory and Practice

Rusak, Evgenia, Reizinger, Patrik, Juhos, Attila, Bringmann, Oliver, Zimmermann, Roland S., Brendel, Wieland

Previous theoretical work on contrastive learning (CL) with InfoNCE showed that, under certain assumptions, the learned representations uncover the ground-truth latent factors. We argue these theories overlook crucial aspects of how CL is deployed in practice. Specifically, they assume that within a positive pair, all latent factors either vary to a similar extent, or that some do not vary at all. However, in practice, positive pairs are often generated using augmentations such as strong cropping to just a few pixels. Hence, a more realistic assumption is that all latent factors change, with a continuum of variability across these factors. We introduce AnInfoNCE, a generalization of InfoNCE that can provably uncover the latent factors in this anisotropic setting, broadly generalizing previous identifiability results in CL. We validate our identifiability results in controlled experiments and show that AnInfoNCE increases the recovery of previously collapsed information in CIFAR10 and ImageNet, albeit at the cost of downstream accuracy. Additionally, we explore and discuss further mismatches between theoretical assumptions and practical implementations, including extensions to hard negative mining and loss ensembles.

aninfonce, artificial intelligence, machine learning, (15 more...)

2407.00143

Country:

Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.14)
North America > United States > New York (0.14)
North America > Canada > Ontario > Toronto (0.14)

Genre:

Research Report > Experimental Study (0.68)
Research Report > Strength High (0.54)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Vision (0.92)

arXiv.org Machine LearningJun-17-2024

Position: Understanding LLMs Requires More Than Statistical Generalization

Reizinger, Patrik, Ujváry, Szilvia, Mészáros, Anna, Kerekes, Anna, Brendel, Wieland, Huszár, Ferenc

The last decade has seen blossoming research in deep learning theory attempting to answer, "Why does deep learning generalize?" A powerful shift in perspective precipitated this progress: the study of overparametrized models in the interpolation regime. In this paper, we argue that another perspective shift is due, since some of the desirable qualities of LLMs are not a consequence of good statistical generalization and require a separate theoretical explanation. Our core argument relies on the observation that AR probabilistic models are inherently non-identifiable: models zero or near-zero KL divergence apart -- thus, equivalent test loss -- can exhibit markedly different behaviors. We support our position with mathematical examples and empirical observations, illustrating why non-identifiability has practical relevance through three case studies: (1) the non-identifiability of zero-shot rule extrapolation; (2) the approximate non-identifiability of in-context learning; and (3) the non-identifiability of fine-tunability. We review promising research directions focusing on LLM-relevant generalization measures, transferability, and inductive biases.

large language model, machine learning, natural language, (17 more...)

2405.01964

Country:

Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.14)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

arXiv.org Artificial IntelligenceDec-7-2023

If your data distribution shifts, use self-learning

Rusak, Evgenia, Schneider, Steffen, Pachitariu, George, Eck, Luisa, Gehler, Peter, Bringmann, Oliver, Brendel, Wieland, Bethge, Matthias

We demonstrate that self-learning techniques like entropy minimization and pseudo-labeling are simple and effective at improving performance of a deployed computer vision model under systematic domain shifts. We conduct a wide range of large-scale experiments and show consistent improvements irrespective of the model architecture, the pre-training technique or the type of distribution shift. At the same time, self-learning is simple to use in practice because it does not require knowledge or access to the original training data or scheme, is robust to hyperparameter choices, is straight-forward to implement and requires only a few adaptation epochs. This makes self-learning techniques highly attractive for any practitioner who applies machine learning algorithms in the real world.

artificial intelligence, deep learning, machine learning, (18 more...)

2104.12928

Country:

Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.14)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.14)
North America > United States > California > San Francisco County > San Francisco (0.14)

Genre: Research Report > New Finding (0.67)

Industry:

Education (0.46)
Information Technology (0.45)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

arXiv.org Artificial IntelligenceNov-29-2023

An Interventional Perspective on Identifiability in Gaussian LTI Systems with Independent Component Analysis

Rajendran, Goutham, Reizinger, Patrik, Brendel, Wieland, Ravikumar, Pradeep

We investigate the relationship between system identification and intervention design in dynamical systems. While previous research demonstrated how identifiable representation learning methods, such as Independent Component Analysis (ICA), can reveal cause-effect relationships, it relied on a passive perspective without considering how to collect data. Our work shows that in Gaussian Linear Time-Invariant (LTI) systems, the system parameters can be identified by introducing diverse intervention signals in a multi-environment setting. By harnessing appropriate diversity assumptions motivated by the ICA literature, our findings connect experiment design and representational identifiability in dynamical systems. We corroborate our findings on synthetic and (simulated) physical data. Additionally, we show that Hidden Markov Models, in general, and (Gaussian) LTI systems, in particular, fulfil a generalization of the Causal de Finetti theorem with continuous parameters.

artificial intelligence, lti system, machine learning, (15 more...)

2311.18048

Country:

Europe (0.68)
North America > United States > New York (0.14)

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (1.00)