seq-JEPA: Autoregressive Predictive Learning of Invariant-Equivariant World Models
Neural Information Processing Systems
Joint-embedding self-supervised learning (SSL) commonly relies on transformations such as data augmentation and masking to learn visual representations, enforcing invariance or equivariance with respect to these transformations applied to two views of an image. This dominant two-view paradigm often limits the flexibility of learned representations for downstream adaptation, because it creates performance trade-offs between high-level invariance-demanding tasks, such as image classification, and more fine-grained equivariance-related tasks. In this work, we propose seq-JEPA, a world modeling framework that introduces architectural inductive biases into joint-embedding predictive architectures to resolve this trade-off. Without relying on dual equivariance predictors or loss terms, seq-JEPA simultaneously learns two architecturally separate representations for equivariance- and invariance-demanding tasks. To do so, our model processes short sequences of different views (observations) of inputs.
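The distinction between invariance and equivariance that the abstract relies on can be illustrated with a toy example. The sketch below is not seq-JEPA or any part of its architecture; it assumes a circular shift as the transformation, and the helper names (`shift`, `pooled`, `pointwise`) are hypothetical, chosen only for illustration. A feature is invariant if it is unchanged when the input is transformed, and equivariant if transforming the input transforms the feature in a corresponding, predictable way.

```python
def shift(xs, k):
    """Toy 'transformation': circularly shift a list right by k positions."""
    k %= len(xs)
    return xs[-k:] + xs[:-k] if k else list(xs)


def pooled(xs):
    """Invariant feature (illustrative): a sum ignores where the values sit."""
    return sum(xs)


def pointwise(xs):
    """Equivariant feature map (illustrative): squares each entry in place."""
    return [v * v for v in xs]


x = [1, 2, 3, 4]
view = shift(x, 1)  # a second "view" of the same input

# Invariance: the feature of the transformed view equals the feature of the input.
is_invariant = pooled(view) == pooled(x)

# Equivariance: the feature of the transformed view equals the transformed feature.
is_equivariant = pointwise(view) == shift(pointwise(x), 1)
```

Here the pooled feature discards the shift entirely, which suits tasks such as classification, while the pointwise feature map keeps enough structure for the shift to be recovered, which is what equivariance-related tasks need.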
Jun-16-2026, 00:57:23 GMT
- Country:
  - North America > Canada (0.46)
- Genre:
  - Research Report
    - New Finding (1.00)
    - Experimental Study (1.00)
- Industry:
  - Health & Medicine > Therapeutic Area > Neurology (0.46)
- Technology:
  - Information Technology
    - Sensing and Signal Processing > Image Processing (0.87)
    - Artificial Intelligence
      - Vision (1.00)
      - Representation & Reasoning (1.00)
      - Natural Language (1.00)
      - Machine Learning > Neural Networks (1.00)
      - Cognitive Science (1.00)