 compositional structure



Fully Distributed, Flexible Compositional Visual Representations via Soft Tensor Products

Neural Information Processing Systems

Since the inception of the classicalist vs. connectionist debate, it has been argued that the ability to systematically combine symbol-like entities into compositional representations is crucial for human intelligence. In connectionist systems, the field of disentanglement has gained prominence for its ability to produce explicitly compositional representations; however, it relies on a representation of compositional structure that clashes with the foundations of deep learning. To resolve this tension, we extend Smolensky's Tensor Product Representation (TPR) and introduce Soft TPR, a representational form that encodes compositional structure in an inherently distributed, flexible manner, along with a theoretically-principled architecture designed specifically to learn Soft TPRs. Comprehensive evaluations in the visual representation learning domain demonstrate that the Soft TPR framework consistently outperforms conventional disentanglement alternatives -- achieving state-of-the-art disentanglement, boosting representation learner convergence, and delivering superior sample efficiency and low-sample regime performance in downstream tasks.
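As background for the abstract above, the following is a minimal sketch of a classical Smolensky-style TPR, not of the Soft TPR or the architecture the paper proposes. It binds each filler vector to a role vector with an outer product, sums the bindings, and recovers a filler by contracting with its role. The role and filler names and values, and the helper names `outer`, `bind`, and `unbind`, are assumptions chosen for illustration; exact recovery relies on the roles being orthonormal.

```python
# Classical tensor product representation (TPR), pure-Python sketch.
# All names and vectors below are hypothetical and only illustrate the idea.

def outer(filler, role):
    """Outer product of a filler vector and a role vector (nested lists)."""
    return [[f * r for r in role] for f in filler]

def bind(pairs):
    """Sum of filler-role outer products: T = sum_i f_i r_i^T."""
    pairs = list(pairs)
    total = outer(*pairs[0])
    for filler, role in pairs[1:]:
        binding = outer(filler, role)
        total = [[a + b for a, b in zip(row_t, row_b)]
                 for row_t, row_b in zip(total, binding)]
    return total

def unbind(tpr, role):
    """Contract T with a role vector; exact when the roles are orthonormal."""
    return [sum(t * r for t, r in zip(row, role)) for row in tpr]

# Orthonormal roles (assumed) and arbitrary filler embeddings (assumed).
ROLES = {"shape": [1.0, 0.0], "colour": [0.0, 1.0]}
FILLERS = {"square": [1.0, 0.0, 0.0], "red": [0.0, 0.5, 0.5]}

tpr = bind([
    (FILLERS["square"], ROLES["shape"]),
    (FILLERS["red"], ROLES["colour"]),
])
recovered_colour = unbind(tpr, ROLES["colour"])
recovered_shape = unbind(tpr, ROLES["shape"])
```

Every entry of `tpr` mixes information from both bindings, which is the sense in which such a representation is distributed rather than a concatenation of per-factor slots.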


Activity Grammars for Temporal Action Segmentation

Neural Information Processing Systems

Sequence prediction on temporal data requires the ability to understand compositional structures of multi-level semantics beyond individual and contextual properties of parts. For this reason, temporal action segmentation, which aims to translate an untrimmed activity video into a sequence of action segments, remains challenging. This paper addresses the problem by introducing an effective activity grammar to guide neural predictions for temporal action segmentation. We propose a novel grammar induction algorithm, dubbed KARI, that extracts a powerful context-free grammar from action sequence data. We also develop an efficient generalized parser, dubbed BEP, that transforms frame-level probability distributions into a reliable sequence of actions according to the induced grammar with recursive rules. Our approach can be combined with any neural network for temporal action segmentation to enhance the sequence prediction and discover its compositional structure. Experimental results demonstrate that our method significantly improves temporal action segmentation in terms of both performance and interpretability on two standard benchmarks, Breakfast and 50 Salads.


New Approximation Results and Optimal Estimation for Fully Connected Deep Neural Networks

Tang, Zhaoji

arXiv.org Machine Learning

Farrell et al. (2021) establish non-asymptotic high-probability bounds for general deep feedforward neural network (with rectified linear unit activation function) estimators, with Farrell et al. (2021, Theorem 1) achieving a suboptimal convergence rate for fully connected feedforward networks. The authors suggest that improved approximation of fully connected networks could yield sharper versions of Farrell et al. (2021, Theorem 1) without altering the theoretical framework. By deriving approximation bounds specifically for a narrower fully connected deep neural network, this note demonstrates that Farrell et al. (2021, Theorem 1) can be improved to achieve an optimal rate (up to a logarithmic factor). Furthermore, this note briefly shows that deep neural network estimators can mitigate the curse of dimensionality for functions with compositional structure and functions defined on manifolds.


How Does RL Post-training Induce Skill Composition? A Case Study on Countdown

Park, Simon, Kaur, Simran, Arora, Sanjeev

arXiv.org Artificial Intelligence

While reinforcement learning (RL) successfully enhances reasoning in large language models, its role in fostering compositional generalization (the ability to synthesize novel skills from known components) is often conflated with mere length generalization. To this end, we study what RL post-training teaches about skill composition and how the structure of the composition affects the skill transfer. We focus on the Countdown task (given n numbers and a target, form an expression that evaluates to the target) and analyze model solutions as expression trees, where each subtree corresponds to a reusable subtask and thus can be viewed as a "skill." Tracking tree shapes and their success rates over training, we find: (i) out-of-distribution (OOD) generalization to larger n and to unseen tree shapes, indicating compositional reuse of subtasks; (ii) a structure-dependent hierarchy of learnability -- models master shallow balanced trees (workload is balanced between subtasks) before deep unbalanced ones, with persistent fragility on right-heavy structures (even when the composition depth is the same as some left-heavy structures). Our diagnostic reveals what is learned, in what order, and where generalization fails, clarifying how RL-only post-training induces OOD generalization beyond what standard metrics such as pass@k reveal.
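To make the expression-tree view concrete, here is a small sketch of how a Countdown solution could be represented, evaluated, and reduced to its shape. It is not the authors' tooling. The tuple encoding, the helper names `evaluate`, `shape`, `depth`, and `solve`, and the rule that every number must be used exactly once are all assumptions made for illustration.

```python
from fractions import Fraction

# A tree is either a number (leaf) or a tuple (op, left, right). Hypothetical encoding.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b if b != 0 else None,  # None marks an invalid division
}

def evaluate(tree):
    """Exact value of an expression tree, or None if it divides by zero."""
    if not isinstance(tree, tuple):
        return Fraction(tree)
    op, left, right = tree
    a, b = evaluate(left), evaluate(right)
    if a is None or b is None:
        return None
    return OPS[op](a, b)

def shape(tree):
    """Tree shape with numbers and operators erased; '.' stands for a leaf."""
    if not isinstance(tree, tuple):
        return "."
    _, left, right = tree
    return "(" + shape(left) + shape(right) + ")"

def depth(tree):
    """Number of nested operations on the longest root-to-leaf path."""
    if not isinstance(tree, tuple):
        return 0
    _, left, right = tree
    return 1 + max(depth(left), depth(right))

def solve(numbers, target):
    """Brute-force search that merges two subtrees at a time (assumes all numbers are used)."""
    def search(trees):
        if len(trees) == 1:
            return trees[0] if evaluate(trees[0]) == target else None
        for i in range(len(trees)):
            for j in range(len(trees)):
                if i == j:
                    continue
                rest = [t for k, t in enumerate(trees) if k not in (i, j)]
                for op in OPS:
                    candidate = (op, trees[i], trees[j])
                    if evaluate(candidate) is None:
                        continue
                    found = search(rest + [candidate])
                    if found is not None:
                        return found
        return None
    return search(list(numbers))

left_heavy = ("*", ("+", 3, 4), 5)   # (3 + 4) * 5
right_heavy = ("*", 5, ("+", 3, 4))  # 5 * (3 + 4)
solution = solve([3, 4, 5], 35)
```

The two hand-written trees have the same depth and value but different shapes, which is the kind of distinction the abstract draws between left-heavy and right-heavy structures.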



On the Convergence of Overparameterized Problems: Inherent Properties of the Compositional Structure of Neural Networks

de Oliveira, Arthur Castello Branco, Jatkar, Dhruv, Sontag, Eduardo

arXiv.org Artificial Intelligence

This paper investigates how the compositional structure of neural networks shapes their optimization landscape and training dynamics. We analyze the gradient flow associated with overparameterized optimization problems, which can be interpreted as training a neural network with linear activations. Remarkably, we show that the global convergence properties can be derived for any cost function that is proper and real analytic. We then specialize the analysis to scalar-valued cost functions, where the geometry of the landscape can be fully characterized. In this setting, we demonstrate that key structural features -- such as the location and stability of saddle points -- are universal across all admissible costs, depending solely on the overparameterized representation rather than on problem-specific details. Moreover, we show that convergence can be arbitrarily accelerated depending on the initialization, as measured by an imbalance metric introduced in this work. Finally, we discuss how these insights may generalize to neural networks with sigmoidal activations, showing through a simple example which geometric and dynamical properties persist beyond the linear case.
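The simplest instance of such an overparameterized problem is a scalar parameter written as a product, p = w1 * w2, which can be read as a two-layer linear network with one unit per layer. The sketch below Euler-discretizes the gradient flow for the assumed cost f(p) = 0.5 * (p - 1)^2 and compares two initializations with the same product but different values of w1^2 - w2^2, a quantity that the exact flow conserves. Whether this quantity matches the imbalance metric introduced in the paper is not stated in the abstract; the cost, step size, tolerance, and initial values are assumptions for illustration.

```python
def run(w1, w2, step=1e-3, tol=1e-6, max_steps=200_000):
    """Euler-discretized gradient flow on f(w1 * w2) = 0.5 * (w1 * w2 - 1) ** 2.

    Returns (steps taken, final product, drift of w1^2 - w2^2 from its start).
    """
    start_gap = w1 * w1 - w2 * w2
    steps = 0
    while steps < max_steps:
        residual = w1 * w2 - 1.0          # f'(p) for the assumed quadratic cost
        if abs(residual) < tol:
            break
        # Chain rule: df/dw1 = f'(p) * w2 and df/dw2 = f'(p) * w1.
        w1, w2 = w1 - step * residual * w2, w2 - step * residual * w1
        steps += 1
    drift = (w1 * w1 - w2 * w2) - start_gap
    return steps, w1 * w2, drift

# Same initial product (0.3), different initial w1^2 - w2^2.
steps_low, product_low, drift_low = run(0.6, 0.5)    # gap = 0.11
steps_high, product_high, drift_high = run(3.0, 0.1)  # gap = 8.99
```

In this toy case the more imbalanced initialization needs fewer steps to reach the tolerance, and the small nonzero drift comes only from the Euler discretization.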


Compositional Image Synthesis with Inference-Time Scaling

Ji, Minsuk, Lee, Sanghyeok, Ahn, Namhyuk

arXiv.org Artificial Intelligence

Despite their impressive realism, modern text-to-image models still struggle with compositionality, often failing to render accurate object counts, attributes, and spatial relations. To address this challenge, we present a training-free framework that combines an object-centric approach with self-refinement to improve layout faithfulness while preserving aesthetic quality. Specifically, we leverage large language models (LLMs) to synthesize explicit layouts from input prompts, and we inject these layouts into the image generation process, where an object-centric vision-language model (VLM) judge iteratively re-ranks multiple candidates to select the most prompt-aligned outcome. By unifying explicit layout grounding with self-refinement-based inference-time scaling, our framework achieves stronger scene alignment with prompts than recent text-to-image models.

Index Terms: text-to-image synthesis, inference-time scaling, object-centric

1. INTRODUCTION

Text-to-image (T2I) diffusion models now deliver striking realism and diversity from textual prompts [1, 2, 3, 4], yet they still struggle with compositionality: the precise rendering of object counts, attributes, and spatial relations [5].


Separating the what and how of compositional computation to enable reuse and continual learning

Shan, Haozhe, Minni, Sun, Duncker, Lea

arXiv.org Artificial Intelligence

The ability to continually learn, retain and deploy skills to accomplish goals is a key feature of intelligent and efficient behavior. However, the neural mechanisms facilitating the continual learning and flexible (re-)composition of skills remain elusive. Here, we study continual learning and the compositional reuse of learned computations in recurrent neural network (RNN) models using a novel two-system approach: one system that infers what computation to perform, and one that implements how to perform it. We focus on a set of compositional cognitive tasks commonly studied in neuroscience. To construct the what system, we first show that a large family of tasks can be systematically described by a probabilistic generative model, where compositionality stems from a shared underlying vocabulary of discrete task epochs. The shared epoch structure makes these tasks inherently compositional. Furthermore, we develop an unsupervised online learning approach that can learn this model on a single-trial basis, building its vocabulary incrementally as it is exposed to new tasks, and inferring the latent epoch structure as a time-varying computational context within a trial. We implement the how system as an RNN whose low-rank components are composed according to the context inferred by the what system. Contextual inference facilitates the creation, learning, and reuse of low-rank RNN components as new tasks are introduced sequentially, enabling continual learning without catastrophic forgetting. Using an example task set, we demonstrate the efficacy and competitive performance of this two-system learning framework, its potential for forward and backward transfer, as well as fast compositional generalization to unseen tasks.