Object-Oriented Architecture


Compositional Generalization from First Principles
Matthias Bethge, Wieland Brendel

Neural Information Processing Systems

Leveraging the compositional nature of our world to expedite learning and facilitate generalization is a hallmark of human perception. In machine learning, on the other hand, achieving compositional generalization has proven to be an elusive goal, even for models with explicit compositional priors. To get a better handle on compositional generalization, we here approach it from the bottom up: Inspired by identifiable representation learning, we investigate compositionality as a property of the data-generating process rather than the data itself. This reformulation enables us to derive mild conditions on only the support of the training distribution and the model architecture, which are sufficient for compositional generalization. We further demonstrate how our theoretical framework applies to real-world scenarios and validate our findings empirically. Our results set the stage for a principled theoretical study of compositional generalization.
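The abstract's reformulation can be made concrete with a toy sketch: compositionality is treated as a property of the data-generating process, where each latent component is rendered by its own function and the outputs are combined by a composition operator. The linear component functions and additive composition below are illustrative assumptions, not the paper's formal setup.

```python
import numpy as np

rng = np.random.default_rng(0)
A1, A2 = rng.normal(size=(4, 2)), rng.normal(size=(4, 2))
phi1 = lambda z: A1 @ z           # component function for the first latent slot
phi2 = lambda z: A2 @ z           # component function for the second latent slot

def generate(z1, z2):
    """Compositional data-generating process: C(phi1(z1), phi2(z2)) with C = sum."""
    return phi1(z1) + phi2(z2)

# Compositional generalization asks whether a model trained on a limited
# support of (z1, z2) combinations can handle novel recombinations of
# components it has seen individually.
x_seen = generate(np.ones(2), np.zeros(2))
x_novel = generate(np.ones(2), np.ones(2))   # unseen pairing of seen parts
```

In this toy process, any observation of a new combination decomposes exactly into contributions from components seen in training, which is what makes conditions on the training support (rather than on the full distribution) plausible.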


Data Spatial Programming

arXiv.org Artificial Intelligence

We introduce a novel programming model, Data Spatial Programming, which extends the semantics of Object-Oriented Programming (OOP) by introducing new class-like constructs called archetypes. These archetypes encapsulate spatial relationships between data entities and execution flow in a structured manner, enabling more expressive and semantically rich computations over interconnected data structures. By formalizing the relationships between data elements in space, our approach allows for more intuitive modeling of complex systems where the topology of connections is essential to the underlying computational model. This paradigm addresses limitations in traditional OOP when representing dynamically evolving networks, agent-based systems, and other spatially-oriented computational problems.
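To give a flavor of the archetype idea, here is a hypothetical Python sketch: one archetype holds data together with its topology, and another carries computation that moves through that topology. The paper's constructs are language-level semantics; the `Node` and `Walker` classes and their methods below are illustrative names, not the paper's actual API.

```python
class Node:
    """Data archetype: holds state and its spatial connections."""
    def __init__(self, name):
        self.name = name
        self.edges = []       # topology is part of the data, not external to it

    def connect(self, other):
        self.edges.append(other)
        return other

class Walker:
    """Computation archetype: execution flow moves *through* the data."""
    def __init__(self):
        self.visited = []

    def traverse(self, start):
        frontier, seen = [start], set()
        while frontier:
            node = frontier.pop(0)
            if node.name in seen:
                continue
            seen.add(node.name)
            self.visited.append(node.name)   # behavior executed at each node
            frontier.extend(node.edges)
        return self.visited
```

The contrast with plain OOP is that traversal order and locality are first-class: computation is dispatched by position in the data topology rather than by explicit method calls on isolated objects.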


Personalized Instance-based Navigation Toward User-Specific Objects in Realistic Environments Supplemental Material

Neural Information Processing Systems

A limitation of this work concerns the visual appearance of some of the object instances in the PInNED dataset. For example, the rendering of the Habitat simulator [61] can degrade the texture quality of some objects, failing to accurately reproduce them in the environment. Moreover, instances with very small or detailed components can also exhibit reduced visual fidelity when instantiated in the simulator. Consequently, as the agent moves farther from these objects, their details become less discernible. As a direct consequence, detecting small target objects is a critical challenge for navigation agents tackling the PIN task. This behavior is showcased in Sec. E, where agents tackling the PIN task in the episodes of the PInNED dataset face significant challenges in successfully detecting instances of inherently small object categories. In fact, although agents such as the modular agent with DINOv2 [51] achieve good performance on the overall PIN task, detecting small objects remains one of the main limitations of current object-driven agents, as such objects can only be recognized when the robot is close to them. A possible future improvement could involve designing novel exploration policies that bring the robot closer to surfaces where the target might be placed, while leveraging detection criteria that take into account the scale of the observed objects. The introduction of the Personalized Instance-based Navigation (PIN) task and the accompanying PInNED dataset has the potential to advance the field of visual navigation and Embodied AI. The PIN task addresses the limitations of current embodied navigation datasets by requiring agents to distinguish between multiple instances of objects from the same category, thereby enhancing their precision and robustness in real-world scenarios. This advancement can lead to more capable and reliable robotic assistants and autonomous systems, especially in household settings.


Personalized Instance-based Navigation Toward User-Specific Objects in Realistic Environments

Neural Information Processing Systems

In recent years, research interest in visual navigation towards objects in indoor environments has grown significantly. This growth can be attributed to the recent availability of large navigation datasets in photo-realistic simulated environments, such as Gibson and Matterport3D. However, the navigation tasks supported by these datasets are often restricted to the objects present in the environment at acquisition time. They also fail to account for the realistic scenario in which the target object is a user-specific instance that can easily be confused with similar objects and may be found in multiple locations within the environment. To address these limitations, we propose a new task, named Personalized Instance-based Navigation (PIN), in which an embodied agent is tasked with locating and reaching a specific personal object by distinguishing it among multiple instances of the same category. The task is accompanied by PInNED, a dedicated new dataset composed of photo-realistic scenes augmented with additional 3D objects. In each episode, the target object is presented to the agent using two modalities: a set of visual reference images on a neutral background and manually annotated textual descriptions. Through comprehensive evaluations and analyses, we showcase the challenges of the PIN task as well as the performance and shortcomings of currently available methods designed for object-driven navigation, considering both modular and end-to-end agents.


A Separation model architecture

Neural Information Processing Systems

In Table 2, we describe the separation network architecture using a TDCN++ [21]. Compared to the original Conv-TasNet method [29], the model differs in two ways. First, instead of global layer norm, which averages statistics over frames and channels, the TDCN++ uses instance norm, also known as feature-wise global layer norm [21]: mean-and-variance normalization performed separately for each convolution channel across frames, with trainable scalar bias and scale parameters. Second, skip-residual connections feed the outputs of earlier residual blocks into the inputs of later residual blocks. Each skip-residual connection applies a dense layer with bias to the block outputs, and all incoming residual paths are summed with the regular block input from the previous block.
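The normalization described above can be sketched in a few lines: statistics are computed per channel over frames only, never across channels. This is a minimal NumPy illustration, not code from any particular TDCN++ implementation; the function name and argument shapes are assumptions.

```python
import numpy as np

def feature_wise_global_layer_norm(x, scale, bias, eps=1e-8):
    """Feature-wise global layer norm (instance norm): mean-and-variance
    normalization performed separately for each convolution channel
    across frames.

    x: (channels, frames) activations; scale, bias: trainable per-channel
    scalar parameters of shape (channels, 1).
    """
    mean = x.mean(axis=1, keepdims=True)   # statistics over frames only,
    var = x.var(axis=1, keepdims=True)     # never across channels
    return scale * (x - mean) / np.sqrt(var + eps) + bias
```

By contrast, a global layer norm would compute a single mean and variance over the whole `(channels, frames)` array.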


1 Supplement 1.1 Model Architectures

Neural Information Processing Systems

Figure 1: Model Architectures for Latent Integration. Using a latent vector of dimension k, our multiplicative model learns k interpretations of the observation, each modulated by one dimension of the latent vector. A skip connection allows the model to learn policies faster than it would without one. As a baseline, we use a concatenation model, in which the latent vector z is concatenated with the environment observation at each timestep. In both cases, by setting the corresponding model weights to zero, a learned policy could ignore the latent vector entirely, recovering a standard RL policy architecture. In practice, since k and d are small in our experiments (k = 3 and d ∈ {16, 32, 64}), the increase in computational cost is not significant.
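The two integration schemes can be sketched as follows. The weight tensor, function names, and shapes here are illustrative assumptions for a single forward pass; a trained policy would learn the weights.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d = 3, 16                        # latent and observation dimensions (per the ranges above)
obs = rng.normal(size=d)            # environment observation
z = rng.normal(size=k)              # latent vector

# Hypothetical weights for illustration; one (d, d) map per latent dimension.
W = rng.normal(size=(k, d, d)) / np.sqrt(d)

def multiplicative_features(z, obs):
    """Each latent dimension modulates its own interpretation of the
    observation; a skip connection adds the raw observation back."""
    views = np.einsum('kij,j->ki', W, obs)        # k interpretations, each (d,)
    return (z[:, None] * views).sum(axis=0) + obs

def concat_features(z, obs):
    """Baseline: concatenate the latent onto the observation."""
    return np.concatenate([obs, z])
```

Note how a zero latent vector reduces the multiplicative model to the skip path alone, matching the claim that a learned policy can ignore the latent entirely.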


Voxel-based 3D Detection and Reconstruction of Multiple Objects from a Single Image

Neural Information Processing Systems

Inferring the 3D locations and shapes of multiple objects from a single 2D image is a long-standing objective of computer vision. Most existing works either predict one of these 3D properties or focus on solving both for a single object. One fundamental challenge lies in how to learn an effective representation of the image that is well-suited for 3D detection and reconstruction. In this work, we propose to learn a regular grid of 3D voxel features from the input image, aligned with 3D scene space via a 3D feature lifting operator. Based on the 3D voxel features, our novel CenterNet-3D detection head formulates 3D detection as keypoint detection in 3D space. Moreover, we devise an efficient coarse-to-fine reconstruction module, including coarse-level voxelization and a novel local PCA-SDF shape representation, which enables fine-detail reconstruction and an order of magnitude faster inference than prior methods. With complementary supervision from both 3D detection and reconstruction, the 3D voxel features are encouraged to be geometry- and context-preserving, benefiting both tasks. The effectiveness of our approach is demonstrated through 3D detection and reconstruction in single-object and multiple-object scenarios. Code is available at http://cvlab.cse.
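Formulating detection as keypoint detection in 3D means finding local maxima in a voxel objectness heatmap. The sketch below illustrates that idea only; it is not the authors' CenterNet-3D code, and the function name and threshold are assumptions.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def extract_3d_peaks(heatmap, threshold=0.5):
    """Return voxel coordinates that are local maxima above a threshold.

    heatmap: (X, Y, Z) array of per-voxel objectness scores in [0, 1].
    A voxel is a peak if it equals the max of its 3x3x3 neighborhood.
    """
    local_max = maximum_filter(heatmap, size=3) == heatmap
    return np.argwhere(local_max & (heatmap > threshold))
```

In a full pipeline, each peak would index the voxel feature grid to regress per-object attributes (size, orientation) and seed the reconstruction module.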


VastTrack: Vast Category Visual Object Tracking

Neural Information Processing Systems

In this paper, we propose a novel benchmark, named VastTrack, aiming to facilitate the development of general visual tracking by encompassing abundant classes and videos. VastTrack offers several attractive properties: (1) Vast Object Category. In particular, it covers targets from 2,115 categories, significantly surpassing the object classes of existing popular benchmarks (e.g., GOT-10k with 563 classes and LaSOT with 70 categories). By providing such a vast range of object classes, we expect to enable the learning of more general object tracking. (2) Larger Scale. Compared with current benchmarks, VastTrack provides 50,610 videos with 4.2 million frames, which makes it to date the largest dataset in terms of the number of videos, and hence could benefit training even more powerful visual trackers in the deep learning era.


Articulate your NeRF: Unsupervised articulated object modeling via conditional view synthesis

Neural Information Processing Systems

We propose a novel unsupervised method to learn the pose and part segmentation of articulated objects with rigid parts. Given two observations of an object in different articulation states, our method learns the geometry and appearance of the object parts with an implicit model from the first observation, and distills the part segmentation and articulation from the second observation by rendering it. Additionally, to tackle the complexities of jointly optimizing part segmentation and articulation, we propose a voxel-grid-based initialization strategy and a decoupled optimization procedure. Compared to prior unsupervised work, our model obtains significantly better performance, generalizes to objects with multiple parts, and can be optimized efficiently from only a few views of the latter observation.


Evaluating the Application of SOLID Principles in Modern AI Framework Architectures

arXiv.org Artificial Intelligence

This research evaluates the extent to which modern AI frameworks, specifically TensorFlow and scikit-learn, adhere to the SOLID design principles: Single Responsibility, Open/Closed, Liskov Substitution, Interface Segregation, and Dependency Inversion. By analyzing the frameworks' architectural documentation and design philosophies, this research investigates the architectural trade-offs made when balancing software engineering best practices with AI-specific needs. I examined each framework's documentation, source code, and architectural components to evaluate their adherence to these principles. The results show that both frameworks adopt certain aspects of SOLID design principles but make intentional trade-offs to address performance, scalability, and the experimental nature of AI development. TensorFlow focuses on performance and scalability, sometimes sacrificing strict adherence to principles like Single Responsibility and Interface Segregation. scikit-learn's design philosophy, by contrast, aligns more closely with SOLID principles through consistent interfaces and composition, deviating only occasionally for performance optimization and scalability. This research found that applying SOLID principles in AI frameworks is context-dependent, as performance, scalability, and flexibility often require deviations from traditional software engineering principles. It contributes to understanding how domain-specific constraints influence architectural decisions in modern AI frameworks and how these frameworks strategically adapt their design choices to balance these conflicting requirements.
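The kind of interface consistency attributed to scikit-learn above can be sketched without the library itself. This is an illustrative toy in the spirit of a fit/predict convention, not actual scikit-learn code; all class and function names are made up.

```python
from abc import ABC, abstractmethod

class Estimator(ABC):
    """Minimal estimator abstraction: one narrow interface (Interface
    Segregation) that all concrete estimators share."""
    @abstractmethod
    def fit(self, X, y): ...
    @abstractmethod
    def predict(self, X): ...

class MeanPredictor(Estimator):
    """Predicts the training-set mean. New estimators extend the system
    by adding classes, not by modifying existing code (Open/Closed)."""
    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self
    def predict(self, X):
        return [self.mean_ for _ in X]

def evaluate(model: Estimator, X, y):
    """Depends on the abstraction, not a concrete class (Dependency
    Inversion); any conforming estimator is substitutable (Liskov)."""
    preds = model.fit(X, y).predict(X)
    return sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y)
```

The trade-off the abstract describes arises when strict adherence to such abstractions conflicts with performance, e.g. when a framework fuses responsibilities into one component to avoid interface overhead on hot paths.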