Object-Oriented Architecture
Supplementary Material
We printed a checkerboard with a 9x10 grid of blocks, each measuring 87 mm x 87 mm.

Table 3: Parameters for the Panoptic Segmentation model
  Parameter              Value
  Model Architecture     Panoptic-PolarNet
  Test Batch Size        2
  Val Batch Size         2
  Test Batch Size        1
  post proc threshold    0.1
  post proc nms kernel   5
  post proc top k        100
  center loss            MSE
  offset loss            L1
  center loss weight     100
  offset loss weight     10
  enable SAP             True
  SAP start epoch        30
  SAP rate               0.01

Table 6: Models of various tasks used in our experiments and their performance on SemanticKITTI
  Task                       Model               mIoU (%)
  Semantic Segmentation      Cylinder3D          67.8
  Panoptic Segmentation      Panoptic-PolarNet   59.5
  4D Panoptic Segmentation   4D-StOP             58.8

The results reveal a significant variance in performance across different categories. The dataset is divided into 17 and 6 categories, respectively. [...] 'Ground' and 'Roads', as opposed to grouping anything related to ground as a single category. Overall, the performance across these tasks underscores the challenges posed by our dataset's [...]. With our dataset, future work can focus on improving the model's capacity to handle such diverse [...]. The raw data, processed data, and framework code can be found on our website.
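For reference, the hyperparameters in Table 3 can be mirrored as a plain configuration dictionary. The sketch below is illustrative only: the key names are assumptions and do not reflect the actual configuration schema used for training.

```python
# Illustrative mirror of Table 3 as a Python dict; key names are assumptions.
panoptic_polarnet_config = {
    "model_architecture": "Panoptic-PolarNet",
    "test_batch_size": 2,        # first "Test Batch Size" entry in Table 3
    "val_batch_size": 2,
    "inference_batch_size": 1,   # second "Test Batch Size" entry in Table 3
    "post_proc": {"threshold": 0.1, "nms_kernel": 5, "top_k": 100},
    "losses": {
        "center_loss": "MSE",
        "offset_loss": "L1",
        "center_loss_weight": 100,
        "offset_loss_weight": 10,
    },
    "sap": {"enable": True, "start_epoch": 30, "rate": 0.01},
}
```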
In Pursuit of Causal Label Correlations for Multi-label Image Recognition
Multi-label image recognition aims to predict all objects present in an input image. A common belief is that modeling the correlations between objects is beneficial for multi-label recognition. However, this belief has recently been challenged, as label correlations may mislead the classifier at test time due to possible contextual bias in training. Accordingly, a few recent works not only discarded label correlation modeling but also advocated removing contextual information for multi-label image recognition. This work explicitly explores label correlations for multi-label image recognition based on a principled causal intervention approach. With causal intervention, we pursue causal label correlations and suppress spurious label correlations, as the former tend to convey useful contextual cues while the latter may mislead the classifier. Specifically, we decouple label-specific features with a Transformer decoder attached to the backbone network, and model the confounders which may give rise to spurious correlations by clustering spatial features of all training images. Based on label-specific features and confounders, we employ a cross-attention module to implement causal intervention, quantifying the causal correlations from all object categories to each predicted object category. Finally, we obtain image labels by combining the predictions from the decoupled features and the causal label correlations.
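To make the described pipeline concrete, the following PyTorch-style sketch shows the three components in sequence: label-specific feature decoupling with a Transformer decoder, a fixed confounder dictionary (e.g., cluster centers of training-set spatial features), and a cross-attention module for the intervention. All module and parameter names are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class CausalLabelCorrelation(nn.Module):
    """Hypothetical sketch: decoupled label features + causal intervention."""
    def __init__(self, dim: int, num_classes: int, num_confounders: int):
        super().__init__()
        # One learnable query per label; the decoder decouples label-specific features.
        self.label_queries = nn.Parameter(torch.randn(num_classes, dim))
        self.decoder = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        # Confounder dictionary, e.g. cluster centers of spatial features from all
        # training images; kept fixed here.
        self.register_buffer("confounders", torch.randn(num_confounders, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.cls_direct = nn.Linear(dim, 1)   # prediction from decoupled features
        self.cls_causal = nn.Linear(dim, 1)   # prediction after intervention

    def forward(self, spatial_feats: torch.Tensor) -> torch.Tensor:
        # spatial_feats: (B, HW, dim), the backbone feature map flattened over space.
        B = spatial_feats.size(0)
        queries = self.label_queries.unsqueeze(0).expand(B, -1, -1)
        label_feats = self.decoder(queries, spatial_feats)            # (B, C, dim)
        # Causal intervention: each label feature attends to the confounders.
        conf = self.confounders.unsqueeze(0).expand(B, -1, -1)
        intervened, _ = self.cross_attn(label_feats, conf, conf)      # (B, C, dim)
        # Combine the direct prediction with the correlation-aware one.
        logits = self.cls_direct(label_feats) + self.cls_causal(intervened)
        return logits.squeeze(-1)                                     # (B, C) label logits
```

Keeping the confounder dictionary fixed reflects that the clusters are computed offline over training images, while the two heads mirror the combination of predictions from decoupled features and causal label correlations.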
Recognize Any Regions
Haosen Yang, Chuofan Ma, Bin Wen, Yi Jiang
Understanding the semantics of individual regions or patches of unconstrained images, as in open-world object detection, remains a critical yet challenging task in computer vision. Building on the success of powerful image-level vision-language (ViL) foundation models like CLIP, recent efforts have sought to harness their capabilities by either training a contrastive model from scratch with an extensive collection of region-label pairs or aligning the outputs of a detection model with image-level representations of region proposals. Despite notable progress, these approaches are plagued by computationally intensive training requirements, susceptibility to data noise, and a deficiency in contextual information. To address these limitations, we explore the synergistic potential of off-the-shelf foundation models, leveraging their respective strengths in localization and semantics. We introduce a novel, generic, and efficient architecture, named RegionSpot, designed to integrate position-aware localization knowledge from a localization foundation model (e.g., SAM) with semantic information from a ViL model (e.g., CLIP). To fully exploit pretrained knowledge while minimizing training overhead, we keep both foundation models frozen, focusing optimization efforts solely on a lightweight attention-based knowledge integration module. Extensive experiments in open-world object recognition show that RegionSpot achieves significant performance gains over prior alternatives, along with substantial computational savings (e.g., training our model on 3 million data samples in a single day using 8 V100 GPUs). RegionSpot outperforms GLIP-L by 2.9 mAP on the LVIS val set, with an even larger margin of 13.1 AP on the more challenging rare categories, and a 2.5 AP increase on ODinW. Furthermore, it exceeds GroundingDINO-L by 11.0 AP on rare categories of the LVIS minival set.
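The core idea, frozen localization and ViL backbones combined through a small trainable fusion module, could look roughly like the sketch below. Names such as sam_mask_tokens and clip_patch_feats are assumptions made for illustration and do not reflect the actual RegionSpot code.

```python
import torch
import torch.nn as nn

class RegionSemanticFusion(nn.Module):
    """Hypothetical sketch: region tokens attend to image-level ViL features."""
    def __init__(self, region_dim: int, vil_dim: int, hidden_dim: int = 512):
        super().__init__()
        self.proj_region = nn.Linear(region_dim, hidden_dim)
        self.proj_vil = nn.Linear(vil_dim, hidden_dim)
        # The only trainable block: cross-attention from regions to image features.
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        self.out = nn.Linear(hidden_dim, vil_dim)  # project back to the ViL embedding space

    def forward(self, sam_mask_tokens: torch.Tensor, clip_patch_feats: torch.Tensor) -> torch.Tensor:
        # sam_mask_tokens:  (B, R, region_dim), one token per region proposal (frozen SAM).
        # clip_patch_feats: (B, P, vil_dim), patch features of the whole image (frozen CLIP).
        q = self.proj_region(sam_mask_tokens)
        kv = self.proj_vil(clip_patch_feats)
        fused, _ = self.attn(q, kv, kv)
        return self.out(fused)  # (B, R, vil_dim) region embeddings

# Usage idea: run SAM and CLIP under torch.no_grad(), train only this module, and
# classify each region by cosine similarity with CLIP text embeddings of category names.
```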
Multi-Object Hallucination in Vision Language Models
Xuweiyi Chen, Ziqiao Ma
Large vision language models (LVLMs) often suffer from object hallucination, producing objects that are not present in the given images. While current benchmarks for object hallucination primarily concentrate on the presence of a single object class rather than individual entities, this work systematically investigates multi-object hallucination, examining how models misperceive (e.g., invent nonexistent objects or become distracted) when tasked with focusing on multiple objects simultaneously. We introduce Recognition-based Object Probing Evaluation (ROPE), an automated evaluation protocol that considers the distribution of object classes within a single image during testing and uses visual referring prompts to eliminate ambiguity. Through comprehensive empirical studies and an analysis of potential factors leading to multi-object hallucination, we find that (1) LVLMs suffer more hallucinations when focusing on multiple objects than when focusing on a single object.
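A probing protocol of this kind can be sketched as a simple evaluation loop. The query format and the model interface below are hypothetical and serve only to illustrate how multi-object referring prompts and per-object checks might be organized.

```python
def probe_multi_object(model, image, referred_objects):
    """Hypothetical probe: referred_objects is a list of (bounding_box, gt_class) pairs
    for one image; the model is asked about all of them in a single query."""
    prompt = "For each marked region, name the object class: " + \
             "; ".join(f"<region {i}>" for i in range(len(referred_objects)))
    # Assumed interface: returns one answer string per referred region.
    answers = model.generate(image, prompt, boxes=[box for box, _ in referred_objects])
    present_classes = {cls for _, cls in referred_objects}
    results = []
    for (box, gt_class), answer in zip(referred_objects, answers):
        results.append({
            "gt": gt_class,
            "pred": answer,
            # Hallucination here means naming a class that exists nowhere in the image.
            "hallucinated": answer != gt_class and answer not in present_classes,
        })
    return results
```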
Generating Compositional Scenes via Text-to-image RGBA Instance Generation
Petru-Daniel Tudosiu, Yongxin Yang, Shifeng Zhang
University of Edinburgh; Huawei Noah's Ark Lab
Text-to-image diffusion generative models can generate high-quality images at the cost of tedious prompt engineering. Controllability can be improved by introducing layout conditioning; however, existing methods lack layout editing ability and fine-grained control over object attributes. The concept of multi-layer generation holds great potential to address these limitations; however, generating image instances concurrently with scene composition limits control over fine-grained object attributes, relative positioning in 3D space, and scene manipulation abilities. In this work, we propose a novel multi-stage generation paradigm that is designed for fine-grained control, flexibility, and interactivity. To ensure control over instance attributes, we devise a novel training paradigm to adapt a diffusion model to generate isolated scene components as RGBA images with transparency information. To build complex images, we employ these pre-generated instances and introduce a multi-layer composite generation process that smoothly assembles components into realistic scenes. Our experiments show that our RGBA diffusion model is capable of generating diverse and high-quality instances with precise control over object attributes. Through multi-layer composition, we demonstrate that our approach allows images to be built and manipulated from highly complex prompts with fine-grained control over object appearance and location, granting a higher degree of control than competing methods.
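The composition step can be illustrated with standard back-to-front alpha blending of RGBA layers. This is a minimal sketch of the "over" operator only; the diffusion-based instance generation and any learned scene harmonization are omitted.

```python
import numpy as np

def composite_layers(background_rgb: np.ndarray, layers_rgba: list) -> np.ndarray:
    """background_rgb: (H, W, 3) floats in [0, 1]; layers_rgba: RGBA layers (H, W, 4),
    ordered back-to-front, with the alpha channel in [0, 1]."""
    out = background_rgb.astype(np.float32).copy()
    for layer in layers_rgba:
        rgb, alpha = layer[..., :3], layer[..., 3:4]
        out = rgb * alpha + out * (1.0 - alpha)  # standard "over" blending
    return out
```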
Bridge the Points: Graph-based Few-shot Segment Anything Semantically
The recent advancements in large-scale pre-training techniques have significantly enhanced the capabilities of vision foundation models, notably the Segment Anything Model (SAM), which can generate precise masks based on point and box prompts. Recent studies extend SAM to Few-shot Semantic Segmentation (FSS), focusing on prompt generation for SAM-based automatic semantic segmentation. However, these methods struggle with selecting suitable prompts, require specific hyperparameter settings for different scenarios, and experience prolonged one-shot inference times due to the overuse of SAM, resulting in low efficiency and limited automation. To address these issues, we propose a simple yet effective approach based on graph analysis. In particular, a Positive-Negative Alignment module dynamically selects the point prompts used to generate masks, notably uncovering the potential of the background context as a negative reference. A subsequent Point-Mask Clustering module aligns the granularity of masks and selected points as a directed graph, based on mask coverage over points. These points are then aggregated efficiently by decomposing the weakly connected components of the directed graph, constructing distinct natural clusters.
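The grouping idea, point prompts and SAM masks linked in a directed graph whose weakly connected components form the clusters, can be sketched as follows. The graph construction details are assumptions; networkx is used only for brevity.

```python
import networkx as nx

def cluster_points_by_masks(point_ids, mask_ids, covers):
    """point_ids and mask_ids must be distinct identifiers (e.g. "p0", "m3");
    covers(mask_id, point_id) -> bool, True if the mask contains the point."""
    g = nx.DiGraph()
    g.add_nodes_from(point_ids, kind="point")
    g.add_nodes_from(mask_ids, kind="mask")
    for m in mask_ids:
        for p in point_ids:
            if covers(m, p):
                g.add_edge(m, p)  # edge direction: mask -> covered point
    # Each weakly connected component groups masks and points that share coverage.
    return list(nx.weakly_connected_components(g))
```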
Personalized Instance-based Navigation Toward User-Specific Objects in Realistic Environments Supplemental Material
A limitation of this work relates to the visual appearance of some of the object instances in the PInNED dataset. For example, the rendering of the Habitat simulator [61] can cause a deterioration in the texture quality of some objects, failing to accurately reproduce them in the environment. Moreover, instances with very small or detailed components can also exhibit a degradation in their visual fidelity when instantiated in the simulator. Consequently, as the agent moves farther from these objects, their details become less discernible. As a direct consequence, detecting small target objects is a critical challenge for navigation agents tackling the PIN task. This behavior is showcased in Sec. E, where agents tackling the PIN task in the episodes of the PInNED dataset face significant challenges in successfully detecting instances of inherently small object categories. In fact, although agents such as the modular agent with DINOv2 [51] showcase good performance on the overall PIN task, detecting small objects remains one of the main limitations of current object-driven agents, as such objects can only be recognized when the robot is close to them. A possible future improvement could involve designing novel exploration policies that aim to bring the robot closer to surfaces where the target might be placed, while leveraging detection criteria that take into consideration the scale of the observed objects. The introduction of the Personalized Instance-based Navigation (PIN) task and the accompanying PInNED dataset has the potential to advance the field of visual navigation and Embodied AI. The PIN task addresses the limitations of current datasets for embodied navigation by requiring agents to distinguish between multiple instances of objects from the same category, thereby enhancing their precision and robustness in real-world scenarios. This advancement can lead to more capable and reliable robotic assistants and autonomous systems, especially in household settings.
Personalized Instance-based Navigation Toward User-Specific Objects in Realistic Environments
In recent years, research interest in visual navigation towards objects in indoor environments has grown significantly. This growth can be attributed to the recent availability of large navigation datasets in photo-realistic simulated environments, such as Gibson and Matterport3D. However, the navigation tasks supported by these datasets are often restricted to the objects present in the environment at acquisition time. They also fail to account for the realistic scenario in which the target object is a user-specific instance that can be easily confused with similar objects and may be found in multiple locations within the environment. To address these limitations, we propose a new task, Personalized Instance-based Navigation (PIN), in which an embodied agent is tasked with locating and reaching a specific personal object by distinguishing it among multiple instances of the same category. The task is accompanied by PInNED, a dedicated new dataset composed of photo-realistic scenes augmented with additional 3D objects. In each episode, the target object is presented to the agent using two modalities: a set of visual reference images on a neutral background and manually annotated textual descriptions. Through comprehensive evaluations and analyses, we showcase the challenges of the PIN task as well as the performance and shortcomings of currently available methods designed for object-driven navigation, considering both modular and end-to-end agents.
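As a rough illustration of what a single episode specification could contain, based only on the description above, a hypothetical data structure might look like the following. Field names are assumptions, not the actual PInNED schema.

```python
from dataclasses import dataclass, field

@dataclass
class PINEpisode:
    scene_id: str                      # photo-realistic scene the agent is spawned in
    target_category: str               # e.g. "teddy bear"
    reference_images: list             # paths to images of the target on a neutral background
    text_descriptions: list            # manually annotated descriptions of the instance
    distractors: list = field(default_factory=list)  # same-category instances to distinguish from
    start_pose: tuple = ()             # agent starting position and orientation
    goal_position: tuple = ()          # location of the target instance in the scene
```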