AITopics

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Neural Information Processing Systems | Feb-7-2026, 23:04:09 GMT

27d52bcb3580724eb4cbe9f2718a9365-Paper.pdf

classification, focus area, scene classification, (15 more...)

Country: Asia > China > Beijing > Beijing (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Neural Information Processing Systems | Oct-2-2025, 18:27:54 GMT

Export Reviews, Discussions, Author Feedback and Meta-Reviews

First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. The authors propose a new large-scale scene image dataset that is 60 times bigger than the current standard SUN dataset. They show that deep networks trained on object-centric datasets like ImageNet are not optimal for scene recognition, and that training similar networks on large amounts of scene images improves their performance substantially. The other way around is also demonstrated empirically, i.e., the latter features work better on object-centric image classification tasks. Overall, the paper is well written and addresses an important problem in computer vision.

database, dataset, scene recognition, (12 more...)

Country: North America > Canada > Quebec > Montreal (0.05)

Technology: Information Technology > Artificial Intelligence > Vision (1.00)

Neural Information Processing Systems | Sep-30-2025, 08:34:34 GMT

Learning Deep Features for Scene Recognition using Places Database

Scene recognition is one of the hallmark tasks of computer vision, allowing definition of a context for object recognition. Whereas the tremendous recent progress in object recognition tasks is due to the availability of large datasets like ImageNet and the rise of Convolutional Neural Networks (CNNs) for learning high-level features, performance at scene recognition has not attained the same level of success. This may be because current deep features trained from ImageNet are not competitive enough for such tasks. Here, we introduce a new scene-centric database called Places with over 7 million labeled pictures of scenes. We propose new methods to compare the density and diversity of image datasets and show that Places is as dense as other scene datasets and has more diversity. Using CNNs, we learn deep features for scene recognition tasks, and establish new state-of-the-art results on several scene-centric datasets. A visualization of the CNN layers' responses allows us to show differences in the internal representations of object-centric and scene-centric networks.

learning deep feature, name change, scene recognition, (5 more...)

Technology:

Information Technology > Artificial Intelligence > Vision (0.84)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.60)

arXiv.org Artificial Intelligence | Jul-16-2025

Whom to Respond To? A Transformer-Based Model for Multi-Party Social Robot Interaction

Zhu, He, Miyoshi, Ryo, Okafuji, Yuki

Prior human-robot interaction (HRI) research has primarily focused on single-user interactions, where robots do not need to consider the timing or recipient of their responses. However, in multi-party interactions, such as at malls and hospitals, social robots must understand the context and decide both when and to whom they should respond. In this paper, we propose a Transformer-based multi-task learning framework to improve the decision-making process of social robots, particularly in multi-user environments. Considering the characteristics of HRI, we propose two novel loss functions: one that enforces constraints on active speakers to improve scene modeling, and another that guides response selection towards utterances specifically directed at the robot. Additionally, we construct a novel multi-party HRI dataset that captures real-world complexities, such as gaze misalignment. Experimental results demonstrate that our model achieves state-of-the-art performance in respond decisions, outperforming existing heuristic-based and single-task approaches. Our findings contribute to the development of socially intelligent social robots capable of engaging in natural and context-aware multi-party interactions.

large language model, machine learning, natural language, (18 more...)

2507.1096

Country: Asia > Japan (0.46)

Genre: Research Report > New Finding (0.68)

Industry: Health & Medicine (0.34)

Technology:

Information Technology > Artificial Intelligence > Robots > Robots in the Home (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial Intelligence | May-20-2025

Knowledge-enhanced Multi-perspective Video Representation Learning for Scene Recognition

Yu, Xuzheng, Jiang, Chen, Zhang, Wei, Gan, Tian, Chao, Linlin, Zhao, Jianan, Cheng, Yuan, Guo, Qingpei, Chu, Wei

With the explosive growth of video data in real-world applications, a comprehensive representation of videos becomes increasingly important. In this paper, we address the problem of video scene recognition, whose goal is to learn a high-level video representation to classify scenes in videos. Due to the diversity and complexity of video contents in realistic scenarios, this task remains a challenge. Most existing works identify scenes for videos only from visual or textual information in a temporal perspective, ignoring the valuable information hidden in single frames, while several earlier studies only recognize scenes for separate images in a non-temporal perspective. We argue that these two perspectives are both meaningful for this task and complementary to each other; meanwhile, externally introduced knowledge can also promote the comprehension of videos. We propose a novel two-stream framework to model video representations from multiple perspectives, i.e., temporal and non-temporal perspectives, and integrate the two perspectives in an end-to-end manner by self-distillation. Besides, we design a knowledge-enhanced feature fusion and label prediction method that contributes to naturally introducing knowledge into the task of video scene recognition. Experiments conducted on a real-world dataset demonstrate the effectiveness of our proposed method.

artificial intelligence, machine learning, natural language, (17 more...)

2401.04354

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Vision (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)

arXiv.org Artificial Intelligence | Mar-10-2025

Lightweight Multimodal Artificial Intelligence Framework for Maritime Multi-Scene Recognition

Xi, Xinyu, Yang, Hua, Zhang, Shentai, Liu, Yijie, Sun, Sijin, Fu, Xiuju

Maritime Multi-Scene Recognition is crucial for enhancing the capabilities of intelligent marine robotics, particularly in applications such as marine conservation, environmental monitoring, and disaster response. However, this task presents significant challenges due to environmental interference, where marine conditions degrade image quality, and the complexity of maritime scenes, which requires deeper reasoning for accurate recognition. Pure vision models alone are insufficient to address these issues. To overcome these limitations, we propose a novel multimodal Artificial Intelligence (AI) framework that integrates image data, textual descriptions and classification vectors generated by a Multimodal Large Language Model (MLLM), to provide richer semantic understanding and improve recognition accuracy. Our framework employs an efficient multimodal fusion mechanism to further enhance model robustness and adaptability in complex maritime environments. Experimental results show that our model achieves 98% accuracy, surpassing previous SOTA models by 3.5%. To optimize deployment on resource-constrained platforms, we adopt activation-aware weight quantization (AWQ) as a lightweight technique, reducing the model size to 68.75 MB with only a 0.5% accuracy drop while significantly lowering computational overhead. This work provides a high-performance solution for real-time maritime scene recognition, enabling Autonomous Surface Vehicles (ASVs) to support environmental monitoring and disaster response in resource-limited settings.

classification vector, modality, recognition, (15 more...)

2503.06978

Country:

Asia > China > Shanghai > Shanghai (0.05)
Asia > Singapore (0.04)

Genre: Research Report > New Finding (0.66)

Industry: Information Technology (0.68)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(4 more...)

arXiv.org Artificial Intelligence | Oct-15-2024

Have the VLMs Lost Confidence? A Study of Sycophancy in VLMs

Li, Shuo, Ji, Tao, Fan, Xiaoran, Lu, Linsheng, Yang, Leyi, Yang, Yuming, Xi, Zhiheng, Zheng, Rui, Wang, Yuran, Zhao, Xiaohui, Gui, Tao, Zhang, Qi, Huang, Xuanjing

In the study of LLMs, sycophancy represents a prevalent hallucination that poses significant challenges to these models. Specifically, LLMs often fail to adhere to original correct responses, instead blindly agreeing with users' opinions, even when those opinions are incorrect or malicious. However, research on sycophancy in visual language models (VLMs) has been scarce. In this work, we extend the exploration of sycophancy from LLMs to VLMs, introducing the MM-SY benchmark to evaluate this phenomenon. We present evaluation results from multiple representative models, addressing the gap in sycophancy research for VLMs. To mitigate sycophancy, we propose a synthetic dataset for training and employ methods based on prompts, supervised fine-tuning, and DPO. Our experiments demonstrate that these methods effectively alleviate sycophancy in VLMs. Additionally, we probe VLMs to assess the semantic impact of sycophancy and analyze the attention distribution of visual tokens. Our findings indicate that the ability to prevent sycophancy is predominantly observed in higher layers of the model. The lack of attention to image knowledge in these higher layers may contribute to sycophancy, and enhancing image attention at high layers proves beneficial in mitigating this issue.

large language model, machine learning, natural language, (19 more...)

2410.11302

Country:

Asia > Middle East > Israel (0.04)
North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
North America > Canada > Ontario > Toronto (0.04)
(3 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Meißner, Pascal, Dillmann, Rüdiger

Implicit Shape Model Trees: Recognition of 3-D Indoor Scenes and Prediction of Object Poses for Mobile Robots

arXiv.org Artificial Intelligence | Oct-20-2023

We present an approach for mobile robots to recognize scenes in object arrangements distributed across cluttered environments. Recognition is enabled by intertwining the robot's search for objects and the assignment of found objects to scenes. Our scene model called "Implicit Shape Model (ISM) trees" allows these two tasks to be solved jointly. This article presents novel algorithms for ISM trees to recognize scenes and predict poses of searched objects. We define scenes as object sets in which some objects are connected via 3-D spatial relations. In previous work, we recognized scenes with single ISMs. However, single ISMs are prone to false positives. As a remedy, we have developed ISM trees, a hierarchical model consisting of multiple ISMs. This article contributes a recognition algorithm that now enables the use of ISM trees for scene recognition. ISM trees should ideally be generated from human demonstrations of object arrangements. As a suitable algorithm was not available, we introduce such a generation algorithm. In line with the active vision paradigm, we combined scene recognition and object search in previous work. However, an efficient algorithm was lacking to make this combination effective. Physical experiments show that this is now overcome with a new algorithm achieving efficient combination through predicted object poses.

ism tree, relation, topology, (14 more...)

doi: 10.3390/robotics12060158

2301.10672

Country:

North America > United States (0.04)
Europe > Germany > Baden-Württemberg > Karlsruhe Region > Karlsruhe (0.04)
Europe > United Kingdom > Scotland > City of Aberdeen > Aberdeen (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Robots > Locomotion (0.60)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.48)

Ramachandran, Saravanabalagi, Horgan, Jonathan, Sistu, Ganesh, McDonald, John

Fast and Efficient Scene Categorization for Autonomous Driving using VAEs

arXiv.org Artificial Intelligence | Oct-26-2022

Scene categorization is a useful precursor task that provides prior knowledge for many advanced computer vision tasks with a broad range of applications in content-based image indexing and retrieval systems. Despite the success of data-driven approaches in computer vision tasks such as object detection and semantic segmentation, their application in learning high-level features for scene recognition has not achieved the same level of success. We propose to generate a fast and efficient intermediate interpretable generalized global descriptor that captures coarse features from the image, and to use a classification head to map the descriptors to 3 scene categories: Rural, Urban and Suburban. We train a Variational Autoencoder in an unsupervised manner, map images to a constrained multi-dimensional latent space, and use the latent vectors as compact embeddings that serve as global descriptors for images. The experimental results show that the VAE latent vectors capture coarse information from the image, supporting their usage as global descriptors. The proposed global descriptor is very compact with an embedding length of 128, significantly faster to compute, and is robust to seasonal and illumination changes, while capturing sufficient scene information required for scene categorization.

artificial intelligence, descriptor, machine learning, (15 more...)

doi: 10.56541/SUHE3553

2210.14981

Country:

North America > United States > Utah (0.04)
North America > Canada > Ontario > Toronto (0.04)
Europe > Ireland (0.04)
(7 more...)

Genre: Research Report (0.50)

Industry:

Automobiles & Trucks (0.65)
Transportation > Ground > Road (0.51)
Information Technology > Robotics & Automation (0.41)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)