AITopics

Technology:

Information Technology > Artificial Intelligence > Vision (0.87)
Information Technology > Artificial Intelligence > Machine Learning (0.71)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.42)

Neural Information Processing Systems, Feb-13-2026, 17:57:28 GMT

Stacked Semantics-Guided Attention Model for Fine-Grained Zero-Shot Learning

Yunlong Yu, Zhong Ji, Yanwei Fu, Jichang Guo, Yanwei Pang, Zhongfei (Mark) Zhang

Neural Information Processing Systems http://nips.cc/

class semantic description, class semantic feature, dataset, (13 more...)

Country:

North America > United States > New York > Broome County > Binghamton (0.04)
Asia > China > Tianjin Province > Tianjin (0.04)
North America > Canada > Quebec > Montreal (0.04)

Genre: Research Report (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.47)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.45)

Neural Information Processing Systems, Feb-9-2026, 11:13:45 GMT

792dd774336314c3c27a04bb260cf2cf-Supplemental.pdf

Finally, we train our model for 8 hours on a single V100 GPU. We provide an illustration of our weakly supervised phrase grounding model in Figure 4b (this supplemental). Specifically, we create context-preserving negative captions for an image by substituting a noun in its original caption with negative nouns that are sampled from a pretrained BERT [17] model. For example, in the case where only one cross-attention layer is used, adding the sentence-level contrastive loss leads to a 2.5% in the R@1 accuracy. These videos contain transcribed narrations that are either uploaded manually by users or are the output of an automatic speech recognition (ASR) system.
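The noun-substitution step described in the snippet above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `make_negative_captions` is hypothetical, and the candidate nouns are passed in as a fixed list, whereas the snippet says they are sampled from a pretrained BERT model.

```python
def make_negative_captions(caption, noun, negative_nouns):
    """Return copies of `caption` with the first occurrence of `noun` swapped out.

    Assumption: `negative_nouns` is a ready-made candidate list. The source
    samples candidates from a pretrained BERT model; that step is not shown.
    """
    tokens = caption.split()
    if noun not in tokens:
        raise ValueError(f"noun {noun!r} does not occur in the caption")
    idx = tokens.index(noun)

    negatives = []
    for candidate in negative_nouns:
        if candidate == noun:
            # A candidate equal to the original noun would not be a negative.
            continue
        # Every other token is kept, so the caption's context is preserved.
        swapped = tokens[:idx] + [candidate] + tokens[idx + 1:]
        negatives.append(" ".join(swapped))
    return negatives


captions = make_negative_captions(
    "a dog runs on the beach", "dog", ["cat", "horse", "dog"]
)
```

Because only one token changes, each negative caption differs from the original in the substituted noun alone.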

artificial intelligence, machine learning, natural language, (19 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.52)
Information Technology > Artificial Intelligence > Machine Learning (0.47)

Neural Information Processing Systems, Nov-20-2025, 18:21:26 GMT

Stacked Semantics-Guided Attention Model for Fine-Grained Zero-Shot Learning

Yunlong Yu, Zhong Ji, Yanwei Fu, Jichang Guo, Yanwei Pang, Zhongfei (Mark) Zhang

Neural Information Processing Systems http://nips.cc/

class semantic feature, large language model, machine learning, (18 more...)

Country:

North America > United States > New York > Broome County > Binghamton (0.04)
Asia > China > Tianjin Province > Tianjin (0.04)
North America > Canada > Quebec > Montreal (0.04)

Genre: Research Report (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.48)
(2 more...)

arXiv.org Artificial Intelligence, Oct-1-2025

VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained Perception in VLMs

Liu, Peng, Shen, Haozhan, Fang, Chunxin, Sun, Zhicheng, Liao, Jiajia, Zhao, Tiancheng

Vision-Language Models (VLMs) excel at high-level scene understanding but falter on fine-grained perception tasks requiring precise localization. This failure stems from a fundamental mismatch, as generating exact numerical coordinates is a challenging task for language-centric architectures. In this paper, we introduce VLM-FO1, a novel framework that overcomes this limitation by reframing object-centric perception from a brittle coordinate generation problem into a robust feature retrieval task. Our method operates as a plug-and-play module that integrates with any pre-trained VLM. It leverages a Hybrid Fine-grained Region Encoder (HFRE), featuring a dual vision encoder, to generate powerful region tokens rich in both semantic and spatial detail. A token-based referencing system then enables the LLM to seamlessly reason about and ground language in these specific visual regions. Experiments show that VLM-FO1 achieves state-of-the-art performance across a diverse suite of benchmarks, demonstrating exceptional capabilities in object grounding, region generational understanding, and visual region reasoning. Crucially, our two-stage training strategy ensures that these perception gains are achieved without compromising the base model's general visual understanding capabilities. VLM-FO1 establishes an effective and flexible paradigm for building perception-aware VLMs, bridging the gap between high-level reasoning and fine-grained visual grounding.

arxiv preprint arxiv, large language model, machine learning, (16 more...)

2509.25916

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Neural Information Processing Systems, Aug-15-2025, 07:53:21 GMT

A Supplementary

In this supplementary material, we provide the following additions to the main submission: A.1. We use ReLU as the activation function. We provide an illustration of our weakly supervised phrase grounding model in Figure 4b (this supplemental). To incorporate our proposed CoMMA into the model of Gupta et al. Finally, the sentence loss is weighted by a hyperparameter.

annotation, cross-attention layer, interaction, (16 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.34)

Katsumata, Kei, Kambara, Motonari, Yashima, Daichi, Korekata, Ryosuke, Sugiura, Komei

Mobile Manipulation Instruction Generation from Multiple Images with Automatic Metric Enhancement

arXiv.org Artificial Intelligence, Jan-28-2025

We consider the problem of generating free-form mobile manipulation instructions based on a target object image and receptacle image. Conventional image captioning models are not able to generate appropriate instructions because their architectures are typically optimized for a single image. In this study, we propose a model that handles both the target object and receptacle to generate free-form instruction sentences for mobile manipulation tasks. Moreover, we introduce a novel training method that effectively incorporates the scores from both learning-based and n-gram based automatic evaluation metrics as rewards. This method enables the model to learn the co-occurrence relationships between words and appropriate paraphrases. Therefore, models are required to appropriately handle both images. Hence, these methods are inappropriate for generating mobile manipulation instructions based on multiple images. We propose a model that generates mobile manipulation instructions using a target object image and a receptacle [...] essential in a variety of contexts such as elderly care facilities and daily support for disabilities. In particular, the integration of service robots in elderly care facilities significantly reduces the burden on caregivers and addresses the growing demand driven by the rise in the elderly population.

large language model, machine learning, natural language, (19 more...)

2501.17022

Country: Asia > Japan > Honshū > Kantō > Kanagawa Prefecture > Yokohama (0.04)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)

arXiv.org Artificial Intelligence, Jul-28-2024

Urban Traffic Accident Risk Prediction Revisited: Regionality, Proximity, Similarity and Sparsity

Chen, Minxiao, Yuan, Haitao, Jiang, Nan, Bao, Zhifeng, Wang, Shangguang

Traffic accidents pose a significant risk to human health and property safety. Therefore, to prevent traffic accidents, predicting their risks has garnered growing interest. We argue that a desired prediction solution should demonstrate resilience to the complexity of traffic accidents. In particular, it should adequately consider the regional background, accurately capture both spatial proximity and semantic similarity, and effectively address the sparsity of traffic accidents. However, these factors are often overlooked or difficult to incorporate. In this paper, we propose a novel multi-granularity hierarchical spatio-temporal network. Initially, we innovate by incorporating remote sensing data, facilitating the creation of a hierarchical multi-granularity structure and the comprehension of regional background. We construct multiple high-level risk prediction tasks to enhance the model's ability to cope with sparsity. Subsequently, to capture both spatial proximity and semantic similarity, the region features and multi-view graphs are encoded to distill effective representations. Additionally, we propose a message passing and adaptive temporal attention module that bridges different granularities and dynamically captures time correlations inherent in traffic accident patterns. Finally, a multivariate hierarchical loss function is devised considering the complexity of the prediction purpose. Extensive experiments on two real datasets verify the superiority of our model against the state-of-the-art methods.

artificial intelligence, machine learning, natural language, (18 more...)

doi: 10.1145/3627673.3679567

2407.19668

Country:

North America > United States > Illinois > Cook County > Chicago (0.06)
North America > United States > Idaho > Ada County > Boise (0.05)
Asia > China > Beijing > Beijing (0.05)
(6 more...)

Genre: Research Report > Promising Solution (0.34)

Industry:

Health & Medicine (0.87)
Transportation > Ground > Road (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.87)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.70)

arXiv.org Artificial Intelligence, Apr-15-2024

Design and Analysis of Efficient Attention in Transformers for Social Group Activity Recognition

Tamura, Masato

Social group activity recognition is a challenging task extended from group activity recognition, where social groups must be recognized with their activities and group members. Existing methods tackle this task by leveraging region features of individuals following existing group activity recognition methods. However, the effectiveness of region features is susceptible to person localization and variable semantics of individual actions. To overcome these issues, we propose leveraging attention modules in transformers to generate social group features. In this method, multiple embeddings are used to aggregate features for a social group, each of which is assigned to a group member without duplication. Due to this non-duplicated assignment, the number of embeddings must be significant to avoid missing group members and thus renders attention in transformers ineffective. To find optimal attention designs with a large number of embeddings, we explore several design choices of queries for feature aggregation and self-attention modules in transformer decoders. Extensive experimental results show that the proposed method achieves state-of-the-art performance and verify that the proposed attention designs are highly effective on social group activity recognition.

activity recognition, group activity recognition, recognition, (15 more...)

2404.09964

Country: North America > United States > California (0.04)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Data Science (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

arXiv.org Artificial Intelligence, Dec-7-2023

Urban Region Representation Learning with Attentive Fusion

Sun, Fengze, Qi, Jianzhong, Chang, Yanchuan, Fan, Xiaoliang, Karunasekera, Shanika, Tanin, Egemen

An increasing number of related urban data sources have brought forth novel opportunities for learning urban region representations, i.e., embeddings. The embeddings describe latent features of urban regions and enable discovering similar regions for urban planning applications. Existing methods learn an embedding for a region using every different type of region feature data, and subsequently fuse all learned embeddings of a region to generate a unified region embedding. However, these studies often overlook the significance of the fusion process. The typical fusion methods rely on simple aggregation, such as summation and concatenation, thereby disregarding correlations within the fused region embeddings. To address this limitation, we propose a novel model named HAFusion. Our model is powered by a dual-feature attentive fusion module named DAFusion, which fuses embeddings from different region features to learn higher-order correlations between the regions as well as between the different types of region features. DAFusion is generic - it can be integrated into existing models to enhance their fusion process. Further, motivated by the effective fusion capability of an attentive module, we propose a hybrid attentive feature learning module named HALearning to enhance the embedding learning from each individual type of region features. Extensive experiments on three real-world datasets demonstrate that our model HAFusion outperforms state-of-the-art methods across three different prediction tasks. Using our learned region embedding leads to consistent and up to 31% improvements in the prediction accuracy.
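To make the contrast in the abstract above concrete, the sketch below compares the two simple aggregation schemes it names (summation and concatenation) with a generic attention-weighted fusion. It is an illustration under stated assumptions, not HAFusion or DAFusion: the function names, the `query` vector, and the dot-product scoring are hypothetical choices made for this example.

```python
import math


def fuse_sum(embeddings):
    """Element-wise sum of per-feature embeddings of one region."""
    return [sum(values) for values in zip(*embeddings)]


def fuse_concat(embeddings):
    """Concatenate per-feature embeddings into one longer vector."""
    return [value for emb in embeddings for value in emb]


def fuse_attentive(embeddings, query):
    """Weighted sum whose weights are a softmax over dot-product scores.

    Assumption: `query` is a hypothetical scoring vector; a trained model
    would learn such parameters rather than take them as input.
    """
    scores = [sum(q * e for q, e in zip(query, emb)) for emb in embeddings]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [x / total for x in exps]
    dim = len(embeddings[0])
    return [
        sum(w * emb[i] for w, emb in zip(weights, embeddings))
        for i in range(dim)
    ]


# Two toy embeddings of one region, e.g. from two feature types.
views = [[1.0, 0.0], [0.0, 1.0]]
summed = fuse_sum(views)
concatenated = fuse_concat(views)
attended = fuse_attentive(views, query=[0.0, 0.0])
```

Summation and concatenation treat every feature type identically, while the attention variant lets the weights depend on the embeddings themselves. With a zero query, all scores tie and the result reduces to a plain average.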

correlation, matrix, module, (16 more...)