AITopics | perceiver

Collaborating Authors

perceiver

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Asking like Socrates: Socrates helps VLMs understand remote sensing images

Shao, Run, Li, Ziyu, Zhang, Zhaoyang, Xu, Linrui, He, Xinran, Yuan, Hongyuan, He, Bolei, Dai, Yongxing, Yan, Yiming, Chen, Yijun, Guo, Wang, Li, Haifeng

arXiv.org Artificial IntelligenceDec-1-2025

Recent multimodal reasoning models, inspired by DeepSeek-R1, have significantly advanced vision-language systems. However, in remote sensing (RS) tasks, we observe widespread pseudo reasoning: models narrate the process of reasoning rather than genuinely reason toward the correct answer based on visual evidence. We attribute this to the Glance Effect, where a single, coarse perception of large-scale RS imagery results in incomplete understanding and reasoning based on linguistic self-consistency instead of visual evidence. To address this, we propose RS-EoT (Remote Sensing Evidence-of-Thought), a language-driven, iterative visual evidence-seeking paradigm. To instill this paradigm, we propose SocraticAgent, a self-play multi-agent system that synthesizes reasoning traces via alternating cycles of reasoning and visual inspection. To enhance and generalize these patterns, we propose a two-stage progressive RL strategy: first, RL on fine-grained Grounding tasks to enhance RS-EoT capabilities, followed by RL on RS VQA to generalize to broader understanding scenarios. Experiments show RS-EoT achieves state-of-the-art performance on multiple RS VQA and grounding benchmarks. Analyses reveal clear iterative cycles of reasoning and evidence seeking, confirming RS-EoT mitigates the Glance Effect and enables genuine evidence-grounded reasoning. Our code, data, and models are available at https://geox-lab.github.io/Asking_like_Socrates

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2511.22396

Country:

Asia > China (0.46)
Europe > Austria (0.28)

Genre: Research Report > New Finding (0.46)

Industry: Energy > Renewable > Geothermal > Geothermal Energy Exploration and Development > Geophysical Analysis & Survey (0.96)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Add feedback

Cognition-of-Thought Elicits Social-Aligned Reasoning in Large Language Models

Zhang, Xuanming, Chen, Yuxuan, Yeh, Samuel, Li, Sharon

arXiv.org Artificial IntelligenceOct-15-2025

Large language models (LLMs) excel at complex reasoning but can still exhibit harmful behaviors. Current alignment strategies typically embed safety into model weights, making these controls implicit, static, and difficult to modify. This paper introduces Cognition-of-Thought (CooT), a novel decoding-time framework that equips LLMs with an explicit cognitive self-monitoring loop. CooT couples a standard text Generator with a cognitive Perceiver that continuously monitors the unfolding sequence. The Perceiver uses a structured, precedence-based hierarchy of principles (e.g., safety over obedience) to detect potential misalignments as they arise. When violations are flagged, CooT intervenes by rolling back the generation to the point of error and regenerating under injected guidance that combines universal social priors with context-specific warnings. CooT thus transforms alignment from a fixed property into an explicit, dynamic, and auditable process active during inference, allowing for flexible policy updates without retraining the model. Extensive experiments across multiple benchmarks and model families confirm that CooT consistently improves safety and social reasoning performance.

coot, large language model, machine learning, (21 more...)

arXiv.org Artificial Intelligence

2509.23441

Country:

North America > United States (0.45)
Europe (0.27)

Genre:

Research Report > New Finding (0.45)
Research Report > Experimental Study (0.45)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Law (1.00)
Information Technology > Security & Privacy (1.00)
(6 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Add feedback

Which Consciousness Can Be Artificialized? Local Percept-Perceiver Phenomenon for the Existence of Machine Consciousness

Singh, Shri Lal Raghudev Ram

arXiv.org Artificial IntelligenceJun-25-2025

This paper presents a novel paradigm of the local percept-perceiver phenomenon to formalize certain observations in neuroscientific theories of consciousness. Using this model, a set-theoretic formalism is developed for artificial systems, and the existence of machine consciousness is proved by invoking Zermelo-Fraenkel set theory. The article argues for the possibility of a reductionist form of epistemic consciousness within machines.

artificial intelligence, consciousness, perceiver, (13 more...)

arXiv.org Artificial Intelligence

2506.18935

Country:

North America > United States (0.28)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)

Genre: Research Report (0.70)

Industry: Health & Medicine > Therapeutic Area > Neurology (1.00)

Technology:

Information Technology > Artificial Intelligence > Cognitive Science (0.96)
Information Technology > Artificial Intelligence > Issues > Philosophy (0.35)

Add feedback

RoboGround: Robotic Manipulation with Grounded Vision-Language Priors

Huang, Haifeng, Chen, Xinyi, Chen, Yilun, Li, Hao, Han, Xiaoshen, Wang, Zehan, Wang, Tai, Pang, Jiangmiao, Zhao, Zhou

arXiv.org Artificial IntelligenceMay-1-2025

Recent advancements in robotic manipulation have highlighted the potential of intermediate representations for improving policy generalization. In this work, we explore grounding masks as an effective intermediate representation, balancing two key advantages: (1) effective spatial guidance that specifies target objects and placement areas while also conveying information about object shape and size, and (2) broad generalization potential driven by large-scale vision-language models pretrained on diverse grounding datasets. We introduce RoboGround, a grounding-aware robotic manipulation system that leverages grounding masks as an intermediate representation to guide policy networks in object manipulation tasks. To further explore and enhance generalization, we propose an automated pipeline for generating large-scale, simulated data with a diverse set of objects and instructions. Extensive experiments show the value of our dataset and the effectiveness of grounding masks as intermediate guidance, significantly enhancing the generalization abilities of robot policies.

artificial intelligence, large language model, natural language, (17 more...)

arXiv.org Artificial Intelligence

2504.2153

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

MP-GUI: Modality Perception with MLLMs for GUI Understanding

Wang, Ziwei, Chen, Weizhi, Yang, Leyang, Zhou, Sheng, Zhao, Shengchu, Zhan, Hanbei, Jin, Jiongchao, Li, Liangcheng, Shao, Zirui, Bu, Jiajun

arXiv.org Artificial IntelligenceMar-18-2025

Graphical user interface (GUI) has become integral to modern society, making it crucial to be understood for human-centric systems. However, unlike natural images or documents, GUIs comprise artificially designed graphical elements arranged to convey specific semantic meanings. Current multi-modal large language models (MLLMs) already proficient in processing graphical and textual components suffer from hurdles in GUI understanding due to the lack of explicit spatial structure modeling. Moreover, obtaining high-quality spatial structure data is challenging due to privacy issues and noisy environments. To address these challenges, we present MP-GUI, a specially designed MLLM for GUI understanding. MP-GUI features three precisely specialized perceivers to extract graphical, textual, and spatial modalities from the screen as GUI-tailored visual clues, with spatial structure refinement strategy and adaptively combined via a fusion gate to meet the specific preferences of different GUI understanding tasks. To cope with the scarcity of training data, we also introduce a pipeline for automatically data collecting. Extensive experiments demonstrate that MP-GUI achieves impressive results on various GUI understanding tasks with limited data.

large language model, machine learning, mp-gui, (21 more...)

arXiv.org Artificial Intelligence

2503.14021

Country:

Asia > China (0.04)
North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
North America > Canada > Quebec > Montreal (0.04)
(6 more...)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (0.54)

Technology:

Information Technology > Human Computer Interaction > Interfaces (1.00)
Information Technology > Graphics (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
(3 more...)

Add feedback

Language Models Predict Empathy Gaps Between Social In-groups and Out-groups

Hou, Yu, Daumé, Hal III, Rudinger, Rachel

arXiv.org Artificial IntelligenceMar-2-2025

Studies of human psychology have demonstrated that people are more motivated to extend empathy to in-group members than out-group members (Cikara et al., 2011). In this study, we investigate how this aspect of intergroup relations in humans is replicated by LLMs in an emotion intensity prediction task. In this task, the LLM is given a short description of an experience a person had that caused them to feel a particular emotion; the LLM is then prompted to predict the intensity of the emotion the person experienced on a numerical scale. By manipulating the group identities assigned to the LLM's persona (the "perceiver") and the person in the narrative (the "experiencer"), we measure how predicted emotion intensities differ between in-group and out-group settings. We observe that LLMs assign higher emotion intensity scores to in-group members than out-group members. This pattern holds across all three types of social groupings we tested: race/ethnicity, nationality, and religion. We perform an in-depth analysis on Llama-3.1-8B, the model which exhibited strongest intergroup bias among those tested.

computational linguistic, intensity, llama-3, (16 more...)

arXiv.org Artificial Intelligence

2503.0103

Country:

Europe > United Kingdom (0.14)
Asia > Middle East > Palestine (0.04)
Asia > Middle East > Israel (0.04)
(29 more...)

Genre: Research Report > New Finding (0.87)

Industry:

Health & Medicine > Therapeutic Area (0.46)
Government > Regional Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

ReWind: Understanding Long Videos with Instructed Learnable Memory

Diko, Anxhelo, Wang, Tinghuai, Swaileh, Wassim, Sun, Shiyan, Patras, Ioannis

arXiv.org Artificial IntelligenceNov-23-2024

Vision-Language Models (VLMs) are crucial for applications requiring integrated understanding textual and visual information. However, existing VLMs struggle with long videos due to computational inefficiency, memory limitations, and difficulties in maintaining coherent understanding across extended sequences. To address these challenges, we introduce ReWind, a novel memory-based VLM designed for efficient long video understanding while preserving temporal fidelity. ReWind operates in a two-stage framework. In the first stage, ReWind maintains a dynamic learnable memory module with a novel \textbf{read-perceive-write} cycle that stores and updates instruction-relevant visual information as the video unfolds. This module utilizes learnable queries and cross-attentions between memory contents and the input stream, ensuring low memory requirements by scaling linearly with the number of tokens. In the second stage, we propose an adaptive frame selection mechanism guided by the memory content to identify instruction-relevant key moments. It enriches the memory representations with detailed spatial information by selecting a few high-resolution frames, which are then combined with the memory contents and fed into a Large Language Model (LLM) to generate the final answer. We empirically demonstrate ReWind's superior performance in visual question answering (VQA) and temporal grounding tasks, surpassing previous methods on long video benchmarks. Notably, ReWind achieves a +13\% score gain and a +12\% accuracy improvement on the MovieChat-1K VQA dataset and an +8\% mIoU increase on Charades-STA for temporal grounding.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2411.15556

Country: Europe > Finland > Uusimaa > Helsinki (0.04)

Genre: Research Report (0.64)

Industry: Leisure & Entertainment (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

E3D-GPT: Enhanced 3D Visual Foundation for Medical Vision-Language Model

Lai, Haoran, Jiang, Zihang, Yao, Qingsong, Wang, Rongsheng, He, Zhiyang, Tao, Xiaodong, Wei, Wei, Lv, Weifu, Zhou, S. Kevin

arXiv.org Artificial IntelligenceOct-18-2024

The development of 3D medical vision-language models holds significant potential for disease diagnosis and patient treatment. However, compared to 2D medical images, 3D medical images, such as CT scans, face challenges related to limited training data and high dimension, which severely restrict the progress of 3D medical vision-language models. To address these issues, we collect a large amount of unlabeled 3D CT data and utilize self-supervised learning to construct a 3D visual foundation model for extracting 3D visual features. Then, we apply 3D spatial convolutions to aggregate and project high-level image features, reducing computational complexity while preserving spatial information. We also construct two instruction-tuning datasets based on BIMCV-R and CT-RATE to fine-tune the 3D vision-language model. Our model demonstrates superior performance compared to existing methods in report generation, visual question answering, and disease diagnosis. Code and data will be made publicly available soon.

information, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2410.142

Country:

Asia > China > Anhui Province > Hefei (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report (0.64)

Industry: Health & Medicine > Diagnostic Medicine > Imaging (1.00)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.98)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Investigating Role of Big Five Personality Traits in Audio-Visual Rapport Estimation

Hayashi, Takato, Kimura, Ryusei, Ishii, Ryo, Okada, Shogo

arXiv.org Artificial IntelligenceOct-7-2024

Automatic rapport estimation in social interactions is a central component of affective computing. Recent reports have shown that the estimation performance of rapport in initial interactions can be improved by using the participant's personality traits as the model's input. In this study, we investigate whether this findings applies to interactions between friends by developing rapport estimation models that utilize nonverbal cues (audio and facial expressions) as inputs. Our experimental results show that adding Big Five features (BFFs) to nonverbal features can improve the estimation performance of self-reported rapport in dyadic interactions between friends. Next, we demystify how BFFs improve the estimation performance of rapport through a comparative analysis between models with and without BFFs. We decompose rapport ratings into perceiver effects (people's tendency to rate other people), target effects (people's tendency to be rated by other people), and relationship effects (people's unique ratings for a specific person) using the social relations model. We then analyze the extent to which BFFs contribute to capturing each effect. Our analysis demonstrates that the perceiver's and the target's BFFs lead estimation models to capture the perceiver and the target effects, respectively. Furthermore, our experimental results indicate that the combinations of facial expression features and BFFs achieve best estimation performances not only in estimating rapport ratings, but also in estimating three effects. Our study is the first step toward understanding why personality-aware estimation models of interpersonal perception accomplish high estimation performance.

artificial intelligence, machine learning, rapport, (16 more...)

arXiv.org Artificial Intelligence

2410.11861

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > Washington > King County > Seattle (0.04)
North America > Canada (0.04)
(4 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (0.89)
Information Technology > Artificial Intelligence > Cognitive Science (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Add feedback

PERSOMA: PERsonalized SOft ProMpt Adapter Architecture for Personalized Language Prompting

Hebert, Liam, Sayana, Krishna, Jash, Ambarish, Karatzoglou, Alexandros, Sodhi, Sukhdeep, Doddapaneni, Sumanth, Cai, Yanli, Kuzmin, Dima

arXiv.org Artificial IntelligenceAug-1-2024

Understanding the nuances of a user's extensive interaction history is key to building accurate and personalized natural language systems that can adapt to evolving user preferences. To address this, we introduce PERSOMA, Personalized Soft Prompt Adapter architecture. Unlike previous personalized prompting methods for large language models, PERSOMA offers a novel approach to efficiently capture user history. It achieves this by resampling and compressing interactions as free form text into expressive soft prompt embeddings, building upon recent research utilizing embedding representations as input for LLMs. We rigorously validate our approach by evaluating various adapter architectures, first-stage sampling strategies, parameter-efficient tuning techniques like LoRA, and other personalization methods. Our results demonstrate PERSOMA's superior ability to handle large and complex user histories compared to existing embedding-based and text-prompt-based techniques.

history, persoma, personalization, (12 more...)

arXiv.org Artificial Intelligence

2408.0096

Country:

North America > United States > California > Santa Clara County > Mountain View (0.05)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.05)
South America > Colombia > Meta Department > Villavicencio (0.04)
(5 more...)

Genre: Research Report > New Finding (0.86)

Industry:

Leisure & Entertainment (0.68)
Media > Film (0.47)
Health & Medicine (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.90)

Add feedback