AITopics | Pantazopoulos, Georgios

Collaborating Authors

Pantazopoulos, Georgios

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Evaluating Multimodal Language Models as Visual Assistants for Visually Impaired Users

Karamolegkou, Antonia, Nikandrou, Malvina, Pantazopoulos, Georgios, Villegas, Danae Sanchez, Rust, Phillip, Dhar, Ruchira, Hershcovich, Daniel, Søgaard, Anders

arXiv.org Artificial IntelligenceMar-28-2025

This paper explores the effectiveness of Multimodal Large Language models (MLLMs) as assistive technologies for visually impaired individuals. We conduct a user survey to identify adoption patterns and key challenges users face with such technologies. Despite a high adoption rate of these models, our findings highlight concerns related to contextual understanding, cultural sensitivity, and complex scene understanding, particularly for individuals who may rely solely on them for visual interpretation. Informed by these results, we collate five user-centred tasks with image and video inputs, including a novel task on Optical Braille Recognition. Our systematic evaluation of twelve MLLMs reveals that further advancements are necessary to overcome limitations related to cultural context, multilingual support, Braille reading comprehension, assistive object recognition, and hallucinations. This work provides critical insights into the future direction of multimodal AI for accessibility, underscoring the need for more inclusive, robust, and trustworthy visual assistance technologies.

artificial intelligence, large language model, natural language, (17 more...)

arXiv.org Artificial Intelligence

2503.2261

Country:

Europe (0.92)
Asia (0.92)
North America > United States (0.28)
North America > Mexico (0.28)

Genre:

Research Report > New Finding (1.00)
Questionnaire & Opinion Survey (1.00)

Industry:

Health & Medicine (1.00)
Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.93)

Add feedback

CROPE: Evaluating In-Context Adaptation of Vision and Language Models to Culture-Specific Concepts

Nikandrou, Malvina, Pantazopoulos, Georgios, Vitsakis, Nikolas, Konstas, Ioannis, Suglia, Alessandro

arXiv.org Artificial IntelligenceOct-20-2024

As Vision and Language models (VLMs) become accessible across the globe, it is important that they demonstrate cultural knowledge. In this paper, we introduce CROPE, a visual question answering benchmark designed to probe the knowledge of culture-specific concepts and evaluate the capacity for cultural adaptation through contextual information. This allows us to distinguish between parametric knowledge acquired during training and contextual knowledge provided during inference via visual and textual descriptions. Our evaluation of several state-of-the-art open VLMs shows large performance disparities between culture-specific and common concepts in the parametric setting. Moreover, experiments with contextual knowledge indicate that models struggle to effectively utilize multimodal information and bind culture-specific concepts to their depictions. Our findings reveal limitations in the cultural understanding and adaptability of current VLMs that need to be addressed toward more culturally inclusive models.

computational linguistic, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2410.15453

Country: Asia > Indonesia (0.28)

Genre: Research Report > New Finding (1.00)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Learning To See But Forgetting To Follow: Visual Instruction Tuning Makes LLMs More Prone To Jailbreak Attacks

Pantazopoulos, Georgios, Parekh, Amit, Nikandrou, Malvina, Suglia, Alessandro

arXiv.org Artificial IntelligenceMay-7-2024

Augmenting Large Language Models (LLMs) with image-understanding capabilities has resulted in a boom of high-performing Vision-Language models (VLMs). While studying the alignment of LLMs to human values has received widespread attention, the safety of VLMs has not received the same attention. In this paper, we explore the impact of jailbreaking on three state-of-the-art VLMs, each using a distinct modeling approach. By comparing each VLM to their respective LLM backbone, we find that each VLM is more susceptible to jailbreaking. We consider this as an undesirable outcome from visual instruction-tuning, which imposes a forgetting effect on an LLM's safety guardrails. Therefore, we provide recommendations for future work based on evaluation strategies that aim to highlight the weaknesses of a VLM, as well as take safety measures into account during visual instruction tuning.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2405.04403

Genre: Research Report (0.82)

Industry:

Law (0.68)
Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Lost in Space: Probing Fine-grained Spatial Understanding in Vision and Language Resamplers

Pantazopoulos, Georgios, Suglia, Alessandro, Lemon, Oliver, Eshghi, Arash

arXiv.org Artificial IntelligenceApr-21-2024

An effective method for combining frozen large language models (LLM) and visual encoders involves a resampler module that creates a `visual prompt' which is provided to the LLM, along with the textual prompt. While this approach has enabled impressive performance across many coarse-grained tasks like image captioning and visual question answering, more fine-grained tasks that require spatial understanding have not been thoroughly examined. In this paper, we use \textit{diagnostic classifiers} to measure the extent to which the visual prompt produced by the resampler encodes spatial information. Our results show that this information is largely absent from the resampler output when kept frozen during training of the classifiers. However, when the resampler and classifier are trained jointly, we observe a significant performance boost. This shows that the compression achieved by the resamplers can in principle encode the requisite spatial information, but that more object-aware objectives are needed at the pretraining stage to facilitate this capability

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2404.13594

Country:

North America > United States (0.14)
Europe > Switzerland (0.14)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Multitask Multimodal Prompted Training for Interactive Embodied Task Completion

Pantazopoulos, Georgios, Nikandrou, Malvina, Parekh, Amit, Hemanthage, Bhathiya, Eshghi, Arash, Konstas, Ioannis, Rieser, Verena, Lemon, Oliver, Suglia, Alessandro

arXiv.org Artificial IntelligenceNov-7-2023

Interactive and embodied tasks pose at least two fundamental challenges to existing Vision & Language (VL) models, including 1) grounding language in trajectories of actions and observations, and 2) referential disambiguation. To tackle these challenges, we propose an Embodied MultiModal Agent (EMMA): a unified encoder-decoder model that reasons over images and trajectories, and casts action prediction as multimodal text generation. By unifying all tasks as text generation, EMMA learns a language of actions which facilitates transfer across tasks. Different to previous modular approaches with independently trained components, we use a single multitask model where each task contributes to goal completion. EMMA performs on par with similar models on several VL benchmarks and sets a new state-of-the-art performance (36.81% success rate) on the Dialog-guided Task Completion (DTC), a benchmark to evaluate dialog-guided agents in the Alexa Arena

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2311.04067

Country:

Europe (0.92)
Asia > Middle East (0.46)

Genre: Research Report (0.64)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(2 more...)

Add feedback