gest
From Vision To Language through Graph of Events in Space and Time: An Explainable Self-supervised Approach
Masala, Mihai, Leordeanu, Marius
The task of describing video content in natural language is commonly referred to as video captioning. Unlike conventional video captions, which are typically brief and widely available, long-form paragraph descriptions in natural language are scarce. This limitation of current datasets is due to the expensive manual annotation required and to the highly challenging task of explaining the language formation process from the perspective of the underlying story, as a complex system of interconnected events in space and time. Through a thorough analysis of recently published methods and available datasets, we identify a general lack of published resources dedicated to the problem of describing videos in complex language, beyond the level of descriptions in the form of enumerations of simple captions. Furthermore, while state-of-the-art methods produce impressive results on the task of generating shorter captions from videos by direct end-to-end learning between the videos and text, the problem of explaining the relationship between vision and language is still beyond our reach. In this work, we propose a shared representation between vision and language, based on graphs of events in space and time, which can be obtained in an explainable and analytical way, to integrate and connect multiple vision tasks and produce the final natural language description. Moreover, we demonstrate how our automated and explainable video description generation process can function as a fully automatic teacher to effectively train direct, end-to-end neural student pathways within a self-supervised neuro-analytical system. We validate that our explainable neuro-analytical approach generates coherent, rich and relevant textual descriptions on videos collected from multiple varied datasets, using standard evaluation metrics, human annotations, and consensus from ensembles of state-of-the-art VLMs.
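To make the representation concrete, below is a minimal sketch of what a graph of events in space and time could look like as a data structure. All names here (Event, EventGraph, the time and box fields, the relation labels) are illustrative assumptions, not the schema actually used in the paper.

```python
# A minimal sketch of a "graph of events in space and time".
# All class and field names are illustrative assumptions, not the
# authors' actual schema.
from dataclasses import dataclass, field


@dataclass
class Event:
    """One node: an action or interaction with a spatio-temporal extent."""
    name: str                               # e.g. "person opens door"
    t_start: float                          # seconds from the start of the video
    t_end: float
    box: tuple[float, float, float, float]  # (x, y, w, h) image region


@dataclass
class EventGraph:
    events: list[Event] = field(default_factory=list)
    # Edges are typed relations between events, e.g. temporal order
    # or a shared participant.
    edges: list[tuple[int, int, str]] = field(default_factory=list)

    def add_relation(self, i: int, j: int, relation: str) -> None:
        self.edges.append((i, j, relation))


g = EventGraph()
g.events.append(Event("person enters room", 0.0, 2.5, (0.1, 0.2, 0.3, 0.6)))
g.events.append(Event("person sits on chair", 2.5, 5.0, (0.4, 0.3, 0.2, 0.5)))
g.add_relation(0, 1, "before")       # temporal order
g.add_relation(0, 1, "same actor")   # shared participant
```

The point of such an explicit structure is that every node and edge can be inspected individually, which is what makes a description process built on top of it explainable rather than a black box.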
Towards Zero-Shot & Explainable Video Description by Reasoning over Graphs of Events in Space and Time
Masala, Mihai, Leordeanu, Marius
Transformer-based solutions are the backbone of current state-of-the-art methods for language generation, image and video classification, segmentation, action and object recognition, among many others. Interestingly enough, while these state-of-the-art methods produce impressive results in their respective domains, the problem of understanding the relationship between vision and language is largely still beyond our reach. In this work, we propose a common ground between vision and language based on events in space and time in an explainable and programmatic way, to connect learning-based vision and language state-of-the-art models and provide a solution to the long-standing problem of describing videos in natural language. We validate that our algorithmic approach is able to generate coherent, rich and relevant textual descriptions on videos collected from a variety of datasets, using both standard metrics (e.g. BLEU, ROUGE) and the modern LLM-as-a-Jury approach.

Moreover, such models suffer from overfitting, such that once given a video from an unseen context or distribution, the quality and accuracy of the description drops, as our evaluations prove. On the other hand, VLLMs have shown impressive results, being capable of generating long, rich descriptions of videos. Unfortunately, VLLMs still share some of the same weaknesses as previous methods: they are unexplainable and they still rely on sampling frames to process a video. Moreover, top-performing models such as GPT, Claude or Gemini are not open and are only accessible via a paid API. We argue that one of the main reasons why this interdisciplinary, cross-domain task is still far from being solved is that we still lack an explainable way to bridge this apparently insurmountable gap. Explainability could provide a more analytical and stage-wise way to make the transition from vision to language that is both trustworthy and makes
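For reference, the standard metrics mentioned above can be computed with off-the-shelf libraries. Below is a minimal sketch, assuming the nltk and rouge-score packages are installed; the reference and generated sentences are invented for illustration.

```python
# Scoring a generated description against a reference with BLEU
# (n-gram precision) and ROUGE-L (longest common subsequence).
# Requires `pip install nltk rouge-score`.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "a man enters the room and sits down on a chair"
generated = "a person walks into the room and sits on a chair"

# BLEU compares n-gram overlap; smoothing avoids zero scores on short texts.
bleu = sentence_bleu(
    [reference.split()], generated.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-L rewards the longest common subsequence of words.
rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = rouge.score(reference, generated)["rougeL"].fmeasure

print(f"BLEU: {bleu:.3f}  ROUGE-L: {rouge_l:.3f}")
```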
Kinesthetic Teaching in Robotics: a Mixed Reality Approach
Macciò, Simone, Shaaban, Mohamad, Carfì, Alessandro, Mastrogiovanni, Fulvio
As collaborative robots become more common in manufacturing scenarios and are adopted in hybrid human-robot teams, we should develop new interaction and communication strategies to ensure smooth collaboration between agents. In this paper, we propose a novel communicative interface that uses Mixed Reality as a medium to perform Kinesthetic Teaching (KT) on any robotic platform. We evaluate our proposed approach in a user study involving multiple subjects and two different robots, comparing traditional physical KT with holographic-based KT through user experience questionnaires and task-related metrics. Index Terms: Human-Robot Interaction, Mixed Reality, Kinesthetic Teaching, Software Architecture. In smart factories, robots are expected to coexist and work alongside humans rather than replace them.
GEST: the Graph of Events in Space and Time as a Common Representation between Vision and Language
Masala, Mihai, Cudlenco, Nicolae, Rebedea, Traian, Leordeanu, Marius
One of the essential human skills is the ability to seamlessly build an inner representation of the world. By exploiting this representation, humans are capable of easily finding consensus between visual, auditory and linguistic perspectives. In this work, we set out to understand and emulate this ability through an explicit representation for both vision and language - Graphs of Events in Space and Time (GEST). GEST allows us to measure the similarity between texts and videos in a semantic and fully explainable way, through graph matching. It also allows us to generate text and videos from a common representation whose content is well understood. In this work we show that graph matching similarity metrics based on GEST outperform classical text generation metrics and can also boost the performance of state-of-the-art, heavily trained metrics.
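As a toy illustration of the graph-matching idea, the sketch below scores two event graphs, reduced here to bags of event descriptions, by greedily pairing their most similar nodes. This is a deliberately simplified stand-in for intuition only, not the actual GEST metric; node_sim and graph_similarity are invented names, and real GEST matching also accounts for graph structure, not just node content.

```python
# A toy, fully explainable similarity between two event graphs via node
# matching, in the spirit of the graph-matching metric described above.
# This is a simplified stand-in, not the paper's actual GEST metric.
from difflib import SequenceMatcher


def node_sim(a: str, b: str) -> float:
    """String overlap between two event descriptions, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def graph_similarity(events_a: list[str], events_b: list[str]) -> float:
    """Greedily match events across graphs and average the match scores.

    Each accepted pair is itself an explanation of why the two
    descriptions were judged similar.
    """
    if not events_a or not events_b:
        return 0.0
    pairs = sorted(
        ((node_sim(a, b), i, j)
         for i, a in enumerate(events_a)
         for j, b in enumerate(events_b)),
        reverse=True,
    )
    used_a, used_b, total = set(), set(), 0.0
    for score, i, j in pairs:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            total += score
    return total / max(len(events_a), len(events_b))


print(graph_similarity(
    ["person enters room", "person sits on chair"],
    ["a man walks into the room", "the man sits down"],
))
```

Because the final score decomposes into individual event pairs, every contribution to the similarity can be traced back and inspected, which is the explainability property the abstract emphasizes.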
bm371613/gest
For health-related reasons, I had to stop using a mouse and a keyboard. Talon allowed me to type with my voice and move the cursor with my eyes. This project was started to complement this setup with hand gestures. The project is in an early stage of development. I use it on a daily basis, so it should be good enough for some.