gest
From Vision To Language through Graph of Events in Space and Time: An Explainable Self-supervised Approach
Masala, Mihai, Leordeanu, Marius
The task of describing video content in natural language is commonly referred to as video captioning. Unlike conventional video captions, which are typically brief and widely available, long-form paragraph descriptions in natural language are scarce. This limitation of current datasets is due to the expensive manual annotation required and to the highly challenging task of explaining the language formation process from the perspective of the underlying story, as a complex system of interconnected events in space and time. Through a thorough analysis of recently published methods and available datasets, we identify a general lack of published resources dedicated to the problem of describing videos in complex language, beyond the level of descriptions in the form of enumerations of simple captions. Furthermore, while state-of-the-art methods produce impressive results on the task of generating shorter captions from videos by direct end-to-end learning between the videos and text, the problem of explaining the relationship between vision and language is still beyond our reach. In this work, we propose a shared representation between vision and language, based on graphs of events in space and time, which can be obtained in an explainable and analytical way, to integrate and connect multiple vision tasks and produce the final natural language description. Moreover, we demonstrate how our automated and explainable video description generation process can function as a fully automatic teacher to effectively train direct, end-to-end neural student pathways within a self-supervised neuro-analytical system. We validate that our explainable neuro-analytical approach generates coherent, rich and relevant textual descriptions on videos collected from multiple varied datasets, using standard evaluation metrics, human annotations, and consensus from ensembles of state-of-the-art VLMs.
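To make the representation concrete, below is a minimal sketch of what a graph of events in space and time could look like as a data structure. All names here (Event, EventGraph, the time and box fields, the relation labels) are illustrative assumptions, not the schema actually used in the paper.

```python
# A minimal sketch of a "graph of events in space and time".
# All class and field names are illustrative assumptions, not the
# authors' actual schema.
from dataclasses import dataclass, field


@dataclass
class Event:
    """One node: an action or interaction with a spatio-temporal extent."""
    name: str                               # e.g. "person opens door"
    t_start: float                          # seconds from the start of the video
    t_end: float
    box: tuple[float, float, float, float]  # (x, y, w, h) image region


@dataclass
class EventGraph:
    events: list[Event] = field(default_factory=list)
    # Edges are typed relations between events, e.g. temporal order
    # or a shared participant.
    edges: list[tuple[int, int, str]] = field(default_factory=list)

    def add_relation(self, i: int, j: int, relation: str) -> None:
        self.edges.append((i, j, relation))


g = EventGraph()
g.events.append(Event("person enters room", 0.0, 2.5, (0.1, 0.2, 0.3, 0.6)))
g.events.append(Event("person sits on chair", 2.5, 5.0, (0.4, 0.3, 0.2, 0.5)))
g.add_relation(0, 1, "before")       # temporal order
g.add_relation(0, 1, "same actor")   # shared participant
```

The point of such an explicit structure is that every node and edge can be inspected individually, which is what makes a description process built on top of it explainable rather than a black box.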
Towards Zero-Shot & Explainable Video Description by Reasoning over Graphs of Events in Space and Time
Masala, Mihai, Leordeanu, Marius
Transformer-based solutions are the backbone of current state-of-the-art methods for language generation, image and video classification, segmentation, action and object recognition, among many others. Interestingly enough, while these state-of-the-art methods produce impressive results in their respective domains, the problem of understanding the relationship between vision and language is largely still beyond our reach. In this work, we propose a common ground between vision and language based on events in space and time in an explainable and programmatic way, to connect learning-based vision and language state-of-the-art models and provide a solution to the long-standing problem of describing videos in natural language. We validate that our algorithmic approach is able to generate coherent, rich and relevant textual descriptions on videos collected from a variety of datasets, using both standard metrics (e.g. BLEU, ROUGE) and the modern LLM-as-a-Jury approach.

Moreover, such models suffer from overfitting, such that once given a video from an unseen context or distribution, the quality and accuracy of the description drops, as our evaluations prove. On the other hand, VLLMs have shown impressive results, being capable of generating long, rich descriptions of videos. Unfortunately, VLLMs still share some of the same weaknesses as previous methods: they are unexplainable and they still rely on sampling frames to process a video. Moreover, top-performing models such as GPT, Claude or Gemini are not open and are only accessible via a paid API. We argue that one of the main reasons why this interdisciplinary, cross-domain task is still far from being solved is that we still lack an explainable way to bridge this apparently insurmountable gap. Explainability could provide a more analytical and stage-wise way to make the transition from vision to language that is both trustworthy and makes
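For reference, the standard metrics mentioned above can be computed with off-the-shelf libraries. Below is a minimal sketch, assuming the nltk and rouge-score packages are installed; the reference and generated sentences are invented for illustration.

```python
# Scoring a generated description against a reference with BLEU
# (n-gram precision) and ROUGE-L (longest common subsequence).
# Requires `pip install nltk rouge-score`.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "a man enters the room and sits down on a chair"
generated = "a person walks into the room and sits on a chair"

# BLEU compares n-gram overlap; smoothing avoids zero scores on short texts.
bleu = sentence_bleu(
    [reference.split()], generated.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-L rewards the longest common subsequence of words.
rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = rouge.score(reference, generated)["rougeL"].fmeasure

print(f"BLEU: {bleu:.3f}  ROUGE-L: {rouge_l:.3f}")
```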
Kinesthetic Teaching in Robotics: a Mixed Reality Approach
Macciò, Simone, Shaaban, Mohamad, Carfì, Alessandro, Mastrogiovanni, Fulvio
As collaborative robots become more common in manufacturing scenarios and are adopted in hybrid human-robot teams, we should develop new interaction and communication strategies to ensure smooth collaboration between agents. In this paper, we propose a novel communicative interface that uses Mixed Reality as a medium to perform Kinesthetic Teaching (KT) on any robotic platform. We evaluate our proposed approach in a user study involving multiple subjects and two different robots, comparing traditional physical KT with holographic-based KT through user experience questionnaires and task-related metrics. Index Terms: Human-Robot Interaction, Mixed Reality, Kinesthetic Teaching, Software Architecture. In smart factories, robots are expected to coexist and work alongside humans rather than replace them.
GEST: the Graph of Events in Space and Time as a Common Representation between Vision and Language
Masala, Mihai, Cudlenco, Nicolae, Rebedea, Traian, Leordeanu, Marius
One of the essential human skills is the ability to seamlessly build an inner representation of the world. By exploiting this representation, humans are capable of easily finding consensus between visual, auditory and linguistic perspectives. In this work, we set out to understand and emulate this ability through an explicit representation for both vision and language - Graphs of Events in Space and Time (GEST). GEST allows us to measure the similarity between texts and videos in a semantic and fully explainable way, through graph matching. It also allows us to generate text and videos from a common representation whose content is well understood. In this work we show that graph matching similarity metrics based on GEST outperform classical text generation metrics and can also boost the performance of state-of-the-art, heavily trained metrics.
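As a toy illustration of the graph-matching idea, the sketch below scores two event graphs, reduced here to bags of event descriptions, by greedily pairing their most similar nodes. This is a deliberately simplified stand-in for intuition only, not the actual GEST metric; node_sim and graph_similarity are invented names, and real GEST matching also accounts for graph structure, not just node content.

```python
# A toy, fully explainable similarity between two event graphs via node
# matching, in the spirit of the graph-matching metric described above.
# This is a simplified stand-in, not the paper's actual GEST metric.
from difflib import SequenceMatcher


def node_sim(a: str, b: str) -> float:
    """String overlap between two event descriptions, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def graph_similarity(events_a: list[str], events_b: list[str]) -> float:
    """Greedily match events across graphs and average the match scores.

    Each accepted pair is itself an explanation of why the two
    descriptions were judged similar.
    """
    if not events_a or not events_b:
        return 0.0
    pairs = sorted(
        ((node_sim(a, b), i, j)
         for i, a in enumerate(events_a)
         for j, b in enumerate(events_b)),
        reverse=True,
    )
    used_a, used_b, total = set(), set(), 0.0
    for score, i, j in pairs:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            total += score
    return total / max(len(events_a), len(events_b))


print(graph_similarity(
    ["person enters room", "person sits on chair"],
    ["a man walks into the room", "the man sits down"],
))
```

Because the final score decomposes into individual event pairs, every contribution to the similarity can be traced back and inspected, which is the explainability property the abstract emphasizes.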
bm371613/gest
For health-related reasons, I had to stop using a mouse and a keyboard. Talon allowed me to type with my voice and move the cursor with my eyes. This project was started to complement this setup with hand gestures. The project is in an early stage of development. I use it on a daily basis, so it should be good enough for some.