Aragon-Camarasa, Gerardo
Zero-Shot Interactive Text-to-Image Retrieval via Diffusion-Augmented Representations
Long, Zijun, Liang, Kangheng, Aragon-Camarasa, Gerardo, Mccreadie, Richard, Henderson, Paul
Interactive Text-to-Image Retrieval (I-TIR) has emerged as a transformative user-interactive tool for applications in domains such as e-commerce and education. Yet, current methodologies predominantly depend on finetuned Multimodal Large Language Models (MLLMs), which face two critical limitations: (1) finetuning imposes prohibitive computational overhead and long-term maintenance costs, and (2) finetuning narrows the pretrained knowledge distribution of MLLMs, reducing their adaptability to novel scenarios. These issues are exacerbated by the inherently dynamic nature of real-world I-TIR systems, where queries and image databases evolve in complexity and diversity, often deviating from static training distributions. To overcome these constraints, we propose Diffusion Augmented Retrieval (DAR), a framework that bypasses MLLM finetuning entirely. DAR combines Large Language Model (LLM)-guided query refinement with Diffusion Model (DM)-based visual synthesis to create contextually enriched intermediate representations. This dual-modality approach captures nuanced user intent more holistically, enabling precise alignment between textual queries and visually relevant images. Rigorous evaluations across four benchmarks reveal DAR's dual strengths: (1) Zero-shot parity: DAR matches state-of-the-art finetuned I-TIR models on straightforward queries without any task-specific training. (2) Scalable generalization: DAR surpasses finetuned baselines by 7.61% in Hits@10 (top-10 accuracy) under multi-turn conversational complexity, demonstrating robustness to intricate, distributionally shifted interactions. By eliminating finetuning dependencies and leveraging generative-augmented representations, DAR establishes a new trajectory for efficient, adaptive, and scalable cross-modal retrieval systems.
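The retrieval step the abstract describes can be pictured as a score fusion between the refined text query and a diffusion-generated intermediate image. The sketch below is a hypothetical illustration only: the function names, the fusion weight `alpha`, and the toy 3-D vectors standing in for real multimodal embeddings are all assumptions, not the paper's implementation.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dar_scores(text_emb, gen_img_emb, gallery_embs, alpha=0.5):
    """Fuse text-to-image and image-to-image similarities.

    text_emb:     embedding of the LLM-refined textual query
    gen_img_emb:  embedding of the diffusion-generated intermediate image
    gallery_embs: candidate image embeddings from the database
    alpha:        hypothetical weight balancing the two modalities
    """
    return [alpha * cosine(text_emb, g) + (1 - alpha) * cosine(gen_img_emb, g)
            for g in gallery_embs]

# Toy 3-D embeddings standing in for real CLIP-style features.
text_q = np.array([1.0, 0.0, 0.0])
gen_img = np.array([0.8, 0.6, 0.0])
gallery = [np.array([1.0, 0.1, 0.0]),   # close to the query intent
           np.array([0.0, 0.0, 1.0])]   # unrelated image
scores = dar_scores(text_q, gen_img, gallery)
ranking = sorted(range(len(gallery)), key=lambda i: -scores[i])
```

Under this toy setup the gallery image aligned with both the text query and the generated image ranks first, which is the intuition behind using the synthesized image as an enriched intermediate representation.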
Breaking Down the Barriers: Investigating Non-Expert User Experiences in Robotic Teleoperation in UK and Japan
Audonnet, Florent P, Hamilton, Andrew, Domae, Yakiyasu, Ramirez-Alpizar, Ixchel G, Aragon-Camarasa, Gerardo
Robots are being created each year with the goal of integrating them into our daily lives. As such, there is growing research interest in evaluating human trust toward robots. In addition, teleoperating robotic arms can be challenging for non-experts. To reduce the strain put on the user, we created TELESIM, a modular and plug-and-play framework that enables direct teleoperation of any robotic arm using a digital twin as the interface between the user and the robotic system. We evaluated our framework in a user survey with three combinations of robots and control methods, recording the user's workload and performance on a tower-stacking task. However, an analysis of the strain on the user and their ability to trust robots was omitted. This paper addresses these omissions by presenting additional results from our user survey of 37 participants carried out in the United Kingdom. We also present the results of a further user survey, performed under similar conditions in Japan, which addresses the limitations of our previous approach by interfacing a VR controller with a UR5e. Our experimental results show that the most towers were built with the UR5e. The UR5e also induced the least cognitive stress, while the combination of Senseglove and UR3 placed the highest physical strain on users and caused the most frustration. Finally, the Japanese participants appear to be more trusting of robots than the British participants.
Flat'n'Fold: A Diverse Multi-Modal Dataset for Garment Perception and Manipulation
Zhuang, Lipeng, Fan, Shiyu, Ru, Yingdong, Audonnet, Florent, Henderson, Paul, Aragon-Camarasa, Gerardo
We present Flat'n'Fold, a novel large-scale dataset for garment manipulation that addresses critical gaps in existing datasets. We quantify the dataset's diversity and complexity against existing benchmarks and show that it features natural and diverse real-world demonstrations, from both humans and robots, in terms of visual and action information. The robot data consists of human-controlled demonstrations, in which an expert operator teleoperates a robot to execute the same garment manipulation tasks, aiming to replicate natural, human-like approaches within the robot's operational limitations. To showcase Flat'n'Fold's utility, we establish new benchmarks, including grasping point prediction. This underscores Flat'n'Fold's potential to drive advances in robotic perception and manipulation of deformable objects. Manipulating garments remains a significant challenge in robotics: tasks such as flattening and folding require understanding the vast space of configurations that garments can adopt [1], [2], and planning complex sequences of actions.
CLCE: An Approach to Refining Cross-Entropy and Contrastive Learning for Optimized Learning Fusion
Long, Zijun, Killick, George, Zhuang, Lipeng, Aragon-Camarasa, Gerardo, Meng, Zaiqiao, Mccreadie, Richard
State-of-the-art pre-trained image models predominantly adopt a two-stage approach: initial unsupervised pre-training on large-scale datasets followed by task-specific fine-tuning using Cross-Entropy loss~(CE). However, it has been demonstrated that CE can compromise model generalization and stability. While recent works employing contrastive learning address some of these limitations by enhancing the quality of embeddings and producing better decision boundaries, they often overlook the importance of hard negative mining and rely on resource-intensive, slow training with large sample batches. To counter these issues, we introduce a novel approach named CLCE, which integrates Label-Aware Contrastive Learning with CE. Our approach not only maintains the strengths of both loss functions but also leverages hard negative mining in a synergistic way to enhance performance. Experimental results demonstrate that CLCE significantly outperforms CE in Top-1 accuracy across twelve benchmarks, achieving gains of up to 3.52% in few-shot learning scenarios and 3.41% in transfer learning settings with the BEiT-3 model. Importantly, our proposed CLCE approach effectively mitigates the dependency of contrastive learning on large batch sizes, such as 4096 samples per batch, a limitation that has previously constrained the application of contrastive learning in budget-limited hardware environments.
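A minimal numpy sketch of the kind of loss combination the abstract describes: a label-aware (supervised) contrastive term blended with cross-entropy. The weighting `lam`, the temperature `tau`, and all function names here are assumptions for illustration; the actual CLCE formulation, including its hard negative mining, is defined in the paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cross_entropy(logits, label):
    """Standard CE for a single example."""
    return -np.log(softmax(logits)[label])

def supcon_loss(z, labels, tau=0.1):
    """Label-aware (supervised) contrastive loss over a batch of embeddings:
    same-label pairs are pulled together, all others pushed apart."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # L2-normalise
    n = len(z)
    sim = z @ z.T / tau
    total = 0.0
    for i in range(n):
        pos = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not pos:
            continue
        denom = sum(np.exp(sim[i, a]) for a in range(n) if a != i)
        total += -sum(np.log(np.exp(sim[i, p]) / denom) for p in pos) / len(pos)
    return total / n

def clce_loss(logits, z, labels, lam=0.5):
    """Hypothetical CLCE-style fusion: weighted sum of CE and the contrastive term."""
    ce = np.mean([cross_entropy(l, y) for l, y in zip(logits, labels)])
    return lam * ce + (1 - lam) * supcon_loss(z, labels)

# Toy batch of 4: two classes, embeddings roughly clustered by label.
labels = [0, 0, 1, 1]
z = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
logits = np.array([[2.0, 0.1], [1.5, 0.3], [0.2, 1.8], [0.1, 1.4]])
loss = clce_loss(logits, z, labels)
```

The contrastive term rewards embeddings that cluster by label: collapsing different-class embeddings onto each other drives it up, which is the geometric property the abstract credits for better decision boundaries.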
Foveation in the Era of Deep Learning
Killick, George, Henderson, Paul, Siebert, Paul, Aragon-Camarasa, Gerardo
Many biological vision systems sense the world with a foveated sensor, where the highest resolution processing is limited to only a small central portion of the visual field (the fovea). Computer vision systems have taken inspiration from this aspect of biological vision and incorporated it into visual attention models that learn to sample and process visual scenes actively [1, 2, 3]. The promise of foveated vision is the ability to resolve and process fine details while simultaneously maintaining a wide field of view, which has applications to problems where semantic information can exist over a high dynamic range of scales. More generally, it is well known that scaling the resolution of inputs to CNNs can reliably improve accuracy in object recognition problems [4]. Through sparse sampling in the periphery of the field of view, foveated sensors can achieve this with significantly fewer pixels than a uniform sensor, making it an appealing approach to building parsimonious vision systems.
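One common way to realise such a sensor, sketched here under the assumption of a simple log-polar layout (a generic construction, not any specific model from the paper): ring radii grow geometrically with eccentricity, so samples are dense at the fovea and sparse in the periphery, covering a wide field of view with comparatively few pixels.

```python
import numpy as np

def foveated_grid(n_rings=8, n_wedges=16, r_min=0.05, r_max=1.0):
    """Sample points on a log-polar grid. Because ring radii grow
    geometrically, sample density is highest near the centre (fovea)
    and falls off toward the periphery."""
    radii = np.geomspace(r_min, r_max, n_rings)
    angles = np.linspace(0.0, 2 * np.pi, n_wedges, endpoint=False)
    xs = np.outer(radii, np.cos(angles)).ravel()
    ys = np.outer(radii, np.sin(angles)).ravel()
    return np.stack([xs, ys], axis=1)  # (n_rings * n_wedges, 2), coords in [-1, 1]

pts = foveated_grid()
```

With the defaults above, 128 samples span the full field of view, yet well over half of them fall within the central half of the sensor's radius, illustrating the fovea/periphery trade-off the abstract describes.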
TELESIM: A Modular and Plug-and-Play Framework for Robotic Arm Teleoperation using a Digital Twin
Audonnet, Florent P, Grizou, Jonathan, Hamilton, Andrew, Aragon-Camarasa, Gerardo
We present TELESIM, a modular and plug-and-play framework for direct teleoperation of a robotic arm using a digital twin as the interface between the user and the robotic system. We tested TELESIM by performing a user survey with 37 participants on two different robots using two different control modalities: a virtual reality controller and a finger-mapping hardware controller with different grasping systems. Users were asked to teleoperate the robot to pick and place 3 cubes in a tower and to repeat this task as many times as possible in 10 minutes, with only 5 minutes of training beforehand. Our experimental results show that most users succeeded in building at least one tower of 3 cubes regardless of the control modality or robot used, demonstrating the user-friendliness of TELESIM.
Enabling the Sense of Self in a Dual-Arm Robot
AlQallaf, Ali, Aragon-Camarasa, Gerardo
While humans are aware of their body and capabilities, robots are not. To address this, we present in this paper a neural network architecture that enables a dual-arm robot to get a sense of itself in an environment. Our approach is inspired by human self-awareness developmental levels and serves as the underlying building block for a robot to achieve awareness of itself while carrying out tasks in an environment. We assume that a robot has to know itself before interacting with the environment in order to be able to support different robotic tasks. Hence, we implemented a neural network architecture to enable a robot to differentiate its limbs from the environment using visual and proprioception sensory inputs. We demonstrate experimentally that a robot can distinguish itself with an accuracy of 88.7% on average in cluttered environmental settings and under confounding input signals.
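A schematic of the kind of sensory fusion the abstract describes: visual and proprioceptive features concatenated and fed to a binary self/other decision. The feature dimensions and the linear classifier stub below are hypothetical placeholders standing in for the paper's trained neural network architecture.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical dimensions: 32-D visual features, 7-D joint (proprioception) state.
W = rng.normal(size=(32 + 7,)) * 0.1  # stands in for trained classifier weights
b = 0.0

def is_self(visual_feat, joint_state):
    """Binary self/other decision from fused visual + proprioceptive input."""
    x = np.concatenate([visual_feat, joint_state])  # early fusion by concatenation
    p = 1.0 / (1.0 + np.exp(-(W @ x + b)))          # sigmoid probability
    return p > 0.5, p

decision, prob = is_self(rng.normal(size=32), rng.normal(size=7))
```

The key design point is that the decision conditions on both modalities at once, which is what lets such a model reject confounding visual input that does not match the robot's own proprioceptive signal.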
Intrinsic Robotic Introspection: Learning Internal States From Neuron Activations
Pitsillos, Nikos, Pore, Ameya, Jensen, Bjorn Sand, Aragon-Camarasa, Gerardo
We present an introspective framework inspired by the process of how humans perform introspection. Our working assumption is that neural network activations encode information, and that building internal states from these activations can improve the performance of an actor-critic model. We perform experiments where we first train a Variational Autoencoder model to reconstruct the activations of a feature extraction network and then use its latent space to improve the performance of an actor-critic when deciding which low-level robotic behaviour to execute. We show that internal states reduce the number of training episodes needed by about 1300, indicating faster convergence of the actor-critic to a high success rate on a robotic task.
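The idea of building internal states from activations can be sketched as follows. This is a toy illustration under stated assumptions: the linear encoder stub stands in for the trained VAE's encoder, and all dimensions (256-D activations, 8-D latent, 16-D observation) are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: 256-D feature-extractor activations, 8-D latent state.
W_enc = rng.normal(size=(8, 256)) * 0.05  # stands in for a trained VAE encoder

def internal_state(activations):
    """Compress raw network activations into a compact internal state
    (the VAE's latent code), mirroring the introspection idea."""
    return W_enc @ activations

def actor_input(observation, activations):
    """Actor-critic input = task observation augmented with the internal state."""
    return np.concatenate([observation, internal_state(activations)])

obs = rng.normal(size=16)
acts = rng.normal(size=256)
x = actor_input(obs, acts)
```

The augmented input simply appends an 8-D summary of what the feature extractor is "doing" to the ordinary observation, giving the policy access to information about its own processing.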