Goto

Collaborating Authors

 Khan, Muhammad Haris


CognitiveDrone: A VLA Model and Evaluation Benchmark for Real-Time Cognitive Task Solving and Reasoning in UAVs

arXiv.org Artificial Intelligence

CognitiveDrone: A VLA Model and Evaluation Benchmark for Real-Time Cognitive T ask Solving and Reasoning in UA Vs Artem Lykov, V alerii Serpiva, Muhammad Haris Khan, Oleg Sautenkov, Artyom Myshlyaev, Grik Tadevosyan, Y asheerah Y aqoot, and Dzmitry Tsetserukou Abstract -- This paper introduces CognitiveDrone, a novel Vision-Language-Action (VLA) model tailored for complex Unmanned Aerial V ehicles (UA Vs) tasks that demand advanced cognitive abilities. Trained on a dataset comprising over 8,000 simulated flight trajectories across three key categories--Human Recognition, Symbol Understanding, and Reasoning--the model generates real-time 4D action commands based on first-person visual inputs and textual instructions. T o further enhance performance in intricate scenarios, we propose CognitiveDrone-R1, which integrates an additional Vision-Language Model (VLM) reasoning module to simplify task directives prior to high-frequency control. Experimental evaluations using our open-source benchmark, CognitiveDroneBench, reveal that while a racing-oriented model (RaceVLA) achieves an overall success rate of 31.3%, the base CognitiveDrone model reaches 59.6%, and CognitiveDrone-R1 attains a success rate of 77.2%. These results demonstrate improvements of up to 30% in critical cognitive tasks, underscoring the effectiveness of incorporating advanced reasoning capabilities into UA V control systems. Our contributions include the development of a state-of-the-art VLA model for UA V control and the introduction of the first dedicated benchmark for assessing cognitive tasks in drone operations.


Evolution 6.0: Evolving Robotic Capabilities Through Generative Design

arXiv.org Artificial Intelligence

We propose a new concept, Evolution 6.0, which represents the evolution of robotics driven by Generative AI. When a robot lacks the necessary tools to accomplish a task requested by a human, it autonomously designs the required instruments and learns how to use them to achieve the goal. Evolution 6.0 is an autonomous robotic system powered by Vision-Language Models (VLMs), Vision-Language Action (VLA) models, and Text-to-3D generative models for tool design and task execution. The system comprises two key modules: the Tool Generation Module, which fabricates task-specific tools from visual and textual data, and the Action Generation Module, which converts natural language instructions into robotic actions. It integrates QwenVLM for environmental understanding, OpenVLA for task execution, and Llama-Mesh for 3D tool generation. Evaluation results demonstrate a 90% success rate for tool generation with a 10-second inference time, and action generation achieving 83.5% in physical and visual generalization, 70% in motion generalization, and 37% in semantic generalization. Future improvements will focus on bimanual manipulation, expanded task capabilities, and enhanced environmental interpretation to improve real-world adaptability.


GazeGrasp: DNN-Driven Robotic Grasping with Wearable Eye-Gaze Interface

arXiv.org Artificial Intelligence

Abstract--We present GazeGrasp, a gaze-based manipulation system enabling individuals with motor impairments to control collaborative robots using eye-gaze. The system employs an ESP32 CAM for eye tracking, MediaPipe for gaze detection, and YOLOv8 for object localization, integrated with a Universal Robot UR10 for manipulation tasks. After user-specific calibration, the system allows intuitive object selection with a magnetic snapping effect and robot control via eye gestures. Experimental evaluation involving 13 participants demonstrated that the magnetic snapping effect significantly reduced gaze alignment time, improving task efficiency by 31%. GazeGrasp provides a robust, hands-free interface for assistive robotics, enhancing accessibility and autonomy for users.


NT-VOT211: A Large-Scale Benchmark for Night-time Visual Object Tracking

arXiv.org Artificial Intelligence

Many current visual object tracking benchmarks such as OTB100, NfS, UAV123, LaSOT, and GOT-10K, predominantly contain day-time scenarios while the challenges posed by the night-time has been less investigated. It is primarily because of the lack of a large-scale, well-annotated night-time benchmark for rigorously evaluating tracking algorithms. To this end, this paper presents NT-VOT211, a new benchmark tailored for evaluating visual object tracking algorithms in the challenging night-time conditions. NT-VOT211 consists of 211 diverse videos, offering 211,000 well-annotated frames with 8 attributes including camera motion, deformation, fast motion, motion blur, tiny target, distractors, occlusion and out-of-view. To the best of our knowledge, it is the largest night-time tracking benchmark to-date that is specifically designed to address unique challenges such as adverse visibility, image blur, and distractors inherent to night-time tracking scenarios. Through a comprehensive analysis of results obtained from 42 diverse tracking algorithms on NT-VOT211, we uncover the strengths and limitations of these algorithms, highlighting opportunities for enhancements in visual object tracking, particularly in environments with suboptimal lighting. Besides, a leaderboard for revealing performance rankings, annotation tools, comprehensive meta-information and all the necessary code for reproducibility of results is made publicly available. We believe that our NT-VOT211 benchmark will not only be instrumental in facilitating field deployment of VOT algorithms, but will also help VOT enhancements and it will unlock new real-world tracking applications. Our dataset and other assets can be found at: {https://github.com/LiuYuML/NV-VOT211.


Towards Combating Frequency Simplicity-biased Learning for Domain Generalization

arXiv.org Artificial Intelligence

Domain generalization methods aim to learn transferable knowledge from source domains that can generalize well to unseen target domains. Recent studies show that neural networks frequently suffer from a simplicity-biased learning behavior which leads to over-reliance on specific frequency sets, namely as frequency shortcuts, instead of semantic information, resulting in poor generalization performance. Despite previous data augmentation techniques successfully enhancing generalization performances, they intend to apply more frequency shortcuts, thereby causing hallucinations of generalization improvement. In this paper, we aim to prevent such learning behavior of applying frequency shortcuts from a data-driven perspective. Given the theoretical justification of models' biased learning behavior on different spatial frequency components, which is based on the dataset frequency properties, we argue that the learning behavior on various frequency components could be manipulated by changing the dataset statistical structure in the Fourier domain. Intuitively, as frequency shortcuts are hidden in the dominant and highly dependent frequencies of dataset structure, dynamically perturbating the over-reliance frequency components could prevent the application of frequency shortcuts. To this end, we propose two effective data augmentation modules designed to collaboratively and adaptively adjust the frequency characteristic of the dataset, aiming to dynamically influence the learning behavior of the model and ultimately serving as a strategy to mitigate shortcut learning. Code is available at AdvFrequency.


Multi-modal Medical Image Fusion For Non-Small Cell Lung Cancer Classification

arXiv.org Artificial Intelligence

The early detection and nuanced subtype classification of non-small cell lung cancer (NSCLC), a predominant cause of cancer mortality worldwide, is a critical and complex issue. In this paper, we introduce an innovative integration of multi-modal data, synthesizing fused medical imaging (CT and PET scans) with clinical health records and genomic data. This unique fusion methodology leverages advanced machine learning models, notably MedClip and BEiT, for sophisticated image feature extraction, setting a new standard in computational oncology. Our research surpasses existing approaches, as evidenced by a substantial enhancement in NSCLC detection and classification precision. The results showcase notable improvements across key performance metrics, including accuracy, precision, recall, and F1-score. Specifically, our leading multi-modal classifier model records an impressive accuracy of 94.04%. We believe that our approach has the potential to transform NSCLC diagnostics, facilitating earlier detection and more effective treatment planning and, ultimately, leading to superior patient outcomes in lung cancer care.


Attention Based Simple Primitives for Open World Compositional Zero-Shot Learning

arXiv.org Artificial Intelligence

Compositional Zero-Shot Learning (CZSL) aims to predict unknown compositions made up of attribute and object pairs. Predicting compositions unseen during training is a challenging task. We are exploring Open World Compositional Zero-Shot Learning (OW-CZSL) in this study, where our test space encompasses all potential combinations of attributes and objects. Our approach involves utilizing the self-attention mechanism between attributes and objects to achieve better generalization from seen to unseen compositions. Utilizing a self-attention mechanism facilitates the model's ability to identify relationships between attribute and objects. The similarity between the self-attended textual and visual features is subsequently calculated to generate predictions during the inference phase. The potential test space may encompass implausible object-attribute combinations arising from unrestricted attribute-object pairings. To mitigate this issue, we leverage external knowledge from ConceptNet to restrict the test space to realistic compositions. Our proposed model, Attention-based Simple Primitives (ASP), demonstrates competitive performance, achieving results comparable to the state-of-the-art.


Multiclass Alignment of Confidence and Certainty for Network Calibration

arXiv.org Artificial Intelligence

Deep neural networks (DNNs) have made great strides in pushing the state-of-the-art in several challenging domains. Recent studies reveal that they are prone to making overconfident predictions. This greatly reduces the overall trust in model predictions, especially in safety-critical applications. Early work in improving model calibration employs post-processing techniques which rely on limited parameters and require a hold-out set. Some recent train-time calibration methods, which involve all model parameters, can outperform the postprocessing methods. To this end, we propose a new train-time calibration method, which features a simple, plug-and-play auxiliary loss known as multi-class alignment of predictive mean confidence and predictive certainty (MACC). It is based on the observation that a model miscalibration is directly related to its predictive certainty, so a higher gap between the mean confidence and certainty amounts to a poor calibration both for in-distribution and out-of-distribution predictions. Armed with this insight, our proposed loss explicitly encourages a confident (or underconfident) model to also provide a low (or high) spread in the presoftmax distribution. Extensive experiments on ten challenging datasets, covering in-domain, out-domain, non-visual recognition and medical image classification scenarios, show that our method achieves state-of-the-art calibration performance for both in-domain and out-domain predictions. Our code and models will be publicly released.


Unsupervised Deep Graph Matching Based on Cycle Consistency

arXiv.org Artificial Intelligence

We contribute to the sparsely populated area of unsupervised deep graph matching with application to keypoint matching in images. Contrary to the standard \emph{supervised} approach, our method does not require ground truth correspondences between keypoint pairs. Instead, it is self-supervised by enforcing consistency of matchings between images of the same object category. As the matching and the consistency loss are discrete, their derivatives cannot be straightforwardly used for learning. We address this issue in a principled way by building our method upon the recent results on black-box differentiation of combinatorial solvers. This makes our method exceptionally flexible, as it is compatible with arbitrary network architectures and combinatorial solvers. Our experimental evaluation suggests that our technique sets a new state-of-the-art for unsupervised graph matching.