
GRIM: Task-Oriented Grasping with Conditioning on Generative Examples

Shailesh, Raj, Alok, Kumar, Nayan, Shukla, Priya, Melnik, Andrew, Beetz, Michael, Nandi, Gora Chand

arXiv.org Artificial Intelligence

Task-Oriented Grasping (TOG) requires robots to select grasps that are functionally appropriate for a specified task - a challenge that demands an understanding of task semantics, object affordances, and functional constraints. We present GRIM (Grasp Re-alignment via Iterative Matching), a training-free framework that addresses these challenges by leveraging Video Generation Models (VGMs) together with a retrieve-align-transfer pipeline. Beyond leveraging VGMs, GRIM can construct a memory of object-task exemplars sourced from web images, human demonstrations, or generative models. The retrieved task-oriented grasp is then transferred and refined by evaluating it against a set of geometrically stable candidate grasps to ensure both functional suitability and physical feasibility. GRIM demonstrates strong generalization and achieves state-of-the-art performance on standard TOG benchmarks. Project website: https://grim-tog.github.io
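The retrieve-and-transfer step described above can be sketched in a few lines: retrieve the memory exemplar whose task embedding best matches the query, then snap the retrieved grasp to the nearest geometrically stable candidate. This is a minimal sketch under stated assumptions — the function names, cosine-similarity retrieval, and the use of 3-D grasp centers are illustrative, not GRIM's actual implementation.

```python
import numpy as np

def retrieve_exemplar(task_embedding, memory):
    """Return the stored exemplar grasp whose task embedding is most
    similar (cosine similarity) to the query task embedding.
    memory: list of (embedding, grasp) pairs (hypothetical layout)."""
    best, best_sim = None, -1.0
    for emb, grasp in memory:
        sim = float(np.dot(task_embedding, emb) /
                    (np.linalg.norm(task_embedding) * np.linalg.norm(emb)))
        if sim > best_sim:
            best, best_sim = grasp, sim
    return best

def transfer_grasp(exemplar_grasp, stable_candidates):
    """Refine the retrieved task-oriented grasp by selecting the
    geometrically stable candidate closest to it (plain Euclidean
    distance on grasp centers here, for simplicity)."""
    dists = [np.linalg.norm(g - exemplar_grasp) for g in stable_candidates]
    return stable_candidates[int(np.argmin(dists))]
```

The split mirrors the abstract's pipeline: retrieval supplies functional suitability, and re-alignment to a stable candidate restores physical feasibility.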


Search-TTA: A Multimodal Test-Time Adaptation Framework for Visual Search in the Wild

Tan, Derek Ming Siang, Shailesh, Liu, Boyang, Raj, Alok, Ang, Qi Xuan, Dai, Weiheng, Duhan, Tanishq, Chiun, Jimmy, Cao, Yuhong, Shkurti, Florian, Sartoretti, Guillaume

arXiv.org Artificial Intelligence

To perform outdoor visual navigation and search, a robot may leverage satellite imagery to generate visual priors. This can help inform high-level search strategies, even when such images lack sufficient resolution for target recognition. However, many existing informative path planning or search-based approaches either assume no prior information, or use priors without accounting for how they were obtained. Recent work instead utilizes large Vision Language Models (VLMs) for generalizable priors, but their outputs can be inaccurate due to hallucination, leading to inefficient search. To address these challenges, we introduce Search-TTA, a multimodal test-time adaptation framework with a flexible plug-and-play interface compatible with various input modalities (e.g., image, text, sound) and planning methods (e.g., RL-based). First, we pretrain a satellite image encoder to align with CLIP's visual encoder to output probability distributions of target presence used for visual search. Second, our TTA framework dynamically refines CLIP's predictions during search using uncertainty-weighted gradient updates inspired by Spatial Poisson Point Processes. To train and evaluate Search-TTA, we curate AVS-Bench, a visual search dataset based on internet-scale ecological data containing 380k images and taxonomy data. We find that Search-TTA improves planner performance by up to 30.0%, particularly in cases with poor initial CLIP predictions due to domain mismatch and limited training data. It also performs comparably with significantly larger VLMs, and achieves zero-shot generalization via emergent alignment to unseen modalities. Finally, we deploy Search-TTA on a real UAV via hardware-in-the-loop testing, by simulating its operation within a large-scale simulation that provides onboard sensing.
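The uncertainty-weighted refinement described above can be illustrated with a toy gradient step on a Poisson negative log-likelihood over a grid of cells. Everything here is an illustrative assumption — the function name, the per-cell intensity parameterization, and the confidence weights are stand-ins, not Search-TTA's actual update rule.

```python
import numpy as np

def tta_update(probs, visited, counts, weights, lr=0.1):
    """One uncertainty-weighted gradient step on the Poisson NLL of the
    observed detections, applied only to visited cells.
    probs: predicted per-cell target probabilities (the CLIP-style prior);
    visited: boolean mask of searched cells; counts: detections per cell;
    weights: per-cell confidence weights in [0, 1] (all hypothetical)."""
    lam = np.clip(probs, 1e-6, None)
    grad = np.zeros_like(lam)
    # d/dlam of the NLL is (1 - k/lam), from log-likelihood k*log(lam) - lam
    grad[visited] = weights[visited] * (1.0 - counts[visited] / lam[visited])
    updated = np.clip(lam - lr * grad, 1e-6, None)
    return updated / updated.sum()  # renormalize to a probability map
```

A cell searched without detections loses probability mass relative to unvisited cells, which is the qualitative behavior the framework relies on to redirect the planner.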


Game-Theoretic Resilience Framework for Cyber-Physical Microgrids using Multi-Agent Reinforcement Learning

Niketh, S Krishna, Mitikiri, Sagar Babu, Vignesh, V, Srinivas, Vedantham Lakshmi, Pal, Mayukha

arXiv.org Artificial Intelligence

The increasing reliance on cyber-physical infrastructure in modern power systems has amplified the risk of targeted cyber attacks, necessitating robust and adaptive resilience strategies. This paper presents a mathematically rigorous game-theoretic framework to evaluate and enhance microgrid resilience using a combination of quantitative resilience metrics: Load Served Ratio (LSR), Critical Load Resilience (CLR), Topological Survivability Score (TSS), and DER Resilience Score (DRS). These are integrated into a unified payoff matrix using the Analytic Hierarchy Process (AHP) to assess attack-defense interactions. The framework is formalized as a finite-horizon Markov Decision Process (MDP) with formal convergence guarantees and computational complexity bounds. Three case studies are developed: (1) static attacks analyzed via Nash equilibrium, (2) severe attacks incorporating high-impact strategies, and (3) adaptive attacks using Stackelberg games, regret matching, softmax heuristics, and Multi-Agent Q-Learning. Rigorous theoretical analysis provides convergence proofs with explicit rates, PAC-learning sample complexity bounds, and computational complexity analysis. The framework is tested on an enhanced IEEE 33-bus distribution system with DERs and control switches, demonstrating the effectiveness of adaptive and strategic defenses in improving cyber-physical resilience, with statistically significant improvements of 18.7% ± 2.1% over static approaches.
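The AHP aggregation step can be sketched as follows: derive criterion weights from a pairwise-comparison matrix via its principal eigenvector, then combine the four resilience metrics into a single scalar payoff. The function names and the assumption that each metric is pre-normalized to [0, 1] are illustrative; the paper's exact comparison matrices are not reproduced here.

```python
import numpy as np

def ahp_weights(pairwise):
    """Derive criterion weights from an AHP pairwise-comparison matrix
    via its principal eigenvector, normalized to sum to 1."""
    vals, vecs = np.linalg.eig(pairwise)
    w = np.abs(np.real(vecs[:, np.argmax(np.real(vals))]))
    return w / w.sum()

def payoff(metrics, weights):
    """Scalar defender payoff: weighted sum of the resilience metrics
    (LSR, CLR, TSS, DRS), each assumed normalized to [0, 1]."""
    return float(np.dot(weights, metrics))
```

With these scalar payoffs filled in for every attack/defense pair, the resulting matrix can be handed to a standard Nash or Stackelberg solver, as in the case studies.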


Online Learning for Approximately-Convex Functions with Long-term Adversarial Constraints

Sarkar, Dhruv, Mukhopadhyay, Samrat, Sinha, Abhishek

arXiv.org Artificial Intelligence

We study an online learning problem with long-term budget constraints in the adversarial setting. In this problem, at each round $t$, the learner selects an action from a convex decision set, after which the adversary reveals a cost function $f_t$ and a resource consumption function $g_t$. The cost and consumption functions are assumed to be $\alpha$-approximately convex - a broad class that generalizes convexity and encompasses many common non-convex optimization problems, including DR-submodular maximization, Online Vertex Cover, and Regularized Phase Retrieval. The goal is to design an online algorithm that minimizes cumulative cost over a horizon of length $T$ while approximately satisfying a long-term budget constraint of $B_T$. We propose an efficient first-order online algorithm that guarantees $O(\sqrt{T})$ $\alpha$-regret against the optimal fixed feasible benchmark while consuming at most $O(B_T \log T)+ \tilde{O}(\sqrt{T})$ resources in both full-information and bandit feedback settings. In the bandit feedback setting, our approach yields an efficient solution for the $\texttt{Adversarial Bandits with Knapsacks}$ problem with improved guarantees. We also prove matching lower bounds, demonstrating the tightness of our results. Finally, we characterize the class of $\alpha$-approximately convex functions and show that our results apply to a broad family of problems.
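The problem template above is commonly attacked with a primal-dual scheme: take a gradient step on the Lagrangian $f_t + \lambda g_t$, then a dual ascent step on the constraint violation. The sketch below is a generic primal-dual online gradient descent under stated assumptions (box decision set, per-round budget rate, illustrative names) — the abstract does not specify the paper's algorithm, so this is not a reconstruction of it.

```python
import numpy as np

def primal_dual_ocp(grad_f, grad_g, g, T, dim, eta=0.1, budget_rate=0.0):
    """Generic primal-dual sketch for online optimization with a
    long-term constraint sum_t g_t(x_t) <= B_T (= budget_rate * T).
    grad_f, grad_g, g are callables (t, x) -> gradient or value;
    the decision set is the box [-1, 1]^dim for simplicity."""
    x = np.zeros(dim)
    lam = 0.0
    plays = []
    for t in range(T):
        plays.append(x.copy())
        # primal step on the Lagrangian f_t + lam * g_t
        step = grad_f(t, x) + lam * grad_g(t, x)
        x = np.clip(x - eta * step, -1.0, 1.0)
        # dual ascent on the per-round constraint violation
        lam = max(0.0, lam + eta * (g(t, x) - budget_rate))
    return plays
```

When consumption exceeds the budget rate, the dual variable grows and tilts the primal step toward feasibility; when the budget is slack, the dual variable decays back toward zero.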



pFedSOP : Accelerating Training Of Personalized Federated Learning Using Second-Order Optimization

Sen, Mrinmay, Mohan, Chalavadi Krishna

arXiv.org Artificial Intelligence

Personalized Federated Learning (PFL) enables clients to collaboratively train personalized models tailored to their individual objectives, addressing the challenge of model generalization in traditional Federated Learning (FL) under high data heterogeneity. However, existing PFL methods often require many communication rounds to reach the desired performance, primarily because of slow training caused by first-order optimization, which converges only linearly. Additionally, many of these methods increase local computation because of the additional data fed into the model during the search for personalized local models. One promising remedy for this slow training is second-order optimization, known for its quadratic convergence; however, employing it in PFL is challenging due to the cost of computing and inverting the Hessian matrix. In this paper, we propose pFedSOP, which efficiently utilizes second-order optimization in PFL to accelerate the training of personalized models and enhance performance with fewer communication rounds. Our approach first computes a personalized local gradient update using the Gompertz function-based normalized angle between local and global gradient updates, incorporating client-specific global information. We then use a regularized Fisher Information Matrix (FIM), computed from this personalized gradient update, as an approximation of the Hessian to update the personalized models. This FIM-based second-order optimization speeds up training with fewer communication rounds by avoiding the exact Hessian, and no additional data needs to be fed into the model during the search for personalized local models. Extensive experiments on heterogeneously partitioned image classification datasets with partial client participation demonstrate that pFedSOP outperforms state-of-the-art FL and PFL algorithms.
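The two ingredients named in the abstract can be sketched in numpy: a Gompertz-function weight on the angle between local and global gradient updates, and a Newton-like step using a regularized empirical Fisher matrix in place of the Hessian. The Gompertz constants, the rank-one Fisher approximation, and the damping value are illustrative assumptions, not the paper's exact choices.

```python
import numpy as np

def gompertz_weight(local_g, global_g, a=1.0, b=2.0, c=3.0):
    """Weight for the global update from the Gompertz function
    a*exp(-b*exp(-c*z)) of the normalized angle between local and
    global gradient updates (constants are illustrative)."""
    cos = np.dot(local_g, global_g) / (
        np.linalg.norm(local_g) * np.linalg.norm(global_g) + 1e-12)
    angle = np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi  # normalized to [0, 1]
    return a * np.exp(-b * np.exp(-c * (1.0 - angle)))

def fim_newton_step(grad, damping=0.1):
    """Second-order step using a regularized empirical Fisher
    F = g g^T + damping * I as the Hessian approximation."""
    d = grad.shape[0]
    F = np.outer(grad, grad) + damping * np.eye(d)
    return np.linalg.solve(F, grad)
```

Aligned local and global updates receive a larger weight than opposed ones, and the damped Fisher keeps the linear solve well-conditioned even for a rank-one gradient outer product.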


Behind Maya: Building a Multilingual Vision Language Model

Alam, Nahid, Kanjula, Karthik Reddy, Guthikonda, Surya, Chung, Timothy, Vegesna, Bala Krishna S, Das, Abhipsha, Susevski, Anthony, Chan, Ryan Sze-Yin, Uddin, S M Iftekhar, Islam, Shayekh Bin, Santhosh, Roshan, A, Snegha, Sharma, Drishti, Liu, Chen, Chaturvedi, Isha, Winata, Genta Indra, S, Ashvanth., Mukherjee, Snehanshu, Aji, Alham Fikri

arXiv.org Artificial Intelligence

In recent times, we have seen rapid development of large Vision-Language Models (VLMs). They have shown impressive results on academic benchmarks, primarily in widely spoken languages, but underperform on low-resource languages and in varied cultural contexts. To address these limitations, we introduce Maya, an open-source multilingual VLM. Our contributions are: 1) a multilingual image-text pretraining dataset in eight languages, based on the LLaVA pretraining dataset; and 2) a multilingual image-text model supporting these languages, enhancing cultural and linguistic comprehension in vision-language tasks.


Towards Smarter Hiring: Are Zero-Shot and Few-Shot Pre-trained LLMs Ready for HR Spoken Interview Transcript Analysis?

Maity, Subhankar, Deroy, Aniket, Sarkar, Sudeshna

arXiv.org Artificial Intelligence

This research paper presents a comprehensive analysis of the performance of prominent pre-trained large language models (LLMs), including GPT-4 Turbo, GPT-3.5 Turbo, text-davinci-003, text-babbage-001, text-curie-001, text-ada-001, llama-2-7b-chat, llama-2-13b-chat, and llama-2-70b-chat, in comparison to expert human evaluators in providing scores, identifying errors, and offering feedback and improvement suggestions to candidates during mock HR (Human Resources) interviews. We introduce a dataset called HURIT (Human Resource Interview Transcripts), which comprises 3,890 HR interview transcripts sourced from real-world HR interview scenarios. Our findings reveal that pre-trained LLMs, particularly GPT-4 Turbo and GPT-3.5 Turbo, exhibit commendable performance and are capable of producing evaluations comparable to those of expert human evaluators. Although these LLMs demonstrate proficiency in providing scores comparable to human experts in terms of human evaluation metrics, they frequently fail to identify errors and offer specific actionable advice for candidate performance improvement in HR interviews. Our research suggests that current state-of-the-art pre-trained LLMs are not yet ready for fully automatic deployment in HR interview assessment. Instead, our findings advocate a human-in-the-loop approach that incorporates manual checks for inconsistencies and provisions for improving feedback quality as a more suitable strategy.
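Comparing LLM-assigned scores against expert scores, as the study does, typically reduces to simple agreement statistics. The sketch below computes Pearson correlation and mean absolute error between two score lists; it is a generic evaluation sketch, not the paper's exact metrics.

```python
def score_agreement(llm_scores, human_scores):
    """Agreement between LLM-assigned and expert human interview scores:
    returns (Pearson correlation, mean absolute error). Assumes the two
    lists are aligned per candidate (illustrative setup)."""
    n = len(llm_scores)
    ml = sum(llm_scores) / n
    mh = sum(human_scores) / n
    cov = sum((a - ml) * (b - mh) for a, b in zip(llm_scores, human_scores))
    sd_l = sum((a - ml) ** 2 for a in llm_scores) ** 0.5
    sd_h = sum((b - mh) ** 2 for b in human_scores) ** 0.5
    r = cov / (sd_l * sd_h)
    mae = sum(abs(a - b) for a, b in zip(llm_scores, human_scores)) / n
    return r, mae
```

High correlation with low MAE supports the abstract's claim that score quality can match human evaluators even when qualitative feedback does not.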


Optimizing Multi-DNN Inference on Mobile Devices through Heterogeneous Processor Co-Execution

Gao, Yunquan, Zhang, Zhiguo, Donta, Praveen Kumar, Dehury, Chinmaya Kumar, Wang, Xiujun, Niyato, Dusit, Zhang, Qiyang

arXiv.org Artificial Intelligence

Deep Neural Networks (DNNs) are increasingly deployed across diverse industries, driving a growing demand to enable their capabilities on mobile devices. However, existing mobile inference frameworks often rely on a single processor to handle each model's inference, limiting hardware utilization and leading to suboptimal performance and energy efficiency. Expanding DNN accessibility on mobile platforms requires more adaptive and resource-efficient solutions to meet increasing computational demands without compromising device functionality. Nevertheless, parallel inference of multiple DNNs on heterogeneous processors remains a significant challenge. Several works have explored partitioning DNN operations into subgraphs to enable parallel execution across heterogeneous processors. However, these approaches typically generate excessive subgraphs based solely on hardware compatibility, increasing scheduling complexity and memory-management overhead. To address these limitations, we propose an Advanced Multi-DNN Model Scheduling (ADMS) strategy that optimizes multi-DNN inference across heterogeneous processors on mobile devices. ADMS constructs an optimal subgraph-partitioning strategy offline, considering both hardware support of operations and scheduling granularity, while employing a processor-state-aware scheduling algorithm that dynamically balances workloads based on real-time operational conditions. This ensures efficient workload distribution and maximizes the utilization of available processors. Experimental results show that, compared to vanilla inference frameworks, ADMS reduced multi-DNN inference latency by 4.04.

To reduce interaction latency and lower server-side computing costs, an increasing number of applications are shifting inference tasks to mobile devices. In many real-world scenarios, multiple independent or related DNN models run concurrently on mobile devices. For instance, in a smart-agriculture scenario, farmers capture video frames using a smartphone camera and perform real-time parallel inference with multiple DNN models, including crop identification [5], pest and disease detection [6], plant health assessment [7], and soil quality analysis [8]. (Gao and X. Wang are with the School of Computer Science and Technology, Anhui Engineering Research Center for Intelligent Applications and Security of Industrial Internet, Anhui University of Technology, Ma'anshan, Anhui, 243032, China.)
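The processor-state-aware scheduling idea can be sketched as a greedy earliest-finish assignment: each subgraph carries estimated latencies per supported processor, and goes to whichever processor would finish it soonest given its current load. This is an illustrative sketch only — ADMS additionally optimizes the subgraph partitioning itself offline, which is not modeled here.

```python
def schedule_subgraphs(subgraphs, processors):
    """Greedy processor-state-aware scheduling sketch.
    subgraphs: {name: {processor: estimated latency}} for supported
    processors only; processors: list of processor names. Each subgraph
    is assigned to the supported processor minimizing its finish time
    (current busy time + estimated cost)."""
    busy = {p: 0.0 for p in processors}
    assignment = {}
    for name, costs in subgraphs.items():
        best = min(costs, key=lambda p: busy[p] + costs[p])
        assignment[name] = best
        busy[best] += costs[best]
    return assignment, busy
```

Tracking per-processor busy time is the "processor state" in this toy version; a real scheduler would also account for thermal throttling, memory pressure, and inter-subgraph dependencies.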