Jain, Amit
Visual Language Models as Operator Agents in the Space Domain
Carrasco, Alejandro, Nedungadi, Marco, Zucchelli, Enrico M., Jain, Amit, Rodriguez-Fernandez, Victor, Linares, Richard
Since the emergence of the LLM trend, initiated by the first release of ChatGPT [1], these systems have undergone continuous development and evolved into multimodal architectures. Multimodal models, such as GPT-4o [2], LLaMA 3.2 [3], and Claude with its latest 3.5 Sonnet model [4], integrate language understanding with non-language capabilities, including vision and audio processing. This progression unlocks new opportunities for developing intelligent agents that recognize and interpret patterns not only at a semantic level but also through components that incorporate other types of unstructured data into prompts, significantly expanding their potential applications and impact. Extending these capabilities, Vision-Language Models (VLMs) build on multimodal principles by integrating visual reasoning into the LLM framework. By introducing new tokens into the prompt to represent image frames, VLMs enable simultaneous semantic and visual reasoning. This enhancement is particularly valuable in dynamic applications such as robotics, where the integration of vision and language reasoning enables systems to generate environment-responsive actions. Such actions, often described as descriptive policies, translate reasoning into meaningful, executable commands. Language models able to generate such commands are usually referred to as "agentic".
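To make the "agentic" pattern concrete, here is a minimal sketch of an operator-agent step: a camera frame and a goal are sent to a multimodal model, and its text reply is parsed into an executable command. The action schema, prompt, and model name are illustrative assumptions, not the paper's actual interface; the OpenAI Python SDK stands in for whichever VLM backend is used.

```python
# Sketch of one VLM operator-agent step: frame + goal in, one command out.
# Assumes the `openai` package (>= 1.0) and OPENAI_API_KEY in the environment.
import base64
import json
from openai import OpenAI

client = OpenAI()

def propose_action(frame_path: str, goal: str) -> dict:
    """Ask a VLM to map the current frame and goal to one discrete action."""
    with open(frame_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Goal: {goal}. Reply with JSON only: "
                         '{"action": <str>, "argument": <float>}'},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    # The model's text reply *is* the descriptive policy output; a real agent
    # would validate it against an allowed action set before executing.
    return json.loads(response.choices[0].message.content)

# Example: propose_action("frame_0042.png", "null the relative velocity")
```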
Data-Driven Shape Sensing in Continuum Manipulators via Sliding Resistive Flex Sensors
Zhang, Chenhan, Jiang, Shaopeng, Wang, Heyun, Liu, Joshua, Jain, Amit, Armand, Mehran
We introduce a novel shape-sensing method using Resistive Flex Sensors (RFS) embedded in cable-driven Continuum Dexterous Manipulators (CDMs). The RFS is predominantly sensitive to deformation rather than direct forces, making it a distinctive tool for shape sensing. The RFS unit we designed is a considerably less expensive yet robust alternative, offering accuracy and real-time performance comparable to existing shape-sensing methods used for CDMs proposed for minimally invasive surgery. Our design allows the RFS to move along and inside the CDM, conforming to its curvature, so resistance readings can be captured at various bending positions without the need for elaborate sensor setups. The RFS unit is calibrated using an overhead camera and a ResNet machine-learning framework. Experiments using a 3D-printed prototype of the CDM achieved an average shape-estimation error of 0.968 mm with a standard error of 0.275 mm. The response time of the model was approximately 1.16 ms, making real-time shape sensing feasible. While this preliminary study successfully showed the feasibility of our approach for C-shaped CDM deformations with non-constant curvature, we are currently extending the results to show feasibility for more complex CDM configurations, such as S-shapes created in obstructed environments or in the presence of external forces.
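The calibration step amounts to a learned regression from resistance readings to shape. Below is a minimal sketch of that idea in PyTorch: a small residual network (ResNet-style skip connections) maps a vector of RFS readings taken at several sliding positions to a set of 2D shape points, trained against camera-derived ground truth. The input/output dimensions, block depth, and width are assumptions for illustration; the paper's exact architecture and data pipeline differ.

```python
# Sketch: regress CDM shape points from RFS resistance readings.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return torch.relu(x + self.body(x))  # ResNet-style skip connection

class ShapeRegressor(nn.Module):
    """Map n_readings resistance values -> n_points (x, y) shape samples."""
    def __init__(self, n_readings: int = 8, n_points: int = 10, width: int = 64):
        super().__init__()
        self.n_points = n_points
        self.inp = nn.Linear(n_readings, width)
        self.blocks = nn.Sequential(*(ResidualBlock(width) for _ in range(3)))
        self.out = nn.Linear(width, 2 * n_points)

    def forward(self, resistances):
        h = torch.relu(self.inp(resistances))
        return self.out(self.blocks(h)).view(-1, self.n_points, 2)

# Training would minimize MSE against points extracted from the overhead
# camera: loss = nn.functional.mse_loss(model(resistances), camera_points)
```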
Data Filtering Networks
Fang, Alex, Jose, Albin Madappally, Jain, Amit, Schmidt, Ludwig, Toshev, Alexander, Shankar, Vaishaal
Large training sets have become a cornerstone of machine learning and are the foundation for recent advances in language modeling and multimodal learning. While data curation for pre-training is often still ad hoc, one common paradigm is to first collect a massive pool of data from the Web and then filter this candidate pool down to an actual training set via various heuristics. In this work, we study the problem of learning a data filtering network (DFN) for this second step of filtering a large uncurated dataset. Our key finding is that the quality of a network for filtering is distinct from its performance on downstream tasks: for instance, a model that performs well on ImageNet can yield worse training sets than a model with low ImageNet accuracy that is trained on a small amount of high-quality data. Based on our insights, we construct new data filtering networks that induce state-of-the-art image-text datasets. Specifically, our best-performing dataset DFN-5B enables us to train state-of-the-art CLIP models for their compute budgets: among other improvements on a variety of tasks, a ViT-H trained on our dataset achieves 84.4% zero-shot transfer accuracy on ImageNet, outperforming models trained on other datasets such as LAION-2B, DataComp-1B, or OpenAI's WIT. In order to facilitate further research in dataset design, we also release a new 2-billion-example dataset, DFN-2B, and show that high-performance data filtering networks can be trained from scratch using only publicly available data.
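The filtering step itself is simple to sketch: score every candidate image-text pair with the filtering network and keep only the highest-scoring fraction of the pool. In the sketch below, a stock OpenAI CLIP checkpoint from Hugging Face stands in for a trained DFN, and the keep fraction is an illustrative assumption rather than the paper's setting.

```python
# Sketch: rank image-text pairs by a CLIP-style similarity score and keep
# the top fraction of the pool. Requires `torch`, `transformers`, `Pillow`.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def dfn_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between image and caption embeddings."""
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())

def filter_pool(pairs, keep_fraction: float = 0.2):
    """Keep the highest-scoring fraction of (image, caption) pairs."""
    scored = sorted(pairs, key=lambda p: dfn_score(*p), reverse=True)
    return scored[: int(len(scored) * keep_fraction)]
```

The paper's central point survives the simplification: swapping a different scoring network into `dfn_score` changes the induced training set, and filtering quality is not predicted by the scorer's own downstream accuracy.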