
DexFruit: Dexterous Manipulation and Gaussian Splatting Inspection of Fruit

Swann, Aiden, Qiu, Alex, Strong, Matthew, Zhang, Angelina, Morstein, Samuel, Rayle, Kai, Kennedy, Monroe III

arXiv.org Artificial Intelligence

Abstract--DexFruit is a robotic manipulation framework that enables gentle, autonomous handling of fragile fruit and precise evaluation of damage. Soft fruits have long suffered produce loss in both harvesting and post-harvest processing due to their extreme fragility and susceptibility to bruising, making them among the hardest produce types to manipulate with automation. In this work, we demonstrate that autonomous manipulation of fruit with minimal damage can be achieved by using optical tactile sensing. We show that our tactile-informed diffusion policies outperform baselines in both reduced bruising and pick-and-place success rate across three fruits: strawberries, tomatoes, and blackberries. In addition, we introduce FruitSplat, a novel technique to represent and quantify visual damage in a high-resolution 3D representation via 3D Gaussian Splatting (3DGS). Existing metrics for measuring damage lack quantitative rigor or require expensive equipment. Furthermore, this representation is modular and general, compatible with any relevant 2D model. Overall, we demonstrate a 92% grasping policy success rate, up to a 15% reduction in visual bruising, and up to a 31% improvement in grasp success rate on challenging fruit compared to our baselines across the three tested fruits. We rigorously evaluate these results with over 630 trials. Please check out our website, which contains our code and datasets, at https://dex-fruit.github.io/. To address these impending issues, the agricultural industry has made many strides toward increased application of machinery and automation [4, 5].
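The FruitSplat idea of quantifying damage on a 3D representation can be illustrated schematically: run any 2D bruise-segmentation model on each camera view, project the per-pixel labels onto the Gaussians they render to, and report the bruised fraction of the fruit's surface. The snippet below is only a toy aggregation step under those assumptions; it is not the DexFruit implementation, and the per-Gaussian vote arrays are hypothetical inputs.

```python
import numpy as np

def bruised_fraction(bruise_votes, visible_counts, threshold=0.5):
    """Aggregate per-view 2D bruise labels into a per-Gaussian score.

    bruise_votes   : (n_gaussians,) number of views in which a Gaussian
                     projected into a pixel labeled 'bruised'
    visible_counts : (n_gaussians,) number of views in which it was visible
    Returns the fraction of visible Gaussians whose bruise score exceeds
    the threshold. Toy sketch, not the FruitSplat pipeline.
    """
    visible = visible_counts > 0
    score = np.zeros(len(bruise_votes), dtype=float)
    score[visible] = bruise_votes[visible] / visible_counts[visible]
    return float(np.mean(score[visible] > threshold))

votes = np.array([3, 0, 2, 5, 0])
seen = np.array([4, 4, 2, 5, 0])      # last Gaussian never visible
frac = bruised_fraction(votes, seen)  # 3 of 4 visible Gaussians exceed 0.5
```

Because the aggregation is agnostic to which 2D model produced the labels, this matches the abstract's claim that the representation is compatible with any relevant 2D model.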


6D Strawberry Pose Estimation: Real-time and Edge AI Solutions Using Purely Synthetic Training Data

Sinha, Saptarshi Neil, Kühn, Julius, Goschke, Mika Silvan, Weinmann, Michael

arXiv.org Artificial Intelligence

Automated and selective harvesting of fruits has become an important area of research, particularly due to challenges such as high costs and a shortage of seasonal labor in advanced economies. This paper focuses on 6D pose estimation of strawberries using purely synthetic data generated through a procedural pipeline for photorealistic rendering. We employ the YOLOX-6D-Pose algorithm, a single-shot approach that leverages the YOLOX backbone, known for its balance between speed and accuracy, and its support for edge inference. To address the lack of available training data, we introduce a robust and flexible pipeline for generating synthetic strawberry data from various 3D models via a procedural Blender pipeline, focusing on enhancing the realism of the synthesized data compared to previous work so that it serves as a valuable resource for training pose estimation algorithms. Quantitative evaluations indicate that our models achieve comparable accuracy on both the NVIDIA RTX 3090 and the Jetson Orin Nano across several ADD-S metrics, with the RTX 3090 demonstrating superior processing speed. However, the Jetson Orin Nano is particularly suited for resource-constrained environments, making it an excellent choice for deployment in agricultural robotics. Qualitative assessments further confirm the model's performance, demonstrating its capability to accurately infer the poses of ripe and partially ripe strawberries, while facing challenges in detecting unripe specimens. This suggests opportunities for future improvements, especially in enhancing detection capabilities for unripe strawberries (if desired) by exploring variations in color. Furthermore, the methodology presented could be easily adapted to other fruits such as apples, peaches, and plums, thereby expanding its applicability and impact in the field of agricultural automation.


Diverse Image Captioning with Context Object Split Latent Spaces

Neural Information Processing Systems

The word dimension for the embedding layer is 300. In Tab. 7 we further evaluate the diversity of COS-CVAE using self-CIDEr. We provide additional qualitative results in the tables. In Tab. 12 we show the diverse captions for novel objects generated by our model and the corresponding regions. The evaluation server for nocaps accepts only one caption per image and does not support methods modeling one-to-many relationships between images and captions. In Figure 1 (left) we show the average accuracy and diversity scores, again averaged across annotators; in Figure 1 (right) we show the accuracy and diversity scores from each annotator. We find that the captions generated by COS-CVAE are scored as more accurate compared to COS-CVAE (paired).


Efficient Force and Stiffness Prediction in Robotic Produce Handling with a Piezoresistive Pressure Sensor

Fairchild, Preston, Chen, Claudia, Tan, Xiaobo

arXiv.org Artificial Intelligence

Abstract: Properly handling delicate produce with robotic manipulators is a major part of the future role of automation in agricultural harvesting and processing. Grasping with the correct amount of force is crucial not only for ensuring a proper grip on the object, but also for avoiding damage or bruising to the product. In this work, a flexible pressure sensor that is both low-cost and easy to fabricate is integrated with robotic grippers for working with produce of varying shapes, sizes, and stiffnesses. The sensor is successfully integrated with both a rigid robotic gripper and a pneumatically actuated soft finger. Furthermore, an algorithm is proposed for accelerated estimation of the steady-state value of the sensor output based on the transient response data, to enable real-time applications. The sensor is shown to be effective in incorporating feedback to correctly grasp objects of unknown sizes and stiffnesses. At the same time, the sensor provides estimates for these values which can be utilized for identification of qualities such as ripeness levels and bruising. It is also shown to be able to provide force feedback for objects of variable stiffness. This enables future use not only for produce identification, but also for tasks such as quality control and selective distribution based on ripeness levels. Keywords: robotics, sensing, produce handling, grasping. Highlights: low-cost and easy-to-fabricate sensor for easy implementation with a variety of robotic grippers; fast estimation of settled resistance using an exponential decay curve fit; measurements of grasping force and stiffness of a held object; various produce-handling features such as ripeness monitoring, bruising detection, and size estimation. 1. Introduction: The use of robotic end-effectors for securely grasping objects is a pivotal component in manipulation tasks.
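The accelerated steady-state estimation can be illustrated with a small sketch. For a first-order response R(t) = R_ss + (R_0 - R_ss)·exp(-t/tau), three samples taken at equal time intervals determine R_ss in closed form, long before the sensor physically settles. This is a minimal stand-in for the paper's exponential-decay curve fit; the function name and sampling scheme are illustrative assumptions, not the authors' implementation.

```python
import math

def predict_settled_value(r1, r2, r3):
    """Predict the steady-state value of an exponential decay
    R(t) = R_ss + (R_0 - R_ss) * exp(-t / tau)
    from three samples r1, r2, r3 taken at equal time intervals.
    Illustrative sketch, not the paper's actual fitting routine."""
    denom = r1 + r3 - 2.0 * r2
    if abs(denom) < 1e-12:          # response already settled
        return r2
    return (r1 * r3 - r2 * r2) / denom

# Example: R_ss = 5 kOhm, R_0 = 8 kOhm, tau = 1 s, samples at t = 0, 1, 2 s
samples = [5.0 + 3.0 * math.exp(-t) for t in (0.0, 1.0, 2.0)]
r_ss = predict_settled_value(*samples)   # recovers 5.0 after only 2 s
```

In practice a least-squares fit over many transient samples (as the paper proposes) is more robust to noise than this three-point formula, but the closed form shows why transient data alone suffices to predict the settled resistance.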



Pixels Versus Priors: Controlling Knowledge Priors in Vision-Language Models through Visual Counterfacts

Golovanevsky, Michal, Rudman, William, Lepori, Michael, Bar, Amir, Singh, Ritambhara, Eickhoff, Carsten

arXiv.org Artificial Intelligence

Multimodal Large Language Models (MLLMs) perform well on tasks such as visual question answering, but it remains unclear whether their reasoning relies more on memorized world knowledge or on the visual information present in the input image. To investigate this, we introduce Visual CounterFact, a new dataset of visually realistic counterfactuals that put world-knowledge priors (e.g., a red strawberry) into direct conflict with visual input (e.g., a blue strawberry). Using Visual CounterFact, we show that model predictions initially reflect memorized priors, but shift toward visual evidence in mid-to-late layers. This dynamic reveals a competition between the two modalities, with visual input ultimately overriding priors during evaluation. To control this behavior, we propose Pixels Versus Priors (PvP) steering vectors, a mechanism for steering model outputs toward either world knowledge or visual input through activation-level interventions. On average, PvP successfully shifts 99.3% of color and 80.8% of size predictions from priors to counterfactuals. Together, these findings offer new tools for interpreting and controlling factual behavior in multimodal models.
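Activation-level steering of the kind described can be sketched in a few lines: a steering vector is added to one layer's hidden states at inference time, with a signed scalar controlling direction and strength. The function below is a generic illustration of activation steering, not the authors' PvP implementation; the normalization, layer choice, and coefficient value are assumptions.

```python
import numpy as np

def apply_steering(hidden, steer, alpha):
    """Shift hidden states along a steering direction.

    hidden : (seq_len, d_model) activations at one transformer layer
    steer  : (d_model,) steering vector (e.g. a mean difference between
             'answer from pixels' and 'answer from priors' activations)
    alpha  : signed strength; the sign selects which behavior to favor
    Generic sketch of activation steering, not the PvP codebase.
    """
    direction = steer / np.linalg.norm(steer)
    return hidden + alpha * direction

# Toy usage: steer 4 token activations in an 8-dimensional model
rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 8))
steer = rng.normal(size=8)
steered = apply_steering(hidden, steer, alpha=2.0)
```

Flipping the sign of alpha would push the model the other way, which mirrors the paper's claim that outputs can be moved toward either world knowledge or visual input.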


The First Impression Problem: Internal Bias Triggers Overthinking in Reasoning Models

Dang, Renfei, Li, Zhening, Huang, Shujian, Chen, Jiajun

arXiv.org Artificial Intelligence

Reasoning models often exhibit overthinking, characterized by redundant reasoning steps. We identify \emph{internal bias} elicited by the input question as a key trigger of such behavior. Upon encountering a problem, the model immediately forms a preliminary guess about the answer, which we term an internal bias since it may not be explicitly generated and arises without systematic reasoning. When this guess conflicts with its subsequent reasoning, the model tends to engage in excessive reflection, resulting in wasted computation. We validate the association between internal bias and overthinking across multiple models and diverse reasoning tasks. To demonstrate the causal relationship more rigorously, we conduct two counterfactual interventions, showing that removing the input question once the model has read it reduces redundant reasoning across various complex reasoning tasks, and that manually injecting a bias affects overthinking accordingly. Further interpretability experiments suggest that excessive attention to the input question serves as a key mechanism through which internal bias influences subsequent reasoning trajectories. Finally, we evaluate several methods aimed at mitigating overthinking, yet the influence of internal bias persists under all conditions.


Learning to Pick: A Visuomotor Policy for Clustered Strawberry Picking

Fei, Zhenghao, Lu, Wenwu, Hou, Linsheng, Peng, Chen

arXiv.org Artificial Intelligence

Abstract--Strawberries naturally grow in clusters, interwoven with leaves, stems, and other fruits, which frequently leads to occlusion. This inherent growth habit presents a significant challenge for robotic picking, as traditional percept-plan-control systems struggle to reach fruits amid the clutter. Effectively picking an occluded strawberry demands dexterous manipulation to carefully bypass or gently move the surrounding soft objects and precisely access the ideal picking point, located at the stem just above the calyx. To address this challenge, we introduce a strawberry-picking robotic system that learns from human demonstrations. Our system features a 4-DoF SCARA arm paired with a human teleoperation interface for efficient data collection, and leverages an End Pose Assisted Action Chunking Transformer (ACT) to develop a fine-grained visuomotor picking policy. Experiments under various occlusion scenarios demonstrate that our modified approach significantly outperforms the direct implementation of ACT, underscoring its potential for practical application in occluded strawberry picking. Global demand for strawberries, a high-value crop, continues to rise. While China led production in 2023 with 3,336,690 tons, followed by the US with 1,055,963 tons [1], harvesting remains labor-intensive due to the fruit's fragility. This contrasts with mechanized harvesting of crops like corn and wheat.


LLaDA-VLA: Vision Language Diffusion Action Models

Wen, Yuqing, Li, Hebei, Gu, Kefan, Zhao, Yucheng, Wang, Tiancai, Sun, Xiaoyan

arXiv.org Artificial Intelligence

The rapid progress of auto-regressive vision-language models (VLMs) has inspired growing interest in vision-language-action models (VLAs) for robotic manipulation. Recently, masked diffusion models, a paradigm distinct from autoregressive models, have begun to demonstrate competitive performance in text generation and multimodal applications, leading to the development of a series of diffusion-based VLMs (d-VLMs). However, leveraging such models for robot policy learning remains largely unexplored. In this work, we present LLaDA-VLA, the first Vision-Language-Diffusion-Action model built upon pretrained d-VLMs for robotic manipulation. To effectively adapt d-VLMs to the robotic domain, we introduce two key designs: (1) a localized special-token classification strategy that replaces full-vocabulary classification with special action-token classification, reducing adaptation difficulty; (2) a hierarchical action-structured decoding strategy that decodes action sequences hierarchically, considering the dependencies within and across actions. Extensive experiments demonstrate that LLaDA-VLA significantly outperforms state-of-the-art VLAs on both simulated and real-world robots.
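The localized special-token classification strategy amounts to restricting the output distribution to a small set of reserved action tokens instead of scoring the full vocabulary. The sketch below shows that idea as simple logit masking; the token ids and function name are illustrative assumptions, not taken from the LLaDA-VLA code.

```python
import numpy as np

# Hypothetical ids reserved for discretized action tokens
ACTION_TOKEN_IDS = [50257, 50258, 50259, 50260]

def localized_action_logits(logits, action_ids=ACTION_TOKEN_IDS):
    """Mask a full-vocabulary logit vector so that only special action
    tokens can be predicted (illustrative sketch of localized
    special-token classification)."""
    masked = np.full_like(logits, -np.inf)
    masked[action_ids] = logits[action_ids]
    return masked

logits = np.zeros(50300)
logits[123] = 9.0        # a high-scoring ordinary word token
logits[50259] = 3.0      # best score among the reserved action tokens
pred = int(np.argmax(localized_action_logits(logits)))  # picks 50259, not 123
```

Restricting the classification head this way shrinks the effective output space from tens of thousands of word tokens to a handful of action tokens, which is the adaptation-difficulty reduction the abstract describes.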