
Neural Information Processing Systems

We conducted experiments replacing the proposal generator, including MaskFormer [3] and the RPN in Mask R-CNN combined with a class-agnostic segmentation head [6, 7] (denoted as RPN+Seghead). We also report results for generating different numbers of proposals (N) with Mask2Former. Note that the original setting of MicroSeg uses Mask2Former with N = 100.
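As a rough illustration, varying the proposal budget N amounts to keeping only the top-N scoring class-agnostic mask proposals. A minimal sketch (function and variable names are our own, not from the paper):

```python
import numpy as np

def select_top_proposals(scores, masks, n):
    """Keep the n highest-scoring class-agnostic mask proposals.

    scores: (P,) proposal confidence scores; masks: (P, ...) binary masks.
    Returns the selected scores and masks, sorted by descending score.
    """
    order = np.argsort(scores)[::-1][:n]  # indices of the top-n scores
    return scores[order], masks[order]
```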


GNeSF: Generalizable Neural Semantic Fields Supplementary Material

Neural Information Processing Systems

To extract features from these source views, we employ a network with shared weights. Each vertex of the feature volume grid is projected onto the image feature maps, and its image features are obtained by interpolation. In comparison, our method segments accurately across various scenes. In several instances, our method correctly segments objects where Mask2Former produces incorrect results; for example, Mask2Former fails to segment objects such as the table in the third row. We show more qualitative comparisons with NeuralRecon in Figure 1.
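The vertex feature lookup described above (pinhole projection of each grid vertex, followed by bilinear interpolation on the feature map) can be sketched as follows. This is a minimal NumPy illustration under assumed camera conventions, not the paper's implementation:

```python
import numpy as np

def project_vertices(vertices, K, R, t):
    """Project 3D vertices (N, 3) to pixel coords via a pinhole camera.

    K: 3x3 intrinsics; R, t: world-to-camera rotation and translation.
    """
    cam = vertices @ R.T + t        # world frame -> camera frame
    uvw = cam @ K.T                 # apply intrinsics
    return uvw[:, :2] / uvw[:, 2:3] # perspective divide -> (N, 2) pixels

def bilinear_sample(feat, uv):
    """Bilinearly interpolate a feature map feat (H, W, C) at pixels uv (N, 2)."""
    H, W, _ = feat.shape
    u = np.clip(uv[:, 0], 0, W - 1 - 1e-6)
    v = np.clip(uv[:, 1], 0, H - 1 - 1e-6)
    u0, v0 = np.floor(u).astype(int), np.floor(v).astype(int)
    du, dv = (u - u0)[:, None], (v - v0)[:, None]
    f00, f01 = feat[v0, u0], feat[v0, u0 + 1]       # top-left, top-right
    f10, f11 = feat[v0 + 1, u0], feat[v0 + 1, u0 + 1]
    top = f00 * (1 - du) + f01 * du
    bot = f10 * (1 - du) + f11 * du
    return top * (1 - dv) + bot * dv                # (N, C) interpolated features
```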




GVCCS: A Dataset for Contrail Identification and Tracking on Visible Whole Sky Camera Sequences

Jarry, Gabriel, Dalmau, Ramon, Very, Philippe, Ballerini, Franck, Bocu, Stefania-Denisa

arXiv.org Artificial Intelligence

Aviation's climate impact includes not only CO2 emissions but also significant non-CO2 effects, especially from contrails. These ice clouds can alter Earth's radiative balance, potentially rivaling the warming effect of aviation CO2. Physics-based models provide useful estimates of contrail formation and climate impact, but their accuracy depends heavily on the quality of atmospheric input data and on assumptions used to represent complex processes like ice particle formation and humidity-driven persistence. Observational data from remote sensors, such as satellites and ground cameras, could be used to validate and calibrate these models. However, existing datasets do not cover all aspects of contrail dynamics and formation: they typically lack temporal tracking and do not attribute contrails to their source flights. To address these limitations, we present the Ground Visible Camera Contrail Sequences (GVCCS), a new open dataset of contrails recorded with a ground-based all-sky camera in the visible range. Each contrail is individually labeled and tracked over time, allowing a detailed analysis of its lifecycle. The dataset contains 122 video sequences (24,228 frames) and includes flight identifiers for contrails that form above the camera. As a reference, we also propose a unified deep learning framework for contrail analysis using a panoptic segmentation model that performs semantic segmentation (contrail pixel identification), instance segmentation (individual contrail separation), and temporal tracking in a single architecture. By providing high-quality, temporally resolved annotations and a benchmark for model evaluation, our work supports improved contrail monitoring and will facilitate better calibration of physical models. This sets the groundwork for more accurate climate impact understanding and assessments.


Vision-Guided Loco-Manipulation with a Snake Robot

Salagame, Adarsh, Potluri, Sasank, Vaidyanathan, Keshav Bharadwaj, Gangaraju, Kruthika, Sihite, Eric, Ramezani, Milad, Ramezani, Alireza

arXiv.org Artificial Intelligence

This paper presents the development and integration of a vision-guided loco-manipulation pipeline for Northeastern University's snake robot, COBRA. The system leverages a YOLOv8-based object detection model and depth data from an onboard stereo camera to estimate the 6-DOF pose of target objects in real time. We introduce a framework for autonomous detection and control, enabling closed-loop loco-manipulation for transporting objects to specified goal locations. Additionally, we demonstrate open-loop experiments in which COBRA successfully performs real-time object detection and loco-manipulation tasks.
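One typical way such a pipeline recovers the translational part of an object's pose is to back-project the detected bounding-box center using the stereo depth map and the camera intrinsics. The sketch below is a hypothetical illustration of that step, not the actual COBRA code:

```python
import numpy as np

def backproject_center(bbox, depth_map, K):
    """Back-project a detection's bbox center to a 3D point in the camera frame.

    Hypothetical helper: bbox is (x1, y1, x2, y2) in pixels, depth_map is an
    (H, W) metric depth image from the stereo camera, K is the 3x3 intrinsics.
    """
    u = (bbox[0] + bbox[2]) / 2.0                       # center pixel column
    v = (bbox[1] + bbox[3]) / 2.0                       # center pixel row
    z = float(depth_map[int(round(v)), int(round(u))])  # metric depth there
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (u - cx) * z / fx                               # inverse pinhole model
    y = (v - cy) * z / fy
    return np.array([x, y, z])
```

A full 6-DOF estimate would add orientation, e.g. from the object's depth patch or a known model, which the sketch does not cover.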


Spike2Former: Efficient Spiking Transformer for High-performance Image Segmentation

Lei, Zhenxin, Yao, Man, Hu, Jiakui, Luo, Xinhao, Lu, Yanye, Xu, Bo, Li, Guoqi

arXiv.org Artificial Intelligence

Spiking Neural Networks (SNNs) have a low-power advantage but perform poorly in image segmentation tasks. The reason is that directly converting neural networks with complex architectural designs for segmentation tasks into spiking versions leads to performance degradation and non-convergence. To address this challenge, we first identify the modules in the architecture design that cause a severe reduction in spike firing, make targeted improvements, and propose the Spike2Former architecture. Second, we propose normalized integer spiking neurons to solve the training stability problem of SNNs with complex architectures. We set a new state of the art for SNNs on various semantic segmentation datasets, with significant improvements of +12.7% mIoU at 5.0× efficiency on ADE20K, +14.3% mIoU at 5.2× efficiency on VOC2012, and +9.1% mIoU at 6.6× efficiency on CityScapes.
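As a loose illustration of integer-valued spiking, a toy neuron can emit a bounded integer spike count per timestep instead of a binary spike, which carries more information through deep layers. This sketch is our own interpretation of the general idea, not the paper's normalized integer spiking neuron:

```python
import numpy as np

def integer_spike_neuron(inputs, threshold=1.0, v_max=4):
    """Toy integer-valued spiking neuron with soft reset.

    At each timestep the membrane potential accumulates the input, emits an
    integer spike count floor(v / threshold) clipped to [0, v_max], and
    subtracts the emitted charge (soft reset). Illustrative only.
    """
    v = 0.0
    spikes = []
    for x in inputs:
        v += x                                               # integrate input
        s = int(np.clip(np.floor(v / threshold), 0, v_max))  # integer spike count
        spikes.append(s)
        v -= s * threshold                                   # soft reset
    return spikes
```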


Benchmarking Pretrained Attention-based Models for Real-Time Recognition in Robot-Assisted Esophagectomy

de Jong, Ronald L. P. D., Khalil, Yasmina al, Jaspers, Tim J. M., van Jaarsveld, Romy C., Kuiper, Gino M., Li, Yiping, van Hillegersberg, Richard, Ruurda, Jelle P., Breeuwer, Marcel, van der Sommen, Fons

arXiv.org Artificial Intelligence

Esophageal cancer is among the most common types of cancer worldwide. It is traditionally treated using open esophagectomy, but in recent years, robot-assisted minimally invasive esophagectomy (RAMIE) has emerged as a promising alternative. However, robot-assisted surgery can be challenging for novice surgeons, as they often suffer from a loss of spatial orientation. Computer-aided anatomy recognition holds promise for improving surgical navigation, but research in this area remains limited. In this study, we developed a comprehensive dataset for semantic segmentation in RAMIE, featuring the largest collection of vital anatomical structures and surgical instruments to date. Handling this diverse set of classes presents challenges, including class imbalance and the recognition of complex structures such as nerves. This study aims to understand the challenges and limitations of current state-of-the-art algorithms on this novel dataset and problem. Therefore, we benchmarked eight real-time deep learning models using two pretraining datasets. We assessed both traditional and attention-based networks, hypothesizing that attention-based networks better capture global patterns and address challenges such as occlusion caused by blood or other tissues. The benchmark includes our RAMIE dataset and the publicly available CholecSeg8k dataset, enabling a thorough assessment of surgical segmentation tasks. Our findings indicate that pretraining on ADE20k, a dataset for semantic segmentation, is more effective than pretraining on ImageNet. Furthermore, attention-based models outperform traditional convolutional neural networks, with SegNeXt and Mask2Former achieving higher Dice scores, and Mask2Former additionally excelling in average symmetric surface distance.