Unberath, Mathias
Robustness in Deep Learning for Computer Vision: Mind the gap?
Drenkow, Nathan, Sani, Numair, Shpitser, Ilya, Unberath, Mathias
Deep neural networks for computer vision tasks are deployed in increasingly safety-critical and socially impactful applications, motivating the need to close the gap in model performance under varied, naturally occurring imaging conditions. Robustness, a term used ambiguously across multiple contexts including adversarial machine learning, here refers to preserving model performance under naturally induced image corruptions or alterations. We perform a systematic review to identify, analyze, and summarize current definitions and progress towards non-adversarial robustness in deep learning for computer vision. We find that this area of research has received disproportionately little attention relative to adversarial machine learning, yet a significant robustness gap exists that often manifests in performance degradation similar in magnitude to that observed under adversarial conditions. To provide a more transparent definition of robustness across contexts, we introduce a structural causal model of the data-generating process and interpret non-adversarial robustness as pertaining to a model's behavior on corrupted images that correspond to low-probability samples from the unaltered data distribution. We then identify key architecture, data augmentation, and optimization tactics for improving neural network robustness. This causal view of robustness reveals that common practices in the current literature, both with regard to robustness tactics and evaluations, correspond to causal concepts such as soft interventions resulting in a counterfactually altered distribution of imaging conditions. Through our findings and analysis, we offer perspectives on how future research may mind this evident and significant non-adversarial robustness gap.
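The measurement the abstract refers to can be pictured with a minimal sketch, not the authors' code: the same classifier is scored on clean images and on copies whose imaging conditions were altered by a "soft intervention" such as sensor noise or an exposure change, and the difference is the non-adversarial robustness gap. The `model`, image set, and corruption parameters below are hypothetical placeholders.

```python
import numpy as np

def gaussian_noise(img, sigma=0.08):
    """Simulate sensor noise; img is a float array scaled to [0, 1]."""
    return np.clip(img + np.random.normal(0.0, sigma, img.shape), 0.0, 1.0)

def brightness_shift(img, delta=0.2):
    """Simulate an exposure (brightness) change."""
    return np.clip(img + delta, 0.0, 1.0)

def accuracy(model, images, labels):
    preds = [model(img) for img in images]
    return float(np.mean([p == y for p, y in zip(preds, labels)]))

def robustness_gap(model, images, labels, corruption):
    """Clean accuracy minus corrupted accuracy: the gap discussed above."""
    clean = accuracy(model, images, labels)
    corrupted = accuracy(model, [corruption(img) for img in images], labels)
    return clean - corrupted
```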
An Interpretable Algorithm for Uveal Melanoma Subtyping from Whole Slide Cytology Images
Chen, Haomin, Liu, T. Y. Alvin, Gomez, Catalina, Correa, Zelia, Unberath, Mathias
Algorithmic decision support is rapidly becoming a staple of personalized medicine, especially for high-stakes recommendations in which access to certain information can drastically alter the course of treatment, and thus, patient outcome; a prominent example is radiomics for cancer subtyping. Because in these scenarios the stakes are high, it is desirable for decision systems to not only provide recommendations but also supply transparent reasoning in support thereof. For learning-based systems, this can be achieved through an interpretable design of the inference pipeline. Herein we describe an automated yet interpretable system for uveal melanoma subtyping with digital cytology images from fine needle aspiration biopsies. Our method embeds every automatically segmented cell of a candidate cytology image as a point in a 2D manifold defined by many representative slides, which enables reasoning about the cell-level composition of the tissue sample, paving the way for interpretable subtyping of the biopsy. Finally, a rule-based slide-level classification algorithm is trained on the partitions of the circularly distorted 2D manifold. This process results in a simple rule set that is evaluated automatically yet remains highly transparent for human verification. On our in-house cytology dataset of 88 uveal melanoma patients, the proposed method achieves an accuracy of 87.5%, which compares favorably to all competing approaches, including deep "black box" models. The method comes with a user interface to facilitate interaction with cell-level content, which may offer additional insights for pathological assessment.
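A minimal, hypothetical sketch of the interpretable pipeline described above: per-cell feature vectors are projected onto a 2D manifold, the manifold is partitioned into angular sectors around its centroid (a simplification of the paper's circularly distorted manifold), and a transparent threshold rule on the resulting per-slide sector histogram yields the subtype call. The feature extractor, sector count, class names, and thresholds are illustrative assumptions, not the published configuration.

```python
import numpy as np
from sklearn.decomposition import PCA

def embed_cells_2d(cell_features):
    """Project per-cell feature vectors (n_cells x n_features) to a 2D manifold."""
    return PCA(n_components=2).fit_transform(cell_features)

def sector_histogram(points_2d, n_sectors=8):
    """Fraction of a slide's cells falling into each angular sector."""
    centered = points_2d - points_2d.mean(axis=0)
    angles = np.arctan2(centered[:, 1], centered[:, 0])  # angles in [-pi, pi]
    bins = np.linspace(-np.pi, np.pi, n_sectors + 1)
    counts, _ = np.histogram(angles, bins=bins)
    return counts / counts.sum()

def classify_slide(cell_features, high_risk_sector=0, threshold=0.25):
    """Human-readable rule: flag the slide when enough cells occupy the
    sector assumed to hold the high-risk cell phenotype."""
    hist = sector_histogram(embed_cells_2d(cell_features))
    return "high-risk subtype" if hist[high_risk_sector] > threshold else "low-risk subtype"
```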
E-DSSR: Efficient Dynamic Surgical Scene Reconstruction with Transformer-based Stereoscopic Depth Perception
Long, Yonghao, Li, Zhaoshuo, Yee, Chi Hang, Ng, Chi Fai, Taylor, Russell H., Unberath, Mathias, Dou, Qi
Reconstructing the scene of robotic surgery from stereo endoscopic video is an important and promising topic in surgical data science, with the potential to support many applications such as surgical visual perception, robotic surgery education, and intra-operative context awareness. However, current methods are mostly restricted to reconstructing static anatomy, assuming no tissue deformation, no tool occlusion or de-occlusion, and no camera movement; these assumptions are not always satisfied in minimally invasive robotic surgery. In this work, we present an efficient reconstruction pipeline for highly dynamic surgical scenes that runs at 28 fps. Specifically, we design a transformer-based stereoscopic depth perception module for efficient depth estimation and a lightweight tool segmentor to handle tool occlusion. A dynamic reconstruction algorithm that estimates tissue deformation and camera movement, and aggregates the information over time, then produces the surgical scene reconstruction. We evaluate the proposed pipeline on two datasets, the public Hamlyn Centre Endoscopic Video Dataset and our in-house DaVinci robotic surgery dataset. The results demonstrate that our method can recover scene regions obstructed by the surgical tool and handle camera movement in realistic surgical scenarios effectively at real-time speed.
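A structural sketch of the per-frame flow described in the abstract, written under assumptions rather than from the released code: a stereo depth network predicts depth, a lightweight segmentor masks out surgical tools, and the remaining tissue observations are handed to a dynamic reconstruction step that accounts for tissue deformation and camera motion. `depth_net`, `tool_segmentor`, and `dynamic_recon` are hypothetical callables standing in for the paper's modules.

```python
import numpy as np

def reconstruct_frame(left_img, right_img, depth_net, tool_segmentor, dynamic_recon, state):
    depth = depth_net(left_img, right_img)              # H x W depth from the stereo network
    tool_mask = tool_segmentor(left_img)                 # H x W boolean mask, True on tool pixels
    tissue_depth = np.where(tool_mask, np.nan, depth)    # drop tool pixels before fusion
    # The dynamic reconstruction updates its internal scene model with the new
    # observation, estimating tissue deformation and camera motion jointly.
    state = dynamic_recon(state, left_img, tissue_depth)
    return state
```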
Relational Graph Learning on Visual and Kinematics Embeddings for Accurate Gesture Recognition in Robotic Surgery
Long, Yong-Hao, Wu, Jie-Ying, Lu, Bo, Jin, Yue-Ming, Unberath, Mathias, Liu, Yun-Hui, Heng, Pheng-Ann, Dou, Qi
Automatic surgical gesture recognition is fundamentally important to enable intelligent cognitive assistance in robotic surgery. With recent advancements in robot-assisted minimally invasive surgery, rich information including surgical videos and robotic kinematics can be recorded, providing complementary knowledge for understanding surgical gestures. However, existing methods either solely adopt uni-modal data or directly concatenate multi-modal representations, which cannot sufficiently exploit the informative correlations inherent in visual and kinematics data to boost gesture recognition accuracy. In this regard, we propose a novel multimodal relational graph network (i.e., MRG-Net) to dynamically integrate visual and kinematics information through interactive message propagation in the latent feature space. Specifically, we first extract embeddings from video and kinematics sequences with temporal convolutional networks and LSTM units. Next, we identify multiple relations among these multi-modal features and model them through a hierarchical relational graph learning module. The effectiveness of our method is demonstrated with state-of-the-art results on the public JIGSAWS dataset, outperforming current uni-modal and multi-modal methods on both suturing and knot tying tasks. Furthermore, we validated our method on in-house visual-kinematics datasets collected with da Vinci Research Kit (dVRK) platforms in two centers, achieving consistently promising performance.
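A rough PyTorch sketch of the fusion idea summarized above: per-frame visual features pass through a temporal convolution, kinematics pass through an LSTM, and the two embeddings exchange information through a learned message-passing step before per-frame gesture classification. Layer sizes, the single message-passing round, and the class count are illustrative assumptions, not the paper's exact MRG-Net architecture.

```python
import torch
import torch.nn as nn

class VisKinFusion(nn.Module):
    def __init__(self, vis_dim=512, kin_dim=16, hidden=128, n_gestures=10):
        super().__init__()
        self.vis_enc = nn.Conv1d(vis_dim, hidden, kernel_size=3, padding=1)  # temporal conv over frames
        self.kin_enc = nn.LSTM(kin_dim, hidden, batch_first=True)            # kinematics encoder
        self.msg_v2k = nn.Linear(hidden, hidden)   # message: visual node -> kinematics node
        self.msg_k2v = nn.Linear(hidden, hidden)   # message: kinematics node -> visual node
        self.classifier = nn.Linear(2 * hidden, n_gestures)

    def forward(self, vis_feats, kin_seq):
        # vis_feats: (B, T, vis_dim), kin_seq: (B, T, kin_dim)
        v = self.vis_enc(vis_feats.transpose(1, 2)).transpose(1, 2)  # (B, T, hidden)
        k, _ = self.kin_enc(kin_seq)                                 # (B, T, hidden)
        # One round of relational message passing between the two modalities.
        v_upd = torch.relu(v + self.msg_k2v(k))
        k_upd = torch.relu(k + self.msg_v2k(v))
        return self.classifier(torch.cat([v_upd, k_upd], dim=-1))   # per-frame gesture logits
```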
Artificial Intelligence-based Clinical Decision Support for COVID-19 -- Where Art Thou?
Unberath, Mathias, Ghobadi, Kimia, Levin, Scott, Hinson, Jeremiah, Hager, Gregory D
Prior to January 2020, the artificial intelligence and machine learning (AI/ML) for healthcare community had many reasons to be pleased with the recent progress of their field. Learning-based algorithms had been shown to accurately forecast the onset of septic shock [1], ML-based pattern recognition methods classified skin lesions with dermatologist-level accuracy [2], diagnostic AI systems successfully identified diabetic retinopathy during routine primary care visits [3], AI-based breast cancer screening outperformed radiologists by a fairly large margin [4], ML-driven triaging tools improved outcome differentiation beyond the emergency severity index [5], AI-enabled assistance systems simplified interventional workflows [6], and algorithm-driven organizational studies enabled redesign of infusion centers [7]. Many would have argued that, after nearly 60 years on the test bench [8], AI in healthcare had finally reached a level of maturity, performance, and reliability that was compatible with the unforgiving requirements imposed by clinical practice. Today, only a few months later, this rather sunny outlook has become overcast. The world's healthcare systems are facing the outbreak of a novel respiratory disease, COVID-19.
Self-supervised Learning for Dense Depth Estimation in Monocular Endoscopy
Liu, Xingtong, Sinha, Ayushi, Ishii, Masaru, Hager, Gregory D., Reiter, Austin, Taylor, Russell H., Unberath, Mathias
Minimally invasive procedures in the head and neck, e.g., functional endoscopic sinus surgery, typically employ surgical navigation systems to provide surgeons with additional anatomical and positional information. This helps them avoid critical structures, such as the brain, eyes, and major arteries, that are spatially close to the sinus cavities and must not be disturbed during surgery. Computer vision-based navigation systems that rely on the intra-operative endoscopic video stream and do not introduce additional hardware are both easy to integrate into clinical workflow and cost-effective. Such systems generally require registration of pre-operative data, such as CT scans or statistical models, to the intra-operative video data [1], [2], [3], [4]. This registration must be highly accurate in order to guarantee reliable performance of the navigation system. To enable an accurate registration, a feature-based video-CT registration algorithm requires accurate and sufficiently dense intra-operative 3D reconstructions of the anatomy from endoscopic videos.