CholecTriplet2022: Show me a tool and tell me the triplet -- an endoscopic vision challenge for surgical action triplet detection
Nwoye, Chinedu Innocent, Yu, Tong, Sharma, Saurav, Murali, Aditya, Alapatt, Deepak, Vardazaryan, Armine, Yuan, Kun, Hajek, Jonas, Reiter, Wolfgang, Yamlahi, Amine, Smidt, Finn-Henri, Zou, Xiaoyang, Zheng, Guoyan, Oliveira, Bruno, Torres, Helena R., Kondo, Satoshi, Kasai, Satoshi, Holm, Felix, Özsoy, Ege, Gui, Shuangchun, Li, Han, Raviteja, Sista, Sathish, Rachana, Poudel, Pranav, Bhattarai, Binod, Wang, Ziheng, Rui, Guo, Schellenberg, Melanie, Vilaça, João L., Czempiel, Tobias, Wang, Zhenkun, Sheet, Debdoot, Thapa, Shrawan Kumar, Berniker, Max, Godau, Patrick, Morais, Pedro, Regmi, Sudarshan, Tran, Thuy Nuong, Fonseca, Jaime, Nölke, Jan-Hinrich, Lima, Estevão, Vazquez, Eduard, Maier-Hein, Lena, Navab, Nassir, Mascagni, Pietro, Seeliger, Barbara, Gonzalez, Cristians, Mutter, Didier, Padoy, Nicolas
Formalizing surgical activities as triplets of the instruments used, actions performed, and target anatomies is becoming a gold-standard approach to surgical activity modeling. The benefit is that this formalization yields a more detailed understanding of tool-tissue interaction, which can be used to develop better Artificial Intelligence assistance for image-guided surgery. Earlier efforts and the CholecTriplet challenge introduced in 2021 have put together techniques aimed at recognizing these triplets from surgical footage. Also estimating the spatial locations of the triplets would offer more precise intraoperative context-aware decision support for computer-assisted intervention. This paper presents the CholecTriplet2022 challenge, which extends surgical action triplet modeling from recognition to detection. It includes weakly-supervised bounding box localization of every visible surgical instrument (or tool), as the key actors, and the modeling of each tool-activity in the form of an ⟨instrument, verb, target⟩ triplet.
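To make the triplet-detection formalization concrete, the sketch below shows one possible way to represent a frame-level prediction as a data structure. This is an illustrative assumption, not the challenge's official format: the Python class, the example vocabulary entries, and the normalized (x, y, w, h) box convention are all hypothetical.

```python
# Illustrative sketch (not the challenge's official code): a minimal data
# structure for a detected surgical action triplet. The field names follow the
# <instrument, verb, target> formalization described above; the example
# vocabulary entries and the normalized xywh box format are assumptions.
from dataclasses import dataclass
from typing import Tuple


@dataclass
class TripletDetection:
    instrument: str                          # e.g. "grasper"
    verb: str                                # e.g. "retract"
    target: str                              # e.g. "gallbladder"
    box: Tuple[float, float, float, float]   # instrument box, normalized (x, y, w, h)
    score: float                             # detection confidence in [0, 1]


# A frame-level prediction is then simply a list of such detections:
frame_prediction = [
    TripletDetection("grasper", "retract", "gallbladder", (0.42, 0.31, 0.18, 0.22), 0.91),
    TripletDetection("hook", "dissect", "cystic_duct", (0.55, 0.48, 0.12, 0.15), 0.84),
]
```

Note that in the weakly-supervised setting described in the abstract, only frame-level triplet labels are available at training time, so instrument boxes like the ones above would have to be inferred (e.g. from attention or class-activation maps) rather than regressed from box annotations.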
Learning from a tiny dataset of manual annotations: a teacher/student approach for surgical phase recognition
Yu, Tong, Mutter, Didier, Marescaux, Jacques, Padoy, Nicolas
Vision algorithms capable of interpreting scenes from a real-time video stream are necessary for computer-assisted surgery systems to achieve context-aware behavior. In laparoscopic procedures, one particular algorithm needed for such systems is the identification of surgical phases, for which the current state of the art is a model based on a CNN-LSTM. A number of previous works using models of this kind have trained them in a fully supervised manner, requiring a fully annotated dataset. Instead, our work confronts the problem of learning surgical phase recognition in scenarios with scarce amounts of annotated data (under 25% of all available video recordings). We propose a teacher/student type of approach, where a strong predictor called the teacher, trained beforehand on a small dataset of ground-truth-annotated videos, generates synthetic annotations for a larger dataset, which another model - the student - learns from. In our case, the teacher features a novel CNN-biLSTM-CRF architecture, designed for offline inference only. The student, on the other hand, is a CNN-LSTM capable of making real-time predictions. Results for various amounts of manually annotated videos demonstrate the superiority of the new CNN-biLSTM-CRF predictor, as well as improved performance from the CNN-LSTM trained using synthetic labels generated for unannotated videos. For both offline and online surgical phase recognition with very few annotated recordings available, this new teacher/student strategy provides a valuable performance improvement by efficiently leveraging the unannotated data.
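The sketch below illustrates the teacher/student idea under simplifying assumptions: an offline bidirectional LSTM teacher generates synthetic phase labels on an unannotated video, and a causal LSTM student then trains on them. It is not the paper's implementation; the CRF layer of the CNN-biLSTM-CRF teacher is omitted, the CNN backbone is replaced by random stand-in features, and all dimensions and hyperparameters are arbitrary.

```python
# Minimal teacher/student sketch for surgical phase recognition, assuming
# precomputed per-frame CNN features. NOT the paper's code: the teacher's CRF
# is omitted and random features stand in for the CNN backbone.
import torch
import torch.nn as nn

NUM_PHASES, FEAT_DIM, HIDDEN = 7, 512, 128


class TemporalModel(nn.Module):
    """LSTM head over frame features; bidirectional=True -> offline teacher,
    bidirectional=False -> causal (real-time capable) student."""
    def __init__(self, bidirectional: bool):
        super().__init__()
        self.lstm = nn.LSTM(FEAT_DIM, HIDDEN, batch_first=True,
                            bidirectional=bidirectional)
        self.head = nn.Linear(HIDDEN * (2 if bidirectional else 1), NUM_PHASES)

    def forward(self, feats):                  # feats: (B, T, FEAT_DIM)
        out, _ = self.lstm(feats)
        return self.head(out)                  # logits: (B, T, NUM_PHASES)


teacher = TemporalModel(bidirectional=True)    # fit beforehand on the small labeled set
student = TemporalModel(bidirectional=False)   # learns from synthetic labels only

# Step 1 (assumed already done): train `teacher` on the ground-truth-annotated videos.
# Step 2: the teacher generates synthetic phase labels for an unannotated video.
unlabeled_feats = torch.randn(1, 300, FEAT_DIM)            # 300 frames, random stand-in
with torch.no_grad():
    pseudo_labels = teacher(unlabeled_feats).argmax(-1)    # (1, 300)

# Step 3: train the real-time student on the synthetic labels.
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
loss = nn.CrossEntropyLoss()(
    student(unlabeled_feats).reshape(-1, NUM_PHASES), pseudo_labels.reshape(-1))
loss.backward()
optimizer.step()
```

The split mirrors the abstract's design choice: the bidirectional teacher may look at future frames, which is why it is reserved for offline pseudo-labeling, while the unidirectional student stays causal and can make real-time predictions during surgery.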