Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures