A Transformer-based representation-learning model with unified processing of multimodal input for clinical diagnostics