One model to rule them all? Towards End-to-End Joint Speaker Diarization and Speech Recognition