Towards Word-Level End-to-End Neural Speaker Diarization with Auxiliary Network