Reviews: Analyzing Hidden Representations in End-to-End Automatic Speech Recognition Systems
–Neural Information Processing Systems
The authors analyze CTC-trained acoustic models to determine how information related to phonetic categories is preserved in CTC-based models that directly output graphemes. The work follows a long line of research that has analyzed neural network representations to determine how they model phonemic representations, although to the best of my knowledge this has not been done previously for CTC-based end-to-end architectures. The results and analysis presented by the authors are interesting, but I have some concerns about the conclusions the authors draw that I would like clarified. Please see my detailed comments below.

In the paper, the authors conclude that (Lines 159-164) "... after the 5th recurrent layer accuracy goes down again. One possible explanation to this may be that higher layers in the model are more sensitive to long distance information that is needed for the speech recognition task, whereas the local information which is needed for classifying phones is better captured in lower layers."
Oct-8-2024, 06:42:30 GMT