Appendices

A Masking distribution

Neural Information Processing Systems 

For a 15 sec long audio sample, the average mask length is 14.7 time-steps, corresponding to 299 ms.

Table 6 summarizes the fine-tuning hyper-parameter settings used for the different labeled data setups.

In this section we study the most common errors our models make when fine-tuned on different amounts of labeled data (Table 11). LV-60k model achieves WER 38.3 on dev-clean and adding a Transformer language model enables ... The ten minute models without lexicon and language model tend to spell words phonetically and omit repeated letters, e.g., will ... At ten hours, top errors include articles, e.g., a, the which ... The "from scratch" 960 hour model has a word error rate similar to that of the 100 hour pre-trained model ... In brackets is the total number of occurrences of each error.

The setup for the baseline model is described in 5.4. Both did not lead to meaningful improvements.
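The conversion from 14.7 time-steps to 299 ms quoted at the start of this section can be checked with a small calculation. The sketch below assumes an encoder stride of about 20 ms between time-steps and a receptive field of 25 ms per time-step; neither value is stated in this section, and the function name `span_duration_ms` is hypothetical.

```python
def span_duration_ms(num_steps, stride_ms=20.0, receptive_field_ms=25.0):
    """Audio duration covered by `num_steps` consecutive encoder time-steps.

    Assumption (not stated in this section): consecutive time-steps are
    `stride_ms` apart and each one sees `receptive_field_ms` of audio.
    The first step contributes its full receptive field; every further
    step extends the span by one stride.
    """
    return (num_steps - 1) * stride_ms + receptive_field_ms


# Average mask length reported above, in time-steps.
avg_mask_steps = 14.7
avg_mask_ms = span_duration_ms(avg_mask_steps)
print(round(avg_mask_ms))  # 299
```

Under these assumptions the overlap between neighbouring receptive fields explains why the result is 299 ms rather than 14.7 × 25 ms.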
