Appendices A Masking distribution

Aug-15-2025, 03:46:36 GMT–Neural Information Processing Systems

For a 15-second audio sample, the average mask length is 14.7 time-steps, corresponding to 299 ms.

Table 6 summarizes the fine-tuning hyper-parameter settings used for the different labeled data setups.

In this section we study the most common errors our models make when fine-tuned on different amounts of labeled data (Table 11). The LV-60k model achieves WER 38.3 on dev-clean and adding a Transformer language model enables [...] The ten minute models without lexicon and language model tend to spell words phonetically and omit repeated letters, e.g., will [...] At ten hours, top errors include articles, e.g., a, the which [...] The "from scratch" 960 hour model has a similar word error rate as the 100 hour pre-trained model. In brackets is the total number of occurrences of each error.

The setup for the baseline model is described in 5.4. Neither led to meaningful improvements.
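The masking figures quoted above (14.7 time-steps, 299 ms) can be sanity-checked with a small simulation. The sketch below is illustrative only and rests on assumptions that this excerpt does not state: a fraction p = 0.065 of time-steps is sampled without replacement as span starts, each start masks the following M = 10 steps, overlapping spans merge, and the encoder emits one frame every 20 ms with a 25 ms receptive field (so a 15-second sample has roughly 749 frames). All function names are hypothetical.

```python
import random


def sample_mask(num_steps, p=0.065, span=10, rng=random):
    """Sample a boolean mask. p and span are ASSUMED values, not given in this excerpt."""
    num_starts = int(p * num_steps)
    # Pick start indices without replacement, then mask `span` steps from each start.
    starts = rng.sample(range(num_steps), num_starts)
    mask = [False] * num_steps
    for s in starts:
        for t in range(s, min(s + span, num_steps)):
            mask[t] = True
    return mask


def span_lengths(mask):
    """Lengths of maximal runs of masked steps (overlapping spans count as one run)."""
    lengths, run = [], 0
    for masked in mask:
        if masked:
            run += 1
        elif run:
            lengths.append(run)
            run = 0
    if run:
        lengths.append(run)
    return lengths


def steps_to_ms(steps, stride_ms=20.0, receptive_field_ms=25.0):
    """Audio covered by `steps` consecutive frames (ASSUMED stride and receptive field)."""
    return (steps - 1) * stride_ms + receptive_field_ms


def mean_span_length(num_steps=749, trials=500, seed=0):
    """Pool merged span lengths over many sampled masks and return their mean."""
    rng = random.Random(seed)
    pooled = []
    for _ in range(trials):
        pooled.extend(span_lengths(sample_mask(num_steps, rng=rng)))
    return sum(pooled) / len(pooled)


if __name__ == "__main__":
    print(f"simulated mean span length: {mean_span_length():.1f} time-steps")
    print(f"14.7 time-steps cover about {steps_to_ms(14.7):.0f} ms")
```

Under these assumptions, 14.7 frames span (14.7 - 1) × 20 + 25 = 299 ms, which agrees with the figure quoted in the text; the simulated mean span length should land close to 14.7 as well, though the exact value depends on the assumed parameters and on how partial spans at the end of the sample are handled.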
