This is an excellent question to start off an automatic speech recognition (ASR) interview. I would slightly rephrase the question as "Why is speech recognition hard?" An ASR is just like any other machine learning (ML) problem, where the objective is to classify a sound wave into one of the basic units of speech (also called a "class" in ML terminology), such as a word. The problem with human speech is the huge amount of variation that occurs while pronouncing a word. For example, below are two recordings of the word "Yes" spoken by the same person (wave source: AN4 dataset ).
Dec-28-2016, 20:55:02 GMT