The RNN-transducer has many similarities with CTC: their main goal is to solve the forced segmentation alignment problem in speech recognition; they both introduce a “blank” label; they both calculate the probability of all possible paths and aggregate them to get the label sequence. However, their path generation processes and their path probability calculation methods are completely different. This gives rise to the advantages of the RNN-transducer over CTC.
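To make the role of the “blank” label concrete, the following is a minimal sketch of the CTC collapsing rule only (merge repeated symbols, then drop blanks). It shows why several frame-level paths map to the same label sequence, which is what makes aggregating path probabilities necessary. The function name `ctc_collapse` and the use of `-` as the blank symbol are assumptions made for illustration; the RNN-transducer generates its paths differently and is not covered by this sketch.

```python
def ctc_collapse(path, blank="-"):
    """Map a frame-level CTC path to its label sequence.

    Repeated symbols are merged first, then blanks are removed.
    The blank symbol "-" is an illustrative choice.
    """
    labels = []
    previous = None
    for symbol in path:
        # Keep a symbol only when it differs from the previous frame
        # and is not the blank label.
        if symbol != previous and symbol != blank:
            labels.append(symbol)
        previous = symbol
    return "".join(labels)


# Two different paths that collapse to the same label sequence.
path_a = "hh-e-ll-lo"
path_b = "-h-ee-l-l-oo"
print(ctc_collapse(path_a), ctc_collapse(path_b))
```

Note that a blank between the two `l` frames is what allows a doubled letter to survive the merging step.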



Types of errors made by speech recognizers

Though ASR research has come a long way, today's systems are far from perfect. Speech recognizers are brittle and make errors due to various causes. Most errors made by ASR systems fall into one of the following categories:

  • Out-of-vocabulary (OOV) errors: Current state-of-the-art speech recognizers have closed vocabularies. This means that they are incapable of recognizing words outside their training vocabulary. When an out-of-vocabulary word occurs in an input utterance, the system errs to a similar word in its vocabulary, which can also cause errors in the neighboring words. Special techniques for handling OOV words have been developed for HMM-GMM and neural ASR systems (see, e.g., Zhang, 2019).
  • Homophone substitution: These errors can occur if more than one lexical entry has the same pronunciation (phone sequence), i.e., the entries are homophones. During decoding, homophones may be confused with one another, causing errors. In general, a well-functioning language model should disambiguate homophones based on the context.
  • Language model bias: Because of an undue bias towards the language model (effected by a high relative weight on the language model), the decoder may be forced to reject the true hypothesis in favor of a spurious candidate with high language model probability. These errors may occur along with analogous acoustic model bias.
  • Multiple acoustic problems: This is a broad category of errors comprising those due to bad pronunciation entries, disfluency or mispronunciation by the speaker, or confusions made by the acoustic model (possibly due to acoustic noise, a mismatch between training and usage data, etc.).
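The language model bias described in the list above can be illustrated with a small sketch. It assumes the common log-linear combination, in which a hypothesis score is the acoustic log-probability plus a weight times the language model log-probability. The function names, the candidate texts, and all score values are hypothetical and chosen only to show the effect; they do not come from any real system.

```python
def combined_score(acoustic_logprob, lm_logprob, lm_weight):
    """Log-linear combination of acoustic and language model scores (assumed form)."""
    return acoustic_logprob + lm_weight * lm_logprob


def best_hypothesis(candidates, lm_weight):
    """Return the text of the candidate with the highest combined score."""
    best = max(
        candidates,
        key=lambda c: combined_score(c["am"], c["lm"], lm_weight),
    )
    return best["text"]


# Hypothetical candidates: the first is what was actually said (better
# acoustic score), the second is a spurious candidate that the language
# model finds much more probable.
candidates = [
    {"text": "wreck a nice beach", "am": -10.0, "lm": -9.0},
    {"text": "recognize speech", "am": -12.0, "lm": -4.0},
]

print(best_hypothesis(candidates, lm_weight=0.2))  # moderate language model weight
print(best_hypothesis(candidates, lm_weight=1.0))  # high language model weight
```

With the moderate weight the acoustically better hypothesis wins; with the high weight the decoder switches to the candidate favored by the language model.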

6. Challenges of ASR

Recent advances have brought automatic speech recognition accuracy close to human performance in many practical tasks. However, there are still challenges:

  • Out-of-vocabulary words are difficult to recognize correctly.
  • Varying environmental noises impair recognition accuracy.
  • Overlapping speech is problematic for ASR systems.
  • Recognizing children's speech and the speech of people with speech production disabilities is suboptimal with regular training data.
  • DNN-based models usually require a lot of data for training, on the order of thousands of hours. End-to-end models may need up to 100,000 h of speech to reach high performance.
  • Uncertainty self-awareness is limited: typical ASR systems always output the most likely word sequence instead of reporting if some part of the input was incomprehensible or highly uncertain. 

7. Evaluation

The performance of an ASR system is measured by comparing the hypothesized transcriptions with reference transcriptions. Word error rate (WER) is the most widely used metric. The two word sequences are first aligned using a dynamic programming-based string alignment algorithm. After the alignment, the numbers of deletions (D), substitutions (S), and insertions (I) are determined. Deletions, substitutions, and insertions are all counted as errors, and the WER is the ratio of the number of errors to the number of words (N) in the reference: WER = (S + D + I) / N.
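The WER computation described above can be sketched as follows. This is a minimal illustration that assumes whitespace tokenization and a plain Levenshtein alignment with unit costs; the function names `align_counts` and `wer` are chosen for this example, and real evaluation tools typically also normalize text (case, punctuation) before scoring.

```python
def align_counts(reference, hypothesis):
    """Return (substitutions, deletions, insertions) of a minimum-error alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # table[i][j] = (total_errors, S, D, I) for ref[:i] aligned with hyp[:j]
    table = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        table[i][0] = (i, 0, i, 0)  # empty hypothesis: only deletions
    for j in range(1, len(hyp) + 1):
        table[0][j] = (j, 0, 0, j)  # empty reference: only insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            e, s, d, ins = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (e, s, d, ins)          # correct word, no error
            else:
                diagonal = (e + 1, s + 1, d, ins)  # substitution
            e, s, d, ins = table[i - 1][j]
            deletion = (e + 1, s, d + 1, ins)
            e, s, d, ins = table[i][j - 1]
            insertion = (e + 1, s, d, ins + 1)
            # Tuples compare on the total error count first.
            table[i][j] = min(diagonal, deletion, insertion)
    _, s, d, ins = table[-1][-1]
    return s, d, ins


def wer(reference, hypothesis):
    """Word error rate: (S + D + I) / N, with N the reference length."""
    n = len(reference.split())
    if n == 0:
        raise ValueError("reference must contain at least one word")
    s, d, ins = align_counts(reference, hypothesis)
    return (s + d + ins) / n


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors, N = 6.
print(wer("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions are counted as errors but N is the reference length, the WER can exceed 1 when the hypothesis is much longer than the reference.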


Sentence error rate (SER) is also sometimes used to evaluate the performance of ASR systems. SER is the percentage of sentences with at least one error.
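A minimal sketch of the SER computation follows. It assumes that the references and hypotheses are given as parallel lists of sentences and that a sentence counts as erroneous when its whitespace-separated words differ in any way from the reference; the function name `sentence_error_rate` is chosen for this example.

```python
def sentence_error_rate(references, hypotheses):
    """Percentage of sentences whose hypothesis differs from the reference."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    if not references:
        raise ValueError("at least one sentence is required")
    # A sentence is wrong if its word sequence is not identical to the reference.
    wrong = sum(
        1 for ref, hyp in zip(references, hypotheses) if ref.split() != hyp.split()
    )
    return 100.0 * wrong / len(references)


references = ["hello world", "good morning", "speech recognition", "thank you"]
hypotheses = ["hello world", "good mourning", "speech recognition", "thank you"]
print(sentence_error_rate(references, hypotheses))  # one of four sentences is wrong
```

Unlike WER, SER does not distinguish a sentence with one wrong word from a sentence that is entirely wrong.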


Zhang, X. (2019). Strategies for Handling Out-of-vocabulary Words in Automatic Speech Recognition. Doctoral dissertation, The Johns Hopkins University, Baltimore, Maryland.