Objective estimators for perceptual quality
The term "objective evaluation" usually refers to estimators of perceptual quality, where the goal is to predict the mean score of a subjective listening test with an algorithm. That is, we want a computer to listen to a sound sample and "guess" what a human listener would, on average, say about its quality.
It is then clear that subjective evaluation is always the "true" measure of performance and objective evaluation is an approximation thereof. In this sense, subjective evaluation is "better". However, there are many good reasons to use objective instead of subjective evaluation:
- Subjective evaluation is expensive; a test requires that a large number of people listen to sound samples, which is both time-consuming and requires infrastructure. Objective evaluation is performed on a computer, so you can generally test a large number of sound samples in a short time.
- Subjective evaluation is noisy; even with a large number of expert listeners it is generally difficult to get exactly the same result in two consecutive tests. Objective evaluation always gives the same rating for the same input, so testing is consistent and reliable. This is especially important for scientific reproducibility; an independent laboratory can verify and confirm your results, since the objective measure always gives the same output. With subjective evaluation, independent researchers can get different results, and you can never be 100% certain where the difference comes from. Did one of the researchers make an error, or is it just that subjective listeners always give slightly different results?
Some of the most frequently used objective measures include:
- PESQ is probably the most frequently used objective evaluation method and it is defined in ITU-T Recommendation P.862: Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs (2001). It is thus an evaluation method designed explicitly for telecommunications applications. It estimates the mean score of a P.800 ACR test.
PESQ accepts only narrow-band input and is not directly applicable to other bandwidths. The degradation types whose effect PESQ can reliably predict are:
  - Speech input levels to a codec
  - Transmission channel errors
  - Packet loss and packet loss concealment with CELP codecs
  - Bit rates, if a codec has more than one bit-rate mode
  - Environmental noise at the sending side
  - Effect of varying delay in listening-only tests
  - Short-term time warping of the audio signal
  - Long-term time warping of the audio signal
- Perceptual Objective Listening Quality Assessment (POLQA) is the successor of PESQ and is defined in ITU-T Recommendation P.863: Perceptual objective listening quality assessment. It is important to note that for most practical purposes POLQA is better than PESQ: it covers a wider range of applications and degradation types, and its output is more reliable. However, from a scientific perspective it is extremely regrettable that implementations of POLQA are commercial and expensive products, rendering application of POLQA infeasible in normal scientific work. Even if an individual team could afford a POLQA licence, verification of POLQA results by independent research labs is possible only if they also purchase a licence. Despite its limitations, PESQ has therefore remained the scientific standard in objective evaluation of speech.
- Perceptual Evaluation of Audio Quality (PEAQ) evaluates not only speech but also other types of audio. It is therefore less accurate with respect to distortions specific to speech signals, but it generalizes better to other audio, such as music and background noises. The measure is defined in ITU-R Recommendation BS.1387: Method for objective measurements of perceived audio quality (PEAQ).
- The short-term objective intelligibility (STOI) measure focuses on how intelligible a speech sample is. It is thus clearly focused on lower-quality scenarios where speech is so badly corrupted that it is hard to understand what is said. Like all objective measures, it is not a completely reliable estimate of quality, but can be useful in combination with other measures. A good feature of STOI is that an implementation is available.
Other objective performance criteria
There are many cases where performance criteria other than prediction of subjective listening test results are warranted. Most typically, such criteria are applied when there is no user involved, as in speech recognition, or when we want a more detailed characterization of performance than predictors of subjective listening test results can give.
Some examples of such performance criteria include:
- Word error rate (WER) is used in speech recognition to measure the proportion of word errors (substitutions, deletions and insertions) relative to the number of words in the reference transcription of a test signal.
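As a minimal sketch, WER can be computed with a word-level Levenshtein (edit) distance between the reference and the recognizer output; the function name and interface here are illustrative, not from any particular toolkit:

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions)
    divided by the number of words in the reference transcription,
    computed via Levenshtein distance over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table of edit distances between prefixes.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match/substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, recognizing "the cat sat on the mat" as "the cat sat mat" yields two deletions out of six reference words, i.e. a WER of about 0.33.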
- Signal to noise ratio (SNR) measures the ratio between the power of the desirable speech signal and that of the undesirable noise components (which include, for example, background noises, distortions caused by processing algorithms and transmission, as well as undesirable competing speakers).
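When a clean reference signal is available, a common convention (assumed here) is to take the noise as the sample-wise difference between the degraded signal and the reference:

```python
import math

def snr_db(clean, degraded):
    """Signal-to-noise ratio in dB, taking the noise to be the
    sample-wise difference between the degraded and clean signals."""
    noise = [d - c for c, d in zip(clean, degraded)]
    p_signal = sum(c * c for c in clean)
    p_noise = sum(n * n for n in noise)
    return 10.0 * math.log10(p_signal / p_noise)
```

For instance, a degraded signal whose samples each deviate by 10% of the signal amplitude has an SNR of 20 dB.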
- Perceptual signal to noise ratio (pSNR) measures SNR in a perceptually motivated domain. Essentially, distortions are weighted such that they approximately correspond to human perception. This is similar to the above predictors of subjective listening tests, but it also works on small segments of speech. It can therefore be used for detailed analysis of distortions, for example, to determine which parts of the signal contain undesirable distortions.
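A full pSNR requires a perceptual weighting model, but the segment-wise idea can be illustrated with a plain segmental SNR, which returns one SNR value per frame so that distorted regions can be localized. The frame length of 160 samples (20 ms at 8 kHz) is an assumption for illustration:

```python
import math

def segmental_snr_db(clean, degraded, frame_len=160):
    """Per-frame SNRs in dB; the mean over frames is a common
    segmental-SNR summary, while the individual values show
    which parts of the signal are distorted."""
    snrs = []
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        c = clean[start:start + frame_len]
        d = degraded[start:start + frame_len]
        p_sig = sum(x * x for x in c)
        p_noise = sum((y - x) ** 2 for x, y in zip(c, d))
        # Skip silent or undistorted frames where the ratio is undefined.
        if p_sig > 0 and p_noise > 0:
            snrs.append(10.0 * math.log10(p_sig / p_noise))
    return snrs
```

A perceptual variant would apply the same framing after mapping both signals into a perceptually weighted domain.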
- The speech distortion index (SDI) measures the amount by which the desirable speech signal is distorted. In speech enhancement, it is often used in combination with the noise attenuation factor (NAF), which measures the amount by which undesirable noises are removed. Clearly, by doing nothing we obtain a perfect SDI, and by setting the output to zero we obtain a perfect NAF; neither outcome is usually satisfactory. It is therefore usually not clear what the right balance between the two measures is.
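Exact conventions for SDI and NAF vary between authors; the sketch below uses one common energy-ratio formulation and assumes that the enhanced output can be decomposed into its processed speech and processed noise components (which is possible, for example, for linear filters):

```python
def sdi(speech, speech_out):
    """Speech distortion index: energy of the error in the processed
    speech component, relative to the clean speech energy.
    Lower is better; 0 means the speech is untouched."""
    num = sum((s - so) ** 2 for s, so in zip(speech, speech_out))
    den = sum(s * s for s in speech)
    return num / den

def naf(noise_in, noise_out):
    """Noise attenuation factor: input noise power divided by the
    residual noise power after processing. Higher is better."""
    p_in = sum(n * n for n in noise_in)
    p_out = sum(n * n for n in noise_out)
    return p_in / p_out
```

Note how doing nothing gives the ideal SDI of 0 but a NAF of only 1, illustrating the trade-off described above.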
- Unweighted and weighted average recall (UAR, WAR) are often used to measure performance in speech classification tasks, such as classifying a speech segment into one of a finite number of possible emotions. UAR is defined as the mean of class-specific recalls (the proportion of samples of a class recognized correctly), while WAR is the overall proportion of samples recognized correctly across all classes (sometimes also referred to as accuracy). UAR is often preferred over WAR in experiments where there is notable class imbalance in the test data and where it is important that the system is also sensitive to the less frequent classes.
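The two definitions above can be sketched directly from paired lists of true and predicted labels:

```python
from collections import defaultdict

def uar_war(true_labels, predicted_labels):
    """Unweighted average recall (mean of per-class recalls) and
    weighted average recall (overall accuracy)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(true_labels, predicted_labels):
        total[t] += 1
        if t == p:
            correct[t] += 1
    # UAR: every class contributes equally, regardless of its size.
    recalls = [correct[c] / total[c] for c in total]
    uar = sum(recalls) / len(recalls)
    # WAR: every sample contributes equally, so large classes dominate.
    war = sum(correct.values()) / len(true_labels)
    return uar, war
```

With imbalanced data the two diverge: a classifier that always predicts the majority class can have a high WAR but a poor UAR.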
- Receiver operating characteristic (ROC) curves and metrics derived from them, such as the area under the curve (AUC) or the equal error rate (EER), are often used to report the performance of systems that have some type of detection threshold that can be varied, where performance at each threshold value is measured in terms of false acceptance and false rejection rates. For instance, the performance of speaker verification systems is often evaluated using such metrics.
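As a minimal sketch of the threshold sweep behind these metrics, the EER can be estimated from detection scores by finding the threshold where the false acceptance and false rejection rates are closest to equal:

```python
def eer(target_scores, nontarget_scores):
    """Equal error rate: sweep a decision threshold over all observed
    scores and return the operating point where false acceptance
    and false rejection rates are (approximately) equal."""
    best = None
    for thr in sorted(target_scores + nontarget_scores):
        # False acceptance: non-target trials scoring at/above threshold.
        far = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        # False rejection: target trials scoring below threshold.
        frr = sum(s < thr for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2)
    return best[1]
```

A perfectly separable system gives an EER of 0, while overlapping score distributions push it toward 0.5.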