Paralinguistic speech processing (PSP) refers to the analysis of speech signals with the aim of extracting information other than the linguistic content of speech (hence paralinguistic = alongside linguistic content). Speaker recognition and verification are traditionally considered a separate problem and do not fall within the scope of paralinguistic tasks.
In addition to information that is not directly related to intended communicative goals, speech also contains paralinguistic characteristics related to communication. This is because speech has co-evolved with the development of other social skills in humans over thousands of years, and therefore speech (and gestures) can play different types of social coordinative roles beyond the literal linguistic message transmitted. For instance, prosody (or word choices) can reflect different social roles such as submissiveness or authority in different interactions. Attitudes and emotions showing up in speech can also be considered communicative signals that facilitate social interaction and cohesion, not just speaker-internal states that inadvertently "leak out" for others to perceive. By demonstrating anger or happiness not just through visual (facial) gestures but also through voice, important information regarding social dynamics can be transmitted without requiring constant visual contact between the interlocutors.
The basic aim of PSP is to use computational means to understand and characterize the ways in which different non-linguistic factors shape the speech signal, and to build automatic systems for analyzing and detecting paralinguistic factors from real speech captured in various settings.
Examples of PSP tasks include:
- Emotion detection
- Personality classification
- Sleepiness detection
- Analysis of cognitive or physical load
- Health-related analyses (cold, snoring, neurodegenerative diseases, etc.)
- Speech addressee analysis (e.g., adult vs. child-directed speech)
- Age and gender recognition
- Sincerity analysis
Basic problem formulation and characteristics
The basic goal of paralinguistic analysis is to extract information of interest while ignoring the signal variability introduced by linguistic content, speaker identity, and other nuisance factors such as background noise and transmission channel characteristics. However, for some tasks, it may also be useful to analyse the language content of speech in order to infer information regarding the phenomena of interest.
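As one illustration of suppressing speaker-related variability, the following minimal sketch z-normalizes a scalar feature separately for each speaker. Per-speaker normalization is only one possible approach and is not prescribed by the text above; the function name `normalize_per_speaker`, the F0 values, and the speaker identifiers are hypothetical and chosen purely for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize_per_speaker(values, speakers):
    """Z-normalize a scalar feature separately within each speaker."""
    # Collect the feature values belonging to each speaker.
    grouped = defaultdict(list)
    for value, speaker in zip(values, speakers):
        grouped[speaker].append(value)

    # Per-speaker mean and (population) standard deviation.
    stats = {s: (mean(v), pstdev(v)) for s, v in grouped.items()}

    normalized = []
    for value, speaker in zip(values, speakers):
        mu, sigma = stats[speaker]
        # Guard against a feature that is constant within one speaker.
        normalized.append((value - mu) / sigma if sigma > 0 else 0.0)
    return normalized


# Hypothetical mean F0 values (Hz) per utterance for two speakers.
f0 = [110.0, 130.0, 120.0, 210.0, 250.0, 230.0]
speaker_ids = ["spk1", "spk1", "spk1", "spk2", "spk2", "spk2"]

f0_norm = normalize_per_speaker(f0, speaker_ids)
for spk, raw, norm in zip(speaker_ids, f0, f0_norm):
    print(f"{spk}: raw={raw:6.1f} Hz  normalized={norm:+.3f}")
```

After normalization, the two speakers' values lie on a comparable scale even though their absolute pitch ranges differ, so a downstream model sees relative changes within a speaker rather than differences between speakers. Whether this is desirable depends on the task: for age or gender recognition, for example, the absolute values may themselves carry the information of interest.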
A relatively common property of PSP is that access to high-quality labeled data is limited. The data collection itself is often challenging and may involve important ethical considerations, such as when collecting data from intoxicated speakers or speakers with rare diseases.
Obtaining good ground-truth labels for the speech data can also be difficult. For instance, judging the underlying emotional states of speakers is difficult, whereas emotional speech performed by professional actors may not properly reflect the variability of emotional speech in real-world communicative scenarios. Assessing the severity of many diseases is also based on indirect diagnostics instead of some type of oracle knowledge on a universally standardized scale. Whenever humans are used for data labelling (e.g., assessing emotions), there is a certain degree of inter-annotator inconsistency due to differing opinions and general variability in human performance. This is the case even when domain experts are used for the task. Naturally, the more difficult the task or the more ambiguous the phenomenon, the more noise there will be in the human-based ground-truth labels.
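One common way to quantify inter-annotator consistency for categorical labels is Cohen's kappa, which corrects the raw agreement rate between two annotators for the agreement expected by chance. The sketch below is a minimal illustration under assumptions not made in the text: there are exactly two annotators, and the emotion labels are invented for the example.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement, estimated from each annotator's label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    p_chance = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if p_chance == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical emotion labels from two annotators for eight utterances.
annotator_1 = ["angry", "happy", "neutral", "neutral",
               "happy", "angry", "neutral", "happy"]
annotator_2 = ["angry", "happy", "neutral", "happy",
               "happy", "neutral", "neutral", "happy"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.3f}")  # prints "Cohen's kappa: 0.610"
```

Here the annotators agree on six of eight utterances (75 %), but because some of that agreement would be expected by chance, the kappa value is lower than the raw agreement rate. A kappa of 1 indicates perfect agreement and 0 indicates agreement at chance level.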
Finally, many types of interesting PSP data cannot be freely distributed to the research community due to data ownership and human participant privacy protection considerations. As a clear example, speech with metadata related to factors such as the health or IQ of the speakers is highly sensitive in nature, and not all speakers consent to open distribution of their identifiable voice together with such private data about themselves. The data ownership considerations inherently limit the pooling of different speech corpora in order to build more comprehensive databases of speech related to different phenomena of interest, and generally slow down replicable open science. On the other hand, it is of utmost importance to respect the privacy of human participants in PSP (or any other) research, not only due to ethical considerations but also because the entire field depends on access to data from voluntary human participants.
Computational Paralinguistic Challenge (http://www.compare.openaudio.eu/)