
# Linear prediction

## Definition

Speech is a continuous signal, which means that consecutive samples of the signal are correlated (see figure on the right). In particular, if we know a previous sample $x_{n-1}$, we can make a prediction of the current sample, $$\hat x_n = x_{n-1},$$ such that $$\hat x_n \approx x_n.$$ By using more previous samples we have more information, which should help us make a better prediction. Specifically, we can define a predictor which uses $M$ previous samples to predict the current sample $x_n$ as

$\hat x_n = - \sum_{k=1}^M a_k x_{n-k}.$

This is a linear predictor because it takes a linearly weighted sum of past components to predict the current one.

The error of the prediction, also known as the prediction residual, is

$e_n = x_n - \hat x_n = x_n + \sum_{k=1}^M a_k x_{n-k} = \sum_{k=0}^M a_k x_{n-k},$

where $a_0=1$. This explains why the definition of $$\hat x_n$$ included a minus sign; when we calculate the residual, the double negative disappears and we can collate everything into one summation.
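As a small numerical sketch of this definition (NumPy assumed; the signal and coefficients are illustrative, not from the text), the residual of a first-order predictor $\hat x_n = x_{n-1}$ can be computed as a convolution with $a = [1, -1]$:

```python
import numpy as np

# Illustrative signal: a slowly varying sinusoid, so that
# consecutive samples are strongly correlated.
x = np.sin(2 * np.pi * 0.01 * np.arange(200))

# First-order predictor: predict each sample by the previous one,
# i.e. a = [1, -1] so that e_n = x_n - x_{n-1}.
a = np.array([1.0, -1.0])
e = np.convolve(x, a)[:len(x)]   # e_n = sum_k a_k x_{n-k}

# The residual has much less energy than the signal itself,
# because the prediction removes the correlated part.
print(np.sum(e**2) / np.sum(x**2))
```

Because the signal is correlated, the residual energy is only a small fraction of the signal energy.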

*Figure: A short segment of speech. Notice how consecutive samples are mostly near each other, which means that consecutive samples are correlated.*

## Vector notation

Using vector notation, we can make the expressions more compact

$e = Xa,$

where

$e = \begin{bmatrix}e_0\\e_1\\\vdots\\e_{N-1}\end{bmatrix},\qquad X = \begin{bmatrix}x_0 & x_{-1} & \dots & x_{-M} \\ x_1 & x_0 & \dots & x_{1-M} \\ \vdots & \vdots & & \vdots \\ x_{N-1} & x_{N-2} & \dots & x_{N-1-M} \end{bmatrix}, \qquad a = \begin{bmatrix}a_0\\a_1\\\vdots\\a_{M}\end{bmatrix}.$

Here we calculated the residual for a length-$N$ frame of the signal.
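The vector form can be checked numerically with a minimal sketch (NumPy assumed; the toy signal is illustrative). Samples before the frame, $x_{-1}, \dots, x_{-M}$, are here taken as zeros:

```python
import numpy as np

def data_matrix(x, M):
    """Build the N-by-(M+1) matrix X with entries X[n, k] = x[n - k].

    Samples with negative index (before the frame) are taken as zero.
    """
    N = len(x)
    xp = np.concatenate([np.zeros(M), x])  # zero-pad the "past"
    return np.array([[xp[n + M - k] for k in range(M + 1)]
                     for n in range(N)])

x = np.array([1.0, 2.0, 3.0, 4.0])
a = np.array([1.0, -1.0])          # first-order predictor, a_0 = 1
X = data_matrix(x, len(a) - 1)
e = X @ a                          # identical to the sample-wise residual
print(e)                           # -> [1. 1. 1. 1.]
```

The matrix-vector product reproduces exactly the sample-wise residual $e_n = \sum_{k=0}^M a_k x_{n-k}$.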


## Parameter estimation

Vector $a$ holds the unknown coefficients of the predictor. To find the best possible predictor, we can minimize the mean-square error; this is known as the minimum mean-square error (MMSE) criterion. The square error is the 2-norm of the residual, $$\|e\|^2 = e^T e$$. The mean of that error is defined as the expectation

$E\left[\|e\|^2\right] = E\left[a^T X^T X a\right] = a^T E\left[X^T X\right] a = a^T R_x a,$

where $$R_x = E\left[X^T X\right]$$ and $$E\left[\cdot\right]$$ is the expectation operator. Note that, as shown in the autocorrelation section, the matrix $R_x$ can usually be assumed to have a symmetric Toeplitz structure.

If we directly minimized the mean-square error $$E\left[\|e\|^2\right],$$ we would clearly obtain the trivial solution $a=0$, which is not particularly useful. That solution, however, contradicts the requirement that the first coefficient is unity, $a_0=1$. In vector notation we can equivalently write

$a_0 - 1 = u^T a - 1 = 0, \qquad\text{where}\quad u=\begin{bmatrix}1\\0\\0\\\vdots\\0\end{bmatrix}.$

The standard method for quadratic minimization with constraints is to use a Lagrange multiplier, λ, such that the objective function is

$\eta(a,\lambda) = a^T R_x a - 2\lambda\left(a^T u - 1\right).$

This function can be interpreted heuristically as follows: λ is a free parameter. Since our objective is to minimize $$a^T R_x a$$, if $$a^T u - 1$$ were non-zero, then the term $-2\lambda\left(a^T u - 1\right)$ could make the objective function arbitrarily large or small. For the minimum to be well-defined for any value of λ, the constraint must therefore be zero.

The objective function is then minimized by setting its derivative with respect to $a$ to zero:

$0 = \frac{\partial}{\partial a}\eta(a,\lambda) = \frac{\partial}{\partial a} \left[a^T R_x a - 2\lambda\left(a^T u - 1\right)\right] = 2 R_x a - 2 \lambda u.$

It follows that the optimal predictor coefficients are found by solving

$R_x a = \lambda u.$

Since $R_x$ is symmetric and Toeplitz, the above system of equations can be efficiently solved using the Levinson-Durbin algorithm, with algorithmic complexity $O(M^2)$. Note, however, that a direct solution gives $$a':=\frac1\lambda a = R_x^{-1}u,$$ that is, instead of $a$ we obtain $a$ scaled by λ. Since we know that $a_0=1$, we can recover $a$ by $$a=\lambda a' = \frac{a'}{a'_0}.$$
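As a sketch of this procedure (NumPy assumed; the autocorrelation values are illustrative), we can solve $a' = R_x^{-1} u$ with a generic solver and then renormalize so that $a_0 = 1$:

```python
import numpy as np

# Illustrative autocorrelation sequence r_0, ..., r_M (in practice these
# would be estimated from a windowed speech frame).
r = np.array([2.0, 1.0, 0.6, 0.2])
M = len(r) - 1

# Symmetric Toeplitz autocorrelation matrix R_x with entries r_|i-j|.
Rx = np.array([[r[abs(i - j)] for j in range(M + 1)]
               for i in range(M + 1)])

u = np.zeros(M + 1)
u[0] = 1.0

a_scaled = np.linalg.solve(Rx, u)   # a' = R_x^{-1} u, i.e. a scaled by 1/lambda
a = a_scaled / a_scaled[0]          # renormalize so that a_0 = 1

# a now satisfies R_x a = lambda u for some scalar lambda.
lam = (Rx @ a)[0]
print(a, lam)
```

The generic solver costs $O(M^3)$; in practice one would exploit the Toeplitz structure with the Levinson-Durbin recursion (or, e.g., `scipy.linalg.solve_toeplitz`).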

## Spectral properties

Linear prediction is usually used to predict the current sample of a time-domain signal $x_n$. The usefulness of linear prediction, however, becomes evident by studying its Fourier spectrum. Specifically, since $e=Xa$, the corresponding Z-domain representation is

$E(z) = X(z)A(z)\qquad\Rightarrow\qquad X(z)=\frac{E(z)}{A(z)},$

where $E(z)$, $X(z)$, and $A(z)$ are the Z-transforms of $e_n$, $x_n$ and $a_n$, respectively. The residual $E(z)$ is white noise, whereby the inverse $A(z)^{-1}$ must follow the shape of $X(z)$.

In other words, the linear predictor models the macro-shape, or envelope, of the spectrum.
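This envelope behaviour can be inspected numerically. A minimal sketch (NumPy assumed; the coefficients are illustrative) evaluates $A(z)$ on the unit circle via a zero-padded FFT of the coefficient vector, so that the model spectrum is $1/|A|$:

```python
import numpy as np

# Illustrative predictor coefficients with a_0 = 1.
a = np.array([1.0, -1.2, 0.5])

# Evaluate A(z) on the unit circle via a zero-padded FFT;
# the model spectrum (envelope) is then 1 / |A|.
nfft = 512
A = np.fft.rfft(a, nfft)
envelope = 1.0 / np.abs(A)

# The envelope peaks where |A| is smallest, i.e. near the
# resonances implied by the roots of A(z).
peak_bin = int(np.argmax(envelope))
print(peak_bin / nfft)   # normalized peak frequency
```

For speech, those peaks of $1/|A|$ correspond to the formants, which is why the predictor captures the spectral envelope.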


## Physiological interpretation and model order

Linear prediction has a surprising connection with physical modelling of speech production. Namely, a linear predictive model is equivalent with a tube model of the vocal tract (see figure on the right). A useful consequence is that from the acoustic properties of such a tube model, we can derive a relationship between the physical length of the vocal tract $L$ and the number of parameters $M$ of the corresponding linear predictor as

$M = \frac{2 f_s L}{c},$

where $f_s$ is the sampling frequency and $c$ is the speed of sound. With an air temperature of 35 °C, the speed of sound is $c=350\,\text{m/s}$. The mean lengths of vocal tracts for females and males are approximately 14.1 and 16.9 cm, respectively. We can then choose to overestimate $L=0.17\,\text{m}$. At a sampling frequency of 16 kHz, this gives $$M\approx 16$$. The linear predictor will also catch features of the glottal oscillation and lip radiation, such that a useful approximation is $$M \approx {\text{round}}\left(1.25\frac{f_s}{1000}\right)$$. For different sampling rates we then get the number of parameters $M$ as

| $f_s$ | $M$ |
| --- | --- |
| 8 kHz | 10 |
| 12.8 kHz | 16 |
| 16 kHz | 20 |
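The rule of thumb above is a one-liner; a minimal sketch (plain Python, function name illustrative):

```python
def lp_order(fs):
    """Rule-of-thumb linear predictor order, M = round(1.25 * fs / 1000)."""
    return round(1.25 * fs / 1000)

# Reproduces the table above.
for fs in (8000, 12800, 16000):
    print(fs, "Hz ->", lp_order(fs))
```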

Observe, however, that even if a tube model is equivalent with a linear predictor, the relationship is non-linear and highly sensitive to small errors. Moreover, when estimating linear predictive models from speech, in addition to features of the vocal tract, we will also capture features of glottal oscillation and lip radiation. It is therefore very difficult to estimate meaningful tube-model parameters from speech. A related sub-field of speech analysis is glottal inverse filtering, which attempts to estimate the glottal source from the acoustic signal. A necessary step in such inverse filtering is to estimate the acoustic effect of the vocal tract, that is, it is necessary to estimate the tube model.

*Figure: A tube model of the vocal tract consisting of constant-radius tube-segments.*

## Uses in speech coding

Linear prediction has been highly influential especially in early speech coders. In fact, the dominant speech coding method is code-excited linear prediction (CELP), which is based on linear prediction.


## Alternative representations (advanced topic)

Suppose the scalars $a_{M,k}$ are the coefficients of an $M$th order linear predictor. Coefficients of consecutive orders $M$ and $M+1$ are then related as

$a_{M+1,k} = a_{M,k} + \gamma_{M+1} a_{M,M+1-k},$

where the real-valued scalar $$\gamma_{M}\in(-1,+1)$$ is the $M$th reflection coefficient. This formulation is the basis for the Levinson-Durbin algorithm, which can be used to solve the linear predictive coefficients. In a physical sense, reflection coefficients describe the amount of the acoustic wave which is reflected back at each junction of the tube model. In other words, there is a relationship between the cross-sectional areas $S_k$ of each tube segment and the reflection coefficients as

$\gamma_k = \frac{S_k - S_{k+1}}{S_k + S_{k+1}}.$
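The order recursion above is the core of the Levinson-Durbin algorithm. A minimal sketch (NumPy assumed; not optimized) solves $R_x a = \lambda u$ from the autocorrelations $r_0, \dots, r_M$:

```python
import numpy as np

def levinson(r, M):
    """Solve R_x a = lambda u with a_0 = 1, given autocorrelations r[0..M],
    using the Levinson-Durbin order recursion."""
    a = np.zeros(M + 1)
    a[0] = 1.0
    err = r[0]                      # prediction error energy
    for m in range(1, M + 1):
        # Reflection coefficient gamma_m from the current coefficients.
        acc = r[m] + np.dot(a[1:m], r[m-1:0:-1])
        gamma = -acc / err
        # Order update: a_{m,k} = a_{m-1,k} + gamma_m a_{m-1,m-k}.
        a[1:m+1] = a[1:m+1] + gamma * a[m-1::-1]
        err *= 1.0 - gamma**2
    return a

# Illustrative autocorrelation sequence.
r = np.array([2.0, 1.0, 0.6, 0.2])
a = levinson(r, 3)
print(a)
```

Each order update costs $O(M)$, giving the overall $O(M^2)$ complexity mentioned above.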

Furthermore, the logarithmic ratios of cross-sectional areas, also known as the log-area ratios, are defined as

$A_k = \log\frac{S_k}{S_{k+1}} = \log\frac{1+\gamma_k}{1-\gamma_k}.$

This form has been used in coding of linear predictive models, but is today mostly of historical interest.
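The two representations are straightforward to convert between; a minimal sketch (NumPy assumed; the areas are illustrative). Note that with the convention $\gamma_k = (S_k - S_{k+1})/(S_k + S_{k+1})$, we have $(1+\gamma_k)/(1-\gamma_k) = S_k/S_{k+1}$:

```python
import numpy as np

# Illustrative cross-sectional areas of consecutive tube segments.
S = np.array([2.0, 1.5, 1.0, 1.2])

# Reflection coefficient at each junction, gamma_k in (-1, +1).
gamma = (S[:-1] - S[1:]) / (S[:-1] + S[1:])

# Log-area ratios computed directly from the areas...
lar_from_areas = np.log(S[:-1] / S[1:])
# ...and from the reflection coefficients; the two agree, since
# (1 + gamma_k) / (1 - gamma_k) = S_k / S_{k+1}.
lar_from_gamma = np.log((1 + gamma) / (1 - gamma))

print(np.allclose(lar_from_areas, lar_from_gamma))  # True
```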
