
# Linear regression

## Problem definition

In speech processing and elsewhere, a frequently appearing task is to make a prediction of an unknown vector y from available observation vectors x. Specifically, we want an estimate $$\hat y = f(x)$$ such that $$\hat y \approx y.$$ In particular, we will focus on linear estimates, where $$\hat y = f(x) := A^T x$$ and A is a matrix of parameters. The figure on the right illustrates the scalar case, where the input sample pairs (x, y) are modelled by $$\hat y \approx ax.$$
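As a small sketch of the scalar case $$\hat y \approx ax$$, the slope a can be estimated in closed form; all numbers below are made-up toy data, not from the figure.

```python
import numpy as np

# Hypothetical (x, y) pairs following y ≈ ax with slope 2.0 plus mild noise.
rng = np.random.default_rng(3)
x = rng.uniform(-1.0, 1.0, 200)
y = 2.0 * x + 0.1 * rng.standard_normal(200)

# Closed-form least-squares slope for the 1-D model y ≈ ax.
a_hat = (x @ y) / (x @ x)
```

With enough samples, `a_hat` recovers the underlying slope up to the noise level.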

## The minimum mean square estimate (MMSE)

Suppose we want to minimise the squared error of our estimate on average. The estimation error is $$e = y - \hat y$$ and the squared error is the squared L2-norm of the error, that is, $$\left\|e\right\|^2 = e^T e,$$ and its mean can be written as the expectation $$E\left[\left\|e\right\|^2\right] = E\left[\left\|y-\hat y\right\|^2\right] = E\left[\left\|y-A^T x\right\|^2\right].$$ Formally, the minimum mean square problem can then be written as

$\min_A\, E\left[\left\|y-A^T x\right\|^2\right].$

In general, this cannot be implemented directly, because the abstract expectation operator appears in the middle.

(Advanced derivation begins) To get a computational model, first note that the error expectation can be approximated by the mean over a sample of error vectors $$e_k$$ as

$E\left[\left\|e\right\|^2\right] \approx \frac1N \sum_{k=1}^N \left\|e_k\right\|^2 = \frac1N {\mathrm{tr}}(E^T E),$

where $$E=\left[e_1,\,e_2,\dotsc,e_N\right]$$ and $$\mathrm{tr}(\cdot)$$ is the matrix trace. To minimize the error energy expectation, we can then set its derivative to zero,

$0 = \frac{\partial}{\partial A}\, \frac1N {\mathrm{tr}}(E^T E) = \frac1N \frac{\partial}{\partial A}\, {\mathrm{tr}}\left((Y-A^TX)^T (Y-A^TX)\right) = -\frac2N\, X\left(Y-A^T X\right)^T,$

where the observation matrix is $$X=\left[x_1,\,x_2,\dotsc,x_N\right]$$ and the desired output matrix is $$Y=\left[y_1,\,y_2,\dotsc,y_N\right]$$. (End of advanced derivation)

It follows that the optimal weight matrix A can be solved as

$\boxed{A = \left(XX^T\right)^{-1}XY^T = \left(X^T\right)^\dagger Y^T},$

where the superscript $$\dagger$$ denotes the Moore-Penrose pseudo-inverse.
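As a sketch with made-up toy data, the boxed solution can be computed directly with NumPy; the normal-equation form and the pseudo-inverse form give the same result:

```python
import numpy as np

# Hypothetical data following y = A^T x + noise, with x_k and y_k stacked
# as columns of X and Y, following the conventions of the derivation above.
rng = np.random.default_rng(0)
N = 500
A_true = rng.standard_normal((3, 2))   # made-up "ground-truth" parameters
X = rng.standard_normal((3, N))
Y = A_true.T @ X + 0.01 * rng.standard_normal((2, N))

# Normal-equation form: A = (X X^T)^{-1} X Y^T.
A_hat = np.linalg.solve(X @ X.T, X @ Y.T)

# Equivalent pseudo-inverse form (numerically more robust).
A_pinv = np.linalg.pinv(X.T) @ Y.T
```

Both forms agree, and with low noise and many samples they recover the generating parameters closely.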

### Estimates with a mean parameter

Suppose that instead of an estimate $$\hat y=A^T x$$, we want to include a mean vector in the estimate as $$\hat y=A^T x + \mu$$. While it is possible to derive all of the above equations for this modified model, it is easier to rewrite the model into a similar form as above with

$\hat y = A^T x + \mu = \begin{bmatrix} \mu & A^T \end{bmatrix} \begin{bmatrix} 1 \\ x \end{bmatrix} := A'^T x'.$

That is, we can extend x with a single 1 (and the observation matrix X similarly with a row of constant 1s), and extend A to include the mean vector. With these modifications, the Moore-Penrose pseudo-inverse above can again be used to solve the modified model.
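A minimal sketch of this augmentation trick, with hypothetical noiseless data, shows that the first row of the extended parameter matrix recovers the mean:

```python
import numpy as np

# Hypothetical data generated as y = A^T x + mu (noiseless for clarity).
rng = np.random.default_rng(1)
N = 400
X = rng.standard_normal((2, N))
mu = np.array([3.0, -1.0])
A_true = np.array([[1.0, 0.5], [-0.5, 2.0]])
Y = A_true.T @ X + mu[:, None]

# Extend X with a row of constant 1s, i.e. x' = [1; x] for each column.
X_aug = np.vstack([np.ones((1, N)), X])

# Solve the extended model: A' = (X' X'^T)^{-1} X' Y^T.
A_aug = np.linalg.solve(X_aug @ X_aug.T, X_aug @ Y.T)

mu_hat = A_aug[0]   # first row of A' is the mean vector
A_hat = A_aug[1:]   # remaining rows are the original A
```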

### Estimates with linear equality constraints

In practical situations we often also have linear constraints, such as $$C^T A = B$$, which is equivalent to $$C^T A - B = 0.$$ The modified optimization task is then

$\min_A\, E\left[\left\|y-A^T x\right\|^2\right]\quad\text{such that}\quad C^T A - B = 0.$

For simplicity, let us consider only scalar estimation, where instead of the vector y and the matrices A, B and C, we have the scalar $$\theta$$, the vector a, the scalar b and the vector c, respectively, and the optimization problem is

$\min_a\, E\left[\left\|\theta-a^T x\right\|^2\right]\quad\text{such that}\quad c^T a - b = 0.$

Such constraints can be included in the objective function using the method of Lagrange multipliers, such that the modified objective function is

$\eta(a,\gamma) = E\left[\left\|\theta - a^T x\right\|^2\right] - 2 \gamma \left(c^T a - b\right).$

A heuristic explanation of this objective function is based on the fact that $$\gamma$$ is a free parameter. Since its value can be anything, $$c^T a - b$$ must be zero at the optimum, because otherwise the output value of the objective function could be anything. That is, when optimizing with respect to a, we find the minimum of the mean square error while simultaneously satisfying the constraint.

The objective function can further be rewritten as

$\begin{split}
\eta(a,\gamma) & = E\left[\left(\theta - a^T x\right)\left(\theta - a^T x\right)^T\right] - 2 \gamma \left(c^T a - b\right) \\
& = E\left[\theta^2 - 2\theta x^T a + a^T xx^T a\right] - 2 \gamma \left(c^T a - b\right) \\
& = \sigma_\theta^2 - 2 r^T a + a^T R_x a - 2 \gamma \left(c^T a - b\right) \\
& = \begin{bmatrix} a^T & \gamma \end{bmatrix} \begin{bmatrix} R_x & -c \\ -c^T & 0 \end{bmatrix} \begin{bmatrix} a \\ \gamma \end{bmatrix} - 2 \begin{bmatrix} r^T & -b \end{bmatrix} \begin{bmatrix} a \\ \gamma \end{bmatrix} + \sigma_\theta^2 \\
& := (a'-\mu')^T R' (a'-\mu') + \text{constant},
\end{split}$

where $$r = E[\theta x]$$, $$R_x = E[xx^T]$$, $$a' = \begin{bmatrix} a \\ \gamma \end{bmatrix}$$, $$R' = \begin{bmatrix} R_x & -c \\ -c^T & 0 \end{bmatrix}$$, $$\mu' = R'^{-1}\begin{bmatrix} r \\ -b \end{bmatrix}$$, and "constant" refers to a constant which does not depend on a or $$\gamma$$.

In other words, with a straightforward approach, equality constraints can in general also be merged into a quadratic form. We can therefore reduce constrained problems to unconstrained problems, which can be easily solved.
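As a sketch of the scalar constrained problem with sample estimates (the data and the constraint below are hypothetical), the stationary point of $$\eta(a,\gamma)$$ can be found by solving one block linear system:

```python
import numpy as np

# Hypothetical scalar-estimation data: columns of X are observations x_k,
# theta holds the scalar targets, and the constraint is c^T a = b.
rng = np.random.default_rng(2)
N, n = 300, 3
X = rng.standard_normal((n, N))
theta = rng.standard_normal(N)
c = np.array([1.0, 1.0, 1.0])
b = 1.0                         # e.g. require the weights to sum to 1

# Sample estimates of R_x = E[x x^T] and r = E[theta x].
R_x = X @ X.T / N
r = X @ theta / N

# Setting the gradient of eta(a, gamma) to zero gives the block system
#   [ R_x  -c ] [a    ]   [ r  ]
#   [ -c^T  0 ] [gamma] = [ -b ]
K = np.block([[R_x, -c[:, None]], [-c[None, :], np.zeros((1, 1))]])
sol = np.linalg.solve(K, np.concatenate([r, [-b]]))
a, gamma = sol[:n], sol[n]
```

The solution satisfies the constraint exactly while minimising the sample mean square error.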

Inequality constraints can be reduced to quadratic forms with similar steps as follows. Suppose we have a task $$\min f(x)\ \text{such that}\ x \geq b.$$ We would then first solve the unconstrained problem $$\min f(x)$$ and check whether the constraint is satisfied. If not, then the inequality constraint is "active", that is, we must have $$x = b.$$ We can then rewrite the inequality constraint as an equality constraint and solve the problem as above.

In the case that we have multiple constraints over multiple dimensions, we can merge them one by one until we obtain an unconstrained problem.
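The check-then-activate procedure for a single inequality constraint can be sketched as follows; the function name and the toy data are hypothetical, and the equality case reuses the Lagrangian block system of the previous subsection.

```python
import numpy as np

def constrained_weights(X, theta, c, b):
    """Minimise the sample mean square error subject to c^T a >= b."""
    n, N = X.shape
    R_x = X @ X.T / N               # sample estimate of E[x x^T]
    r = X @ theta / N               # sample estimate of E[theta x]
    a = np.linalg.solve(R_x, r)     # solve the unconstrained problem first
    if c @ a >= b:
        return a                    # constraint inactive: already done
    # Constraint active: enforce c^T a = b via the Lagrangian block system.
    K = np.block([[R_x, -c[:, None]], [-c[None, :], np.zeros((1, 1))]])
    sol = np.linalg.solve(K, np.concatenate([r, [-b]]))
    return sol[:n]

# Hypothetical data on which the constraint typically binds.
rng = np.random.default_rng(4)
X = rng.standard_normal((3, 300))
theta = rng.standard_normal(300)
a = constrained_weights(X, theta, np.ones(3), 1.0)
```

Either branch returns weights that satisfy the inequality constraint.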

## Some applications

• Linear prediction in modelling of the spectral envelope of speech
• Noise attenuation with Wiener filtering

## Discussion

Linear regression is just about the simplest thing you can do to model data, and when it works, that is perfect! Especially for estimating low-dimensional data from high-dimensional data, linear estimates can be very useful. In any case, it is always a good approach to start modelling with the simplest possible model, which usually is a linear model. If nothing else, that gives a good baseline. The first figure on this page demonstrates a case where a linear model does a decent job at modelling the data.

Naturally there are plenty of situations where linear models are insufficient, such as when the data

• follows non-linear relationships,
• has discontinuities, or
• contains multiple classes with different properties.

Moreover, in many cases we are not interested in modelling the average signal, but in recreating a signal which contains all the complexity of the original. Say, if we want to synthesize speech, then "average speech" can sound dull. Instead, we would like to reproduce all the colorfulness and expressiveness of a natural speaker. A model of the statistical distribution of the signal can then be more appropriate, such as the Gaussian mixture model (GMM).

Another related class of models are sub-space models, where the input signal is modelled in a lower-dimensional space such that dimensions related to background noise are cancelled and the desired speech signal is retained.
