## Problem definition

In speech processing and elsewhere, a frequently appearing task is to make a prediction of an unknown vector *y* from available observation vectors *x*. Specifically, we want to have an estimate
\( \hat y = f(x) \)
such that
\( \hat y \approx y. \)
In particular, we will focus on *linear estimates* where
\( \hat y=f(x):=A^T x, \)
and where *A* is a matrix of parameters.

## The minimum mean square estimate (MMSE)

Suppose we want to minimise the squared error of our estimate on average. The estimation error is
\( e=y-\hat y \)
and the squared error is the squared \( L_2 \)-norm of the error, that is,
\( \left\|e\right\|^2 = e^T e \)
and its mean can be written as the expectation
\( E\left[\left\|e\right\|^2\right] = E\left[\left\|y-\hat y\right\|^2\right] = E\left[\left\|y-A^T x\right\|^2\right]. \)
Formally, the minimum mean square problem can then be written as

\[ \min_A\, E\left[\left\|y-A^T x\right\|^2\right]. \]

In general, this cannot be implemented directly, because the expectation in the middle is an abstract operation.

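*(Advanced derivation begins)* To get a computational model, first note that the error expectation can be approximated by the mean over a sample of error vectors *e<sub>k</sub>* as

\[ E\left[\left\|e\right\|^2\right] \approx \frac1N \sum_{k=1}^N \left\|e_k\right\|^2 = \frac1N \operatorname{tr}\left(E^T E\right), \]
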
where
\( E=\left[e_1,\,e_2,\dotsc,e_N\right] \)
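and *tr()* is the matrix trace. To minimize the error energy expectation, we can then set its derivative with respect to *A* to zero

\[ 0 = \frac{\partial}{\partial A}\, \frac1N \operatorname{tr}\left(E^T E\right) = \frac1N \frac{\partial}{\partial A} \operatorname{tr}\left(\left(Y-A^T X\right)^T\left(Y-A^T X\right)\right) = -\frac2N\, X\left(Y-A^T X\right)^T, \]
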
where the observation matrix is
\( X=\left[x_1,\,x_2,\dotsc,x_N\right] \)
and the desired output matrix is
\( Y=\left[y_1,\,y_2,\dotsc,y_N\right]. \)
*(End of advanced derivation)*
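The derivation can be checked numerically. The sketch below assumes the sample-based error matrix \( E = Y - A^T X \) and the gradient \( -\frac2N X E^T \) of \( \frac1N \operatorname{tr}(E^T E) \) with respect to *A*; the dimensions, the random data and the helper name `mean_sq_error` are illustrative choices, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(1)           # fixed seed, illustrative only
d, m, N = 3, 2, 50                       # hypothetical dimensions
X = rng.standard_normal((d, N))          # observations as columns
Y = rng.standard_normal((m, N))          # desired outputs as columns
A = rng.standard_normal((d, m))          # arbitrary parameter matrix

def mean_sq_error(A):
    """Sample mean of the squared error, written as (1/N) tr(E^T E)."""
    E = Y - A.T @ X
    return np.trace(E.T @ E) / N

E = Y - A.T @ X
# The same quantity as the mean of the per-sample squared norms ||e_k||^2.
per_sample = np.mean(np.sum(E ** 2, axis=0))

# Analytic gradient with respect to A, shape (d, m).
analytic_grad = -2.0 / N * X @ E.T

# Central finite differences, one entry of A at a time.
h = 1e-6
numeric_grad = np.zeros_like(A)
for i in range(d):
    for j in range(m):
        step = np.zeros_like(A)
        step[i, j] = h
        numeric_grad[i, j] = (mean_sq_error(A + step)
                              - mean_sq_error(A - step)) / (2 * h)

print(np.isclose(per_sample, mean_sq_error(A)))
print(np.allclose(analytic_grad, numeric_grad, atol=1e-6))
```

Both comparisons should agree up to rounding, since the objective is quadratic in *A* and central differences are then exact apart from floating-point error.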

It follows that the optimal weight matrix *A* can be solved as

\[ A = \left(XX^T\right)^{-1} X Y^T = \left(X^T\right)^\dagger Y^T, \]

where the superscript \( \dagger \) denotes the Moore-Penrose pseudo-inverse.
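As a minimal sketch of this solution, the NumPy snippet below fits *A* on synthetic data. The function name `fit_linear_mmse`, the dimensions, the noise level and the ground truth `A_true` are illustrative assumptions. The pseudo-inverse is applied to \( X^T \), which equals \( (XX^T)^{-1}X \) when \( XX^T \) is invertible.

```python
import numpy as np

def fit_linear_mmse(X, Y):
    """Return the A that minimises sum_k ||y_k - A^T x_k||^2.

    X: (d, N) array with observations x_k as columns.
    Y: (m, N) array with desired outputs y_k as columns.
    """
    # pinv(X^T) @ Y^T is the least-squares solution of X^T A = Y^T.
    return np.linalg.pinv(X.T) @ Y.T

rng = np.random.default_rng(0)            # fixed seed, illustrative only
d, m, N = 4, 2, 500                       # hypothetical dimensions
A_true = rng.standard_normal((d, m))      # hypothetical ground truth
X = rng.standard_normal((d, N))
Y = A_true.T @ X + 0.01 * rng.standard_normal((m, N))  # small additive noise

A_hat = fit_linear_mmse(X, Y)
Y_hat = A_hat.T @ X                       # linear estimates of all y_k
mse = np.mean(np.sum((Y - Y_hat) ** 2, axis=0))
print(A_hat.shape, mse)
```

In practice `np.linalg.lstsq(X.T, Y.T, rcond=None)` solves the same least-squares problem and may be preferable for large or ill-conditioned *X*.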

### Estimates with a mean parameter

Suppose that instead of an estimate \( \hat y=A^T x \) , we want to include a mean vector in the estimate as \( \hat y=A^T x + \mu \) . While it is possible to derive all of the above equations for this modified model, it is easier to rewrite the model into a similar form as above with

\[ \hat y=A^T x + \mu = \begin{bmatrix} \mu & A^T \end{bmatrix} \begin{bmatrix} 1 \\ x \end{bmatrix} = \begin{bmatrix} \mu^T \\ A \end{bmatrix}^T \begin{bmatrix} 1 \\ x \end{bmatrix} := A'^T x'. \]

That is, we extend *x* by a single leading 1 (and, similarly, the observation matrix *X* by a row of constant 1s), and extend *A* to include the mean vector. With these modifications, the Moore-Penrose pseudo-inverse solution above can again be used to solve the modified model.
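The extension can be sketched in NumPy as follows, assuming observations are stored as columns of *X* and the constant 1 is placed first, so that the first row of the extended parameter matrix holds the mean. The function name `fit_affine_mmse` and the synthetic, noise-free data are illustrative assumptions.

```python
import numpy as np

def fit_affine_mmse(X, Y):
    """Fit y ~ A^T x + mu by prepending a row of ones to X.

    X: (d, N) observations as columns; Y: (m, N) desired outputs as columns.
    Returns A with shape (d, m) and mu with shape (m,).
    """
    N = X.shape[1]
    X_ext = np.vstack([np.ones((1, N)), X])   # x' = [1; x] for every sample
    A_ext = np.linalg.pinv(X_ext.T) @ Y.T     # A' = [mu^T; A]
    mu, A = A_ext[0], A_ext[1:]
    return A, mu

rng = np.random.default_rng(2)                # fixed seed, illustrative only
d, m, N = 3, 2, 100                           # hypothetical dimensions
A_true = rng.standard_normal((d, m))          # hypothetical ground truth
mu_true = np.array([1.0, -2.0])               # hypothetical mean vector
X = rng.standard_normal((d, N))
Y = A_true.T @ X + mu_true[:, None]           # noise-free affine data

A_hat, mu_hat = fit_affine_mmse(X, Y)
print(np.allclose(A_hat, A_true), np.allclose(mu_hat, mu_true))
```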

### Estimates with linear equality constraints

*(Advanced derivation begins)*

In practical situations we often also have linear constraints, such as \( C^T A = B \), which is equivalent to \( C^T A - B = 0. \) The modified optimization problem is then

\[ \min_A\, E\left[\left\|y-A^T x\right\|^2\right]\quad\text{such that}\quad C^T A - B = 0. \]

For simplicity, let us consider only scalar estimation, where instead of the vector *y* and the matrices *A*, *B* and *C* we have, respectively, the scalar *θ*, the vector *a*, the scalar *β* and the vector *c*. The optimization problem is then

\[ \min_a\, E\left[\left\|\theta-a^T x\right\|^2\right]\quad\text{such that}\quad c^T a - \beta = 0. \]

Such constraints can be included in the objective function using the method of Lagrange multipliers, such that the modified objective function is

\[ \eta(a,\gamma) = E\, \left[\left\|\theta - a^T x\right\|^2\right] - 2 \gamma \left(c^T a - \beta\right). \]

A heuristic explanation of this objective function is based on the fact that the multiplier *γ* is a free parameter. Since its value can be anything, \( c^T a - \beta \) must be zero, because otherwise the value of the objective function could be anything. That is, when optimizing with respect to *a*, we find the minimum of the mean square error while simultaneously satisfying the constraint.

The objective function can further be rewritten as

\[ \begin{split} \eta(a,\gamma) & = E\, \left[\left(\theta - a^T x\right)\left(\theta - a^T x\right)^T\right] - 2 \gamma \left(c^T a - \beta\right) \\& = E\, \left[\theta^2 - 2\theta x^T a + a^T xx^T a\right] - 2 \gamma \left(c^T a - \beta\right) \\& = \sigma_\theta^2 - 2 r_{\theta x}^T a + a^T R_x a - 2 \gamma \left(c^T a - \beta\right) \\& = \sigma_\theta^2 + \begin{bmatrix} a^T & \gamma \end{bmatrix} \begin{bmatrix} R_x & -c \\ -c^T & 0 \end{bmatrix} \begin{bmatrix} a \\ \gamma \end{bmatrix} -2 \begin{bmatrix} r_{\theta x}^T & -\beta \end{bmatrix} \begin{bmatrix} a \\ \gamma \end{bmatrix} , \end{split} \]

where \( \sigma_\theta^2 = E\left[\theta^2\right] \), \( r_{\theta x} = E\left[\theta x\right] \) and \( R_x = E\left[xx^T\right] \).

TBC

*(Advanced derivation ends)*
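The derivation above is left unfinished. One standard way to proceed, given here only as a hedged sketch and not taken from the text, is to set the gradients of the Lagrangian with respect to *a* and the multiplier to zero, which gives the linear system \( R_x a - \gamma c = r_{\theta x} \), \( c^T a = \beta \). The snippet below solves that system with expectations replaced by sample means; the function name `fit_constrained_mmse`, the symbol `r` for the sample estimate of \( E[\theta x] \), and all data are assumptions made for illustration.

```python
import numpy as np

def fit_constrained_mmse(X, theta, c, beta):
    """Minimise mean_k (theta_k - a^T x_k)^2 subject to c^T a = beta.

    X: (d, N) observations as columns; theta: (N,) scalar targets;
    c: (d,) constraint vector; beta: scalar constraint value.
    Returns the weights a with shape (d,) and the multiplier gamma.
    """
    d, N = X.shape
    R = X @ X.T / N                      # sample estimate of R_x = E[x x^T]
    r = X @ theta / N                    # sample estimate of E[theta x]
    # Stationarity conditions: [R, -c; -c^T, 0] [a; gamma] = [r; -beta].
    K = np.block([[R, -c[:, None]],
                  [-c[None, :], np.zeros((1, 1))]])
    rhs = np.concatenate([r, [-beta]])
    sol = np.linalg.solve(K, rhs)
    return sol[:d], sol[d]

rng = np.random.default_rng(3)           # fixed seed, illustrative only
d, N = 3, 200                            # hypothetical dimensions
X = rng.standard_normal((d, N))
a_true = np.array([0.5, -1.0, 2.0])      # hypothetical ground truth
theta = a_true @ X + 0.1 * rng.standard_normal(N)
c = np.ones(d)                           # example constraint: weights sum to 1
beta = 1.0

a_hat, gamma = fit_constrained_mmse(X, theta, c, beta)
print(c @ a_hat)                         # should equal beta up to rounding
```

The block matrix is the same one that appears in the quadratic form of the objective, which is why that form is convenient: the constrained minimiser is obtained from a single linear solve.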