Problem definition

In speech processing and elsewhere, a frequently appearing task is to predict an unknown vector y from an available observation vector x. Specifically, we want an estimate

\hat y = f(x)

such that

\hat y \approx y.

In particular, we will focus on linear estimates where

\hat y=f(x):=A^T x,

and where A is a matrix of parameters.

The minimum mean square error (MMSE) estimate

Suppose we want to minimise the squared error of our estimate on average. The estimation error is

e=y-\hat y

and the squared error is the squared (L2-)norm of the error, that is,

\left\|e\right\|^2 = e^T e

and its mean can be written as the expectation

E\left[\left\|e\right\|^2\right] = E\left[\left\|y-\hat y\right\|^2\right] = E\left[\left\|y-A^T x\right\|^2\right].

Formally, the minimum mean square error problem can then be written as

\min_A\, E\left[\left\|y-A^T x\right\|^2\right].

In general, this cannot be implemented directly, because we have the abstract expectation operation in the middle.

(Advanced derivation begins) To get a computational model, first note that the error expectation can be approximated by the average over a sample of N error vectors e_k, that is,

E\left[\left\|e\right\|^2\right] \approx \frac1N \sum_{k=1}^N \left\|e_k\right\|^2 = \frac1N {\mathrm{tr}}(E^T E),

where E = \begin{bmatrix} e_1 & e_2 & \cdots & e_N \end{bmatrix} collects the error vectors as columns and tr() is the matrix trace. To minimize the error energy expectation, we can then set its derivative to zero,

0 = \frac{\partial}{\partial A} \frac1N {\mathrm{tr}}(E^T E)
= \frac1N\frac{\partial}{\partial A} {\mathrm{tr}}\left((Y-A^TX)^T (Y-A^TX)\right) = -\frac2N X(Y-A^T X)^T,

where the observation matrix is

X = \begin{bmatrix} x_1 & x_2 & \cdots & x_N \end{bmatrix}

and the desired output matrix is

Y = \begin{bmatrix} y_1 & y_2 & \cdots & y_N \end{bmatrix},

such that E = Y - A^T X. (End of advanced derivation)
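As a quick numerical check of the trace identity above, the following NumPy sketch (with illustrative synthetic data) compares the sample mean of squared errors to tr(E^T E)/N:

```python
import numpy as np

# Sketch: verify numerically that the sample mean of squared errors
# equals tr(E^T E) / N, where E stacks the error vectors e_k as columns.
# The data here is synthetic, for illustration only.
rng = np.random.default_rng(0)
N, dim = 100, 3
E = rng.standard_normal((dim, N))  # columns e_1, ..., e_N

mean_sq = np.mean([E[:, k] @ E[:, k] for k in range(N)])
trace_form = np.trace(E.T @ E) / N

print(np.isclose(mean_sq, trace_form))  # prints True: the two forms agree
```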

It follows that the optimal weight matrix A can be solved as

\boxed{A =  \left(XX^T\right)^{-1}XY^T = \left(X^\dagger\right)^T Y^T},

where the superscript \dagger denotes the Moore–Penrose pseudo-inverse.
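As a sketch, the boxed solution can be computed with NumPy; the shapes, names and synthetic data below are illustrative assumptions, with observations and targets stored as the columns of X and Y:

```python
import numpy as np

# Sketch: recover the weight matrix A from data via the closed-form
# solution A = (X X^T)^{-1} X Y^T. Dimensions are illustrative.
rng = np.random.default_rng(1)
n_x, n_y, N = 4, 2, 500

A_true = rng.standard_normal((n_x, n_y))
X = rng.standard_normal((n_x, N))   # observations x_k as columns
Y = A_true.T @ X                    # noiseless targets y_k = A^T x_k

# Closed form via the pseudo-inverse: pinv(X.T) = (X X^T)^{-1} X
A_hat = np.linalg.pinv(X.T) @ Y.T
# Equivalent, numerically preferable: least squares on X^T A = Y^T
A_lstsq, *_ = np.linalg.lstsq(X.T, Y.T, rcond=None)

print(np.allclose(A_hat, A_true))    # prints True in the noiseless case
print(np.allclose(A_lstsq, A_true))  # prints True
```

In practice one would use the least-squares routine rather than forming the pseudo-inverse explicitly, since it is more robust to ill-conditioned X.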

Estimates with a mean parameter

Suppose that instead of an estimate

\hat y=A^T x

, we want to include a mean vector in the estimate as

\hat y=A^T x + \mu

. While it is possible to re-derive all of the above equations for this modified model, it is easier to rewrite the model into the same form as above:

\hat y = A^T x + \mu
= \begin{bmatrix} \mu & A^T \end{bmatrix}
\begin{bmatrix} 1 \\ x \end{bmatrix}
:= A'^T x',
\qquad\text{where}\quad
A' = \begin{bmatrix} \mu^T \\ A \end{bmatrix},
\quad
x' = \begin{bmatrix} 1 \\ x \end{bmatrix}.

That is, we can extend x by a single 1 (and the observation matrix X similarly with a row of constant 1s), and extend A to include the mean vector. With these modifications, the above Moore–Penrose pseudo-inverse solution can again be used to solve the modified model.
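A minimal NumPy sketch of this augmentation, with illustrative dimensions and synthetic data:

```python
import numpy as np

# Sketch: include a mean term by appending a constant-1 row to X,
# so that y = A^T x + mu becomes y = A'^T x'. Data is synthetic.
rng = np.random.default_rng(2)
n_x, n_y, N = 3, 2, 400

A_true = rng.standard_normal((n_x, n_y))
mu_true = rng.standard_normal(n_y)
X = rng.standard_normal((n_x, N))
Y = A_true.T @ X + mu_true[:, None]      # y_k = A^T x_k + mu

X_aug = np.vstack([np.ones((1, N)), X])  # x' = [1; x] for every sample
A_aug, *_ = np.linalg.lstsq(X_aug.T, Y.T, rcond=None)

mu_hat = A_aug[0, :]   # first row of A' holds the mean vector mu^T
A_hat = A_aug[1:, :]   # remaining rows hold A
```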

Estimates with linear equality constraints

(Advanced derivation begins)

In practical situations we often also have linear constraints, such as

C^T A = B,

which is equivalent to

C^T A - B = 0.

The modified optimization problem is then

\min_A\, E\left[\left\|y-A^T x\right\|^2\right]\quad\text{such that}\quad C^T A - B = 0.

For simplicity, let us consider only scalar estimation, where instead of the vector y and the matrices A, B and C we have, respectively, a scalar θ, a vector a, a scalar b and a vector c, and the optimization problem is

\min_a\, E\left[\left\|\theta-a^T x\right\|^2\right]\quad\text{such that}\quad c^T a - b = 0.

Such constraints can be included into the objective function using the method of Lagrange multipliers such that the modified objective function is

\eta(a,\gamma) = E\, \left[\left(\theta - a^T x\right)^2\right] - 2 \gamma \left(c^T a - b\right).

A heuristic explanation of this objective function is based on the fact that the Lagrange multiplier γ is a free parameter. Since its value can be anything, any stationary point must satisfy

c^T a - b = 0,

because otherwise the value of the objective function could be changed arbitrarily by varying γ. That is, when optimizing with respect to a and γ, we find the minimum of the mean square error while simultaneously satisfying the constraint.

The objective function can further be rewritten as

\eta(a,\gamma)
= E\, \left[\left(\theta - a^T x\right)^2\right] - 2 \gamma \left(c^T a - b\right)
\\= E\, \left[\theta^2 - 2\theta x^T a + a^T xx^T a\right] - 2 \gamma \left(c^T a - b\right)
\\= \sigma_\theta^2 - 2 p^T a + a^T R_x a - 2 \gamma \left(c^T a - b\right)
\\= \sigma_\theta^2
+ \begin{bmatrix} a^T & \gamma \end{bmatrix}
\begin{bmatrix} R_x & -c \\ -c^T & 0 \end{bmatrix}
\begin{bmatrix} a \\ \gamma \end{bmatrix}
- 2 \begin{bmatrix} p^T & -b \end{bmatrix}
\begin{bmatrix} a \\ \gamma \end{bmatrix},

where \sigma_\theta^2 = E[\theta^2], R_x = E[x x^T] and p = E[\theta x]. Setting the derivatives with respect to a and γ to zero then yields the linear system

\begin{bmatrix} R_x & -c \\ -c^T & 0 \end{bmatrix}
\begin{bmatrix} a \\ \gamma \end{bmatrix}
= \begin{bmatrix} p \\ -b \end{bmatrix}.


(Advanced derivation ends)
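The constrained estimate can be sketched numerically: setting the derivatives of η with respect to a and γ to zero gives a block linear system in (a, γ), solved below with NumPy using sample estimates of the expectations. All names, dimensions and data are illustrative assumptions:

```python
import numpy as np

# Sketch: solve  min_a E[(theta - a^T x)^2]  s.t.  c^T a = b
# via the method of Lagrange multipliers. Sample averages stand in
# for the expectations R_x = E[x x^T] and p = E[theta x].
rng = np.random.default_rng(3)
n, N = 4, 2000

x = rng.standard_normal((n, N))
theta = rng.standard_normal(N)
c = rng.standard_normal(n)
b = 1.0

R = x @ x.T / N        # sample estimate of R_x
p = x @ theta / N      # sample estimate of p

# Block system:  [R  -c; -c^T  0] [a; gamma] = [p; -b]
M = np.block([[R, -c[:, None]], [-c[None, :], np.zeros((1, 1))]])
rhs = np.concatenate([p, [-b]])
sol = np.linalg.solve(M, rhs)
a, gamma = sol[:n], sol[n]

print(np.isclose(c @ a, b))  # prints True: the constraint holds
```

The first block row of the system enforces stationarity (R_x a - γc = p) and the second enforces the constraint, so the solution satisfies both simultaneously.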