## Problem definition

In speech processing and elsewhere, a frequent task is to predict an unknown vector *y* from an available observation vector *x*. Specifically, we want an estimate
\( \hat y = f(x) \)
such that
\( \hat y \approx y. \)
In particular, we will focus on *linear estimates* where
\( \hat y=f(x):=A^T x, \)
and where *A* is a matrix of parameters.
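As a minimal sketch of the linear estimate, the following NumPy snippet evaluates \( \hat y = A^T x \) for one observation. The dimensions `n` and `m` and the random values are assumptions chosen only for illustration; the text above does not fix them.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 4, 2                        # assumed lengths of x and y
A = rng.standard_normal((n, m))    # parameter matrix, one column per output
x = rng.standard_normal(n)         # a single observation vector

y_hat = A.T @ x                    # linear estimate, shape (m,)
print(y_hat.shape)
```

Each element of `y_hat` is the inner product of one column of `A` with `x`, so every output is a weighted sum of the observations.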

## The minimum mean square estimate (MMSE)

Suppose we want to minimise the squared error of our estimate on average. The estimation error is
\( e=y-\hat y \)
and the squared error is the squared \( L_2 \)-norm of the error, that is,
\( \left\|e\right\|^2 = e^T e \)
and its mean can be written as the expectation
\( E\left[\left\|e\right\|^2\right] = E\left[\left\|y-\hat y\right\|^2\right] = E\left[\left\|y-A^T x\right\|^2\right]. \)
Formally, the minimum mean square problem can then be written as
\( \min_A\, E\left[\left\|y-A^T x\right\|^2\right]. \)

In general, this cannot be implemented directly, because the objective contains the abstract expectation operator.

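*(Advanced derivation begins)* To get a computational model, first note that the error expectation can be approximated by the mean over a sample of *N* error vectors \( e_k \) as
\( E\left[\left\|e\right\|^2\right] \approx \frac1N \sum_{k=1}^N \left\|e_k\right\|^2 = \frac1N {\mathrm{tr}}(E^T E), \)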
where
\( E=[e_1,\,e_2,\dotsc,e_N] \)
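is the matrix of error vectors (not to be confused with the expectation operator \( E[\cdot] \)) and *tr()* is the matrix trace. To minimise the error energy expectation, we can then set its derivative with respect to *A* to zero
\( 0 = \frac{\partial}{\partial A} \frac1N {\mathrm{tr}}(E^T E) = \frac{\partial}{\partial A} \frac1N {\mathrm{tr}}\left((Y-A^T X)^T(Y-A^T X)\right) = -\frac2N X\left(Y-A^T X\right)^T, \)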
where the observation matrix is
\( X=[x_1,\,x_2,\dotsc,x_N] \)
and the desired output matrix is
\( Y=[y_1,\,y_2,\dotsc,y_N]. \)
*(End of advanced derivation)*
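The two steps of the derivation can be checked numerically. The sketch below assumes random data and arbitrary dimensions, with hypothetical helper names `mean_sq_error` and `gradient`. It compares the trace form of the sample error with a direct per-sample average, and compares the analytic derivative \( -\frac2N X(Y-A^TX)^T \) with a finite-difference approximation.

```python
import numpy as np

rng = np.random.default_rng(1)

n, m, N = 3, 2, 500                  # assumed dimensions and sample size
X = rng.standard_normal((n, N))      # observations, one column per sample
Y = rng.standard_normal((m, N))      # desired outputs, one column per sample
A = rng.standard_normal((n, m))      # an arbitrary parameter matrix


def mean_sq_error(A, X, Y):
    """Sample mean of the squared error, written as tr(E^T E) / N."""
    E = Y - A.T @ X
    return np.trace(E.T @ E) / X.shape[1]


def gradient(A, X, Y):
    """Analytic derivative of mean_sq_error with respect to A."""
    return -2.0 / X.shape[1] * X @ (Y - A.T @ X).T


# Direct per-sample average of ||e_k||^2, for comparison with the trace form.
E = Y - A.T @ X
per_sample = np.mean(np.sum(E**2, axis=0))

# Central finite differences, one entry of A at a time.
eps = 1e-6
num_grad = np.zeros_like(A)
for i in range(n):
    for j in range(m):
        D = np.zeros_like(A)
        D[i, j] = eps
        num_grad[i, j] = (
            mean_sq_error(A + D, X, Y) - mean_sq_error(A - D, X, Y)
        ) / (2 * eps)

print(np.isclose(mean_sq_error(A, X, Y), per_sample))
print(np.allclose(num_grad, gradient(A, X, Y), atol=1e-5))
```

Forming `E.T @ E` builds an *N*-by-*N* matrix only to read its diagonal, so for large samples `np.sum(E**2) / N` computes the same value more cheaply.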

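It follows that the optimal weight matrix *A* can be solved as
\( A = \left(XX^T\right)^{-1}XY^T = \left(X^T\right)^\dagger Y^T, \)
where the superscript \( \dagger \) denotes the Moore-Penrose pseudo-inverse and \( XX^T \) is assumed to be invertible.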
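As an illustration, the solution can be computed with NumPy in a few equivalent ways. This is a sketch on synthetic data: the "true" matrix `A_true`, the noise level and the dimensions are assumptions made only so that the result can be checked.

```python
import numpy as np

rng = np.random.default_rng(2)

n, m, N = 3, 2, 1000                 # assumed dimensions and sample size
A_true = rng.standard_normal((n, m)) # hypothetical ground-truth parameters
X = rng.standard_normal((n, N))      # observations, one column per sample
Y = A_true.T @ X + 0.1 * rng.standard_normal((m, N))  # noisy outputs

# Normal equations: solve (X X^T) A = X Y^T without forming an explicit inverse.
A_normal = np.linalg.solve(X @ X.T, X @ Y.T)

# Pseudo-inverse form: A = pinv(X^T) Y^T.
A_pinv = np.linalg.pinv(X.T) @ Y.T

# Least-squares solver applied to X^T A = Y^T.
A_lstsq, *_ = np.linalg.lstsq(X.T, Y.T, rcond=None)

print(np.allclose(A_normal, A_pinv), np.allclose(A_normal, A_lstsq))
print(np.max(np.abs(A_normal - A_true)))
```

When \( XX^T \) is well conditioned the three results agree to numerical precision. The pseudo-inverse and least-squares routes are usually the safer choice when \( XX^T \) is close to singular.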