Ex. 2.9
Consider a linear regression model with \(p\) parameters, fit by least squares to a set of training data \((x_1, y_1), \ldots, (x_N, y_N)\) drawn at random from a population. Let \(\hat\beta\) be the least squares estimate. Suppose we have some test data \((\tilde x_1, \tilde y_1), \ldots, (\tilde x_M, \tilde y_M)\) drawn at random from the same population as the training data. If \(R_{tr}(\beta) = \frac{1}{N}\sum_{i=1}^N(y_i-\beta^Tx_i)^2\) and \(R_{te}(\beta) = \frac{1}{M}\sum_{i=1}^M(\tilde y_i-\beta^T\tilde x_i)^2\), prove that
\[
E[R_{tr}(\hat\beta)] \le E[R_{te}(\hat\beta)],
\]
where the expectations are over all that is random in each expression. [This exercise was brought to our attention by Ryan Tibshirani, from a homework assignment given by Andrew Ng.]
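Before the proof, it may help to see the claim numerically. Below is a minimal Monte Carlo sketch, assuming (purely for illustration) a Gaussian linear population; the dimension, sample sizes, seed, and coefficient are arbitrary choices, not part of the exercise.

```python
import numpy as np

rng = np.random.default_rng(0)
p, N, M, reps = 5, 20, 20, 2000
beta_true = rng.normal(size=p)      # illustrative population coefficient

r_tr, r_te = [], []
for _ in range(reps):
    # Training and test data drawn from the same population.
    X = rng.normal(size=(N, p))
    y = X @ beta_true + rng.normal(size=N)
    Xt = rng.normal(size=(M, p))
    yt = Xt @ beta_true + rng.normal(size=M)
    # Least squares fit on the training data only.
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    r_tr.append(np.mean((y - X @ beta_hat) ** 2))
    r_te.append(np.mean((yt - Xt @ beta_hat) ** 2))

# The averages estimate the two expectations; the first should be smaller.
print(f"E[R_tr(beta_hat)] ~ {np.mean(r_tr):.3f}  <=  E[R_te(beta_hat)] ~ {np.mean(r_te):.3f}")
```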
Soln. 2.9
Note that both the design matrix \(\textbf{X}\) (whose \(i\)th row is \(x_i^T\)) and the response vector \(\textbf{Y} = (y_1, \ldots, y_N)^T\) are random, since the training data are drawn at random from the population.
When \(\textbf{X}^T\textbf{X}\) is non-singular, we know that
\[
\hat\beta = (\textbf{X}^T\textbf{X})^{-1}\textbf{X}^T\textbf{Y},
\]
which is also random. When \(\textbf{X}^T\textbf{X}\) is singular, the closed-form expression above no longer holds; however, there still exists a measurable function \(\phi\) such that
\[
\hat\beta = \phi(\textbf{X}, \textbf{Y}).
\]
In either case, \(\hat\beta\) is a function of the training data alone, and hence independent of the test data.
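The function \(\phi\) is needed because the minimizer of \(R_{tr}\) is not unique in the singular case. One concrete measurable choice is the minimum-norm least squares solution \(\hat\beta = \textbf{X}^+\textbf{Y}\), where \(\textbf{X}^+\) is the Moore-Penrose pseudo-inverse; a small sketch using NumPy:

```python
import numpy as np

def phi(X, Y):
    # Minimum-norm least squares solution: a measurable function of (X, Y)
    # that minimizes R_tr whether or not X^T X is singular.
    return np.linalg.pinv(X) @ Y

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 2))
X = np.hstack([X, X[:, :1]])   # duplicate a column, so X^T X is singular
Y = rng.normal(size=10)
print(phi(X, Y))               # still well defined
```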
Recall the definition of \(\hat\beta\) and the IID assumption on \((x_i, y_i)\) for \(i=1,\ldots,N\). Since \(\hat\beta\) minimizes \(R_{tr}\), for any fixed \(\beta\) and any \(i=1,\ldots,N\) we have
\begin{equation}
E[R_{tr}(\hat\beta)] \le E[R_{tr}(\beta)] = \frac{1}{N}\sum_{j=1}^N E(y_j-\beta^Tx_j)^2 = E(y_i-\beta^Tx_i)^2.
\label{eq:2-9a}
\end{equation}
Assume \(E[x_1x_1^T]\) is positive definite (equivalently, \(x_1\) does not lie in a fixed proper subspace of \(\mathbb{R}^p\) almost surely), and let
\[
\beta^* = \arg\min_\beta E(y_1-\beta^Tx_1)^2 = \left(E[x_1x_1^T]\right)^{-1}E[x_1y_1]
\]
be the population least squares coefficient.
Plugging \(\beta = \beta^*\) into \(\eqref{eq:2-9a}\) with \(i=1\), and using the IID assumption on \((\tilde x_i, \tilde y_i)\) for \(i=1,\ldots,M\) together with the independence between \(\hat\beta\) and the test data, we have
\begin{align*}
E[R_{tr}(\hat\beta)] &\le E(y_1-{\beta^*}^Tx_1)^2 = E(\tilde y_1-{\beta^*}^T\tilde x_1)^2\\
&= \min_\beta E(\tilde y_1-\beta^T\tilde x_1)^2\\
&\le E\left[E\left((\tilde y_1-\hat\beta^T\tilde x_1)^2\mid\hat\beta\right)\right] = E(\tilde y_1-\hat\beta^T\tilde x_1)^2\\
&= \frac{1}{M}\sum_{i=1}^M E(\tilde y_i-\hat\beta^T\tilde x_i)^2 = E[R_{te}(\hat\beta)],
\end{align*}
which completes the proof.
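The sandwich role of \(\beta^*\) in the chain above can also be checked numerically. Below is a sketch assuming a population with \(E[x_1x_1^T] = I\), so that \(\beta^*\) coincides with the coefficient used to simulate the data; all names and sizes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
p, N, reps = 3, 15, 4000
beta_star = rng.normal(size=p)   # with E[x x^T] = I, beta* is the true coefficient

r_tr, risk_star, r_te = [], [], []
for _ in range(reps):
    X = rng.normal(size=(N, p))
    y = X @ beta_star + rng.normal(size=N)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    # One fresh test point suffices, since E[R_te] does not depend on M.
    xt = rng.normal(size=p)
    yt = xt @ beta_star + rng.normal()
    r_tr.append(np.mean((y - X @ beta_hat) ** 2))
    risk_star.append((yt - xt @ beta_star) ** 2)
    r_te.append((yt - xt @ beta_hat) ** 2)

print(f"E[R_tr] ~ {np.mean(r_tr):.3f}  <=  "
      f"E(y - beta*'x)^2 ~ {np.mean(risk_star):.3f}  <=  "
      f"E[R_te] ~ {np.mean(r_te):.3f}")
```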