Ex. 2.7
Suppose we have a sample of \(N\) pairs \(x_i, y_i\) drawn i.i.d. from the distribution characterized as follows:
\[\begin{eqnarray}
&&x_i\sim h(x), \text{ the design density}\nonumber\\
&&y_i = f(x_i) + \epsilon_i, f \text{ is the regression function}\nonumber\\
&&\epsilon_i\sim (0, \sigma^2)\ (\text{mean zero, variance } \sigma^2)\nonumber
\end{eqnarray}\]
We construct an estimate for \(f\) linear in the \(y_i\),
\[\begin{equation}
\label{eq:2-7linear}
\hat f(x_0) = \sum_{i=1}^N\ell_i(x_0;\mathcal{X})y_i,
\end{equation}\]
where the weights \(\ell_i(x_0;\mathcal{X})\) do not depend on the \(y_i\), but do depend on the entire training sequence of \(x_i\), denoted here by \(\mathcal{X}\).
(a)
Show that linear regression and \(k\)-nearest-neighbor regression are members of this class of estimators. Describe explicitly the weights \(\ell_i(x_0;\mathcal{X})\) in each of those cases.
(b)
Decompose the conditional mean-squared error
\[\begin{equation}
E_{\mathcal{Y}|\mathcal{X}}(f(x_0)-\hat f(x_0))^2\nonumber
\end{equation}\]
into a conditional squared bias and a conditional variance component. Here \(\mathcal{Y}\), like \(\mathcal{X}\), denotes the entire training sequence of \(y_i\).
(c)
Decompose the (unconditional) mean-squared error
\[\begin{equation}
E_{\mathcal{Y}, \mathcal{X}}(f(x_0)-\hat f(x_0))^2\nonumber
\end{equation}\]
into a squared bias and a variance component.
(d)
Establish a relationship between the squared biases and variances in the above two cases.
Remark
A smoother \(\hat f\) is called a linear smoother (see Section 5.4 in the text for more details) if it has the form
\[\begin{equation}
{\hat f} = \textbf{S}\textbf{y}.\nonumber
\end{equation}\]
Note that the linearity implies that \(\textbf{S}\) does not depend on \(\textbf{y}\).
For linear regression, \(\textbf{S} = \textbf{X}(\textbf{X}^T\textbf{X})^{-1}\textbf{X}^T\). As for the bias and variance, we have
\[\begin{eqnarray}
\text{Bias}({\hat f}) &=& \textbf{f} - \textbf{S}\textbf{f},\nonumber\\
\text{Cov}({\hat f}) &=&\textbf{S}\text{Cov}(\textbf{y})\textbf{S}^T.\nonumber
\end{eqnarray}\]
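As a quick numerical illustration of this remark (a NumPy sketch that is not part of the exercise; the toy data, \(\sigma\), and all variable names are arbitrary choices), for ordinary least squares \(\textbf{S}\) is the hat matrix, a symmetric projection, so with \(\text{Cov}(\textbf{y}) = \sigma^2\textbf{I}\) the covariance formula collapses to \(\textbf{S}\text{Cov}(\textbf{y})\textbf{S}^T = \sigma^2\textbf{S}\textbf{S}^T = \sigma^2\textbf{S}\):
```python
import numpy as np

rng = np.random.default_rng(0)

# Toy design matrix with an intercept column (illustrative values only).
N, sigma = 20, 0.3
x = rng.uniform(-1, 1, size=N)
X = np.column_stack([np.ones(N), x])

# Hat matrix of ordinary least squares: S = X (X^T X)^{-1} X^T.
S = X @ np.linalg.inv(X.T @ X) @ X.T

# With Cov(y) = sigma^2 I, the remark gives Cov(f_hat) = sigma^2 S S^T,
# and since S is a symmetric projection this equals sigma^2 S.
assert np.allclose(sigma**2 * S @ S.T, sigma**2 * S)
```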
Soln. 2.7
(a)
For linear regression with an intercept, the rows of \(\textbf{X}\) are \((1, x_i)\), so that
\[\begin{equation}
\hat f(x_0) = [1, x_0](\textbf{X}^T\textbf{X})^{-1}\textbf{X}^T\textbf{y},\nonumber
\end{equation}\]
and therefore
\[\begin{equation}
\ell_i(x_0;\mathcal{X}) = [1, x_0](\textbf{X}^T\textbf{X})^{-1}\begin{pmatrix}
1 \\
x_i
\end{pmatrix}.\nonumber
\end{equation}\]
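As a sanity check (an illustrative NumPy sketch, not part of the solution; the toy data and names such as `ell` are arbitrary), these weights reproduce the least-squares prediction at \(x_0\):
```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D training sample and a query point (illustrative values only).
N = 50
x = rng.uniform(-1, 1, size=N)
y = np.sin(np.pi * x) + rng.normal(scale=0.3, size=N)
x0 = 0.25

# Design matrix with an intercept column, so row i is (1, x_i).
X = np.column_stack([np.ones(N), x])

# ell_i(x0; X) = [1, x0] (X^T X)^{-1} (1, x_i)^T, computed for all i at once.
XtX_inv = np.linalg.inv(X.T @ X)
ell = np.array([1.0, x0]) @ XtX_inv @ X.T        # shape (N,)

# The weighted sum of the y_i reproduces the least-squares fit at x0.
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
assert np.isclose(ell @ y, np.array([1.0, x0]) @ beta_hat)
```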
For \(k\)-nearest-neighbor regression, we have
\[\begin{equation}
\hat f(x_0) = \frac{1}{k}\sum_{x_i\in N_k(x_0)}y_i,\nonumber
\end{equation}\]
where \(N_k(x_0)\) is the neighborhood of \(x_0\) defined by the \(k\) closest points \(x_i\) in the training sample. Therefore,
\[\begin{equation}
\ell_i(x_0;\mathcal{X}) = \begin{cases}
\frac{1}{k}, & \text{if } x_i \in N_k(x_0)\\
0, & \text{otherwise.}\nonumber
\end{cases}
\end{equation}\]
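The same kind of check works for the \(k\)-NN weights (again an illustrative sketch with arbitrary data and \(k\)):
```python
import numpy as np

rng = np.random.default_rng(1)

N, k = 50, 7
x = rng.uniform(-1, 1, size=N)
y = np.sin(np.pi * x) + rng.normal(scale=0.3, size=N)
x0 = 0.25

# Indices of the k nearest x_i to x0, then the 0-or-1/k weight vector.
nearest = np.argsort(np.abs(x - x0))[:k]
ell = np.zeros(N)
ell[nearest] = 1.0 / k

# The weighted sum reproduces the k-NN average at x0.
assert np.isclose(ell @ y, y[nearest].mean())
```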
(b)
Note that here \(\mathcal{X}\) is held fixed and the randomness comes from \(\mathcal{Y}\) only. We have
\[\begin{eqnarray}
E_{\mathcal{Y}|\mathcal{X}}\left(f(x_0) - \hat f(x_0)\right)^2 &=& f(x_0)^2 - 2f(x_0)E_{\mathcal{Y}|\mathcal{X}}(\hat f(x_0)) + E_{\mathcal{Y}|\mathcal{X}}(\hat f(x_0)^2)\nonumber\\
&=& \left(f(x_0)-E_{\mathcal{Y}|\mathcal{X}}(\hat f(x_0))\right)^2\nonumber\\
&& + E_{\mathcal{Y}|\mathcal{X}}(\hat f(x_0)^2) - \left(E_{\mathcal{Y}|\mathcal{X}}(\hat f(x_0))\right)^2\nonumber\\
&=& \text{Bias}_{\mathcal{Y}|\mathcal{X}}(\hat f(x_0))^2 + \text{Var}_{\mathcal{Y}|\mathcal{X}}(\hat f(x_0)).\nonumber
\end{eqnarray}\]
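A small simulation illustrates this decomposition (a NumPy sketch with arbitrary choices of \(f\), \(\sigma\), and smoother; not part of the exercise): holding \(\mathcal{X}\) fixed and redrawing only the noise, the Monte Carlo MSE coincides with squared bias plus variance.
```python
import numpy as np

rng = np.random.default_rng(2)

# Fixed design X, true f, and noise level (all illustrative choices).
N, k, sigma, x0 = 50, 7, 0.3, 0.25

def f(t):
    return np.sin(np.pi * t)

x = rng.uniform(-1, 1, size=N)     # X is drawn once and then held fixed

# k-NN weights at x0 (any linear smoother would do here).
ell = np.zeros(N)
ell[np.argsort(np.abs(x - x0))[:k]] = 1.0 / k

# Redraw only the y's, conditioning on X.
reps = 100_000
eps = rng.normal(scale=sigma, size=(reps, N))
fhat = (f(x) + eps) @ ell          # one fhat(x0) per replication

mse = np.mean((f(x0) - fhat) ** 2)
bias2 = (f(x0) - fhat.mean()) ** 2
var = fhat.var()

# The identity is exact for sample moments computed from the same draws.
assert np.isclose(mse, bias2 + var)
```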
(c)
The calculation is the same as in (b); we have
\[\begin{eqnarray}
E_{\mathcal{Y}, \mathcal{X}}\left(f(x_0) - \hat f(x_0)\right)^2 &=& f(x_0)^2 - 2f(x_0)E_{\mathcal{Y}, \mathcal{X}}(\hat f(x_0)) + E_{\mathcal{Y}, \mathcal{X}}(\hat f(x_0)^2)\nonumber\\
&=& \left(f(x_0)-E_{\mathcal{Y}, \mathcal{X}}(\hat f(x_0))\right)^2\nonumber\\
&& + E_{\mathcal{Y}, \mathcal{X}}(\hat f(x_0)^2) - \left(E_{\mathcal{Y}, \mathcal{X}}(\hat f(x_0))\right)^2\nonumber\\
&=& \text{Bias}(\hat f(x_0))^2 + \text{Var}(\hat f(x_0)).\nonumber
\end{eqnarray}\]
(d)
From (b) we already see that \(\text{Bias}_{\mathcal{Y}|\mathcal{X}}(\hat f(x_0))\) can be written as
\[\begin{eqnarray}
&& f(x_0) - E_{\mathcal{Y}|\mathcal{X}}\hat f(x_0)\nonumber\\
&=& f(x_0) - \sum_{i=1}^N\ell_i(x_0;\mathcal{X})E_{\mathcal{Y}|\mathcal{X}}(f(x_i) + \epsilon_i)\nonumber\\
&=& f(x_0) - \sum_{i=1}^N\ell_i(x_0;\mathcal{X})f(x_i),\ \ \ \ \ \ \ \ \ \label{eq:2-7biasb}
\end{eqnarray}\]
where the last equality uses \(E_{\mathcal{Y}|\mathcal{X}}(\epsilon_i) = 0\).
Also, we write \(\text{Var}_{\mathcal{Y}|\mathcal{X}}(\hat f(x_0))\) as
\[\begin{eqnarray}
&&E_{\mathcal{Y}|\mathcal{X}}(\hat f(x_0)^2)
- \left(E_{\mathcal{Y}|\mathcal{X}}(\hat f(x_0))\right)^2\nonumber\\
&=& E_{\mathcal{Y}|\mathcal{X}}\left[\sum_{i=1}^N\sum_{j=1}^N\ell_i(x_0;\mathcal{X})\ell_j(x_0;\mathcal{X})(f(x_i)+\epsilon_i)(f(x_j)+\epsilon_j)\right]
\nonumber\\
&& - \left(\sum_{i=1}^N\ell_i(x_0;\mathcal{X})f(x_i)\right)^2\nonumber\\
&=&\sum_{i=1}^N\sum_{j=1}^N\ell_i(x_0;\mathcal{X})\ell_j(x_0;\mathcal{X})f(x_i)f(x_j)\nonumber\\
&& + \sigma^2\sum_{i=1}^N\ell^2_i(x_0;\mathcal{X})\nonumber\\
&& - \sum_{i=1}^N\sum_{j=1}^N\ell_i(x_0;\mathcal{X})\ell_j(x_0;\mathcal{X})f(x_i)f(x_j)\nonumber\\
&=&\sigma^2\sum_{i=1}^N\ell^2_i(x_0;\mathcal{X}),\nonumber
\end{eqnarray}\]
where we used \(E_{\mathcal{Y}|\mathcal{X}}(\epsilon_i) = 0\) and \(E_{\mathcal{Y}|\mathcal{X}}(\epsilon_i\epsilon_j) = \sigma^2\delta_{ij}\).
Denote \(S = (\ell_1(x_0;\mathcal{X}), ..., \ell_N(x_0;\mathcal{X}))^T\) and \(f = (f(x_1),...,f(x_N))^T\). By \(\eqref{eq:2-7biasb}\) and the equation above, we have
\[\begin{eqnarray}
\text{Bias}_{\mathcal{Y}|\mathcal{X}}(\hat f(x_0)) &=& f(x_0) - S^Tf,\nonumber\\
\text{Var}_{\mathcal{Y}|\mathcal{X}}(\hat f(x_0)) &=& \sigma^2S^TS.\nonumber
\end{eqnarray}\]
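As a cross-check on these closed forms (an illustrative sketch mirroring the toy setup in part (a); all names and values are arbitrary), for the linear-regression weights \(\sigma^2S^TS\) reduces to the familiar OLS prediction variance \(\sigma^2[1, x_0](\textbf{X}^T\textbf{X})^{-1}[1, x_0]^T\):
```python
import numpy as np

rng = np.random.default_rng(3)

N, sigma, x0 = 50, 0.3, 0.25
x = rng.uniform(-1, 1, size=N)
X = np.column_stack([np.ones(N), x])
x0vec = np.array([1.0, x0])

# S_i = ell_i(x0; X) for linear regression, stacked into a vector.
XtX_inv = np.linalg.inv(X.T @ X)
S = X @ XtX_inv @ x0vec

# sigma^2 S^T S equals the usual OLS prediction variance at x0.
assert np.isclose(sigma**2 * (S @ S), sigma**2 * (x0vec @ XtX_inv @ x0vec))
```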
Note that \(S^TS\) is a scalar, and by the Cauchy–Schwarz inequality \((S^Tf)^2 = f^TSS^Tf \le (S^TS)(f^Tf)\). Hence
\[\begin{eqnarray}
[\text{Bias}_{\mathcal{Y}|\mathcal{X}}(\hat f(x_0))]^2 &=& \left(f(x_0) - S^Tf\right)^2\nonumber\\
&=&f(x_0)^2 - 2f(x_0)S^Tf + (S^Tf)^2\nonumber\\
&\le&f(x_0)^2 - 2f(x_0)S^Tf + (S^TS)f^Tf\nonumber\\
&=&f(x_0)^2 - 2f(x_0)\Big(f(x_0)-\text{Bias}_{\mathcal{Y}|\mathcal{X}}(\hat f(x_0))\Big)\nonumber\\
&& +\frac{f^Tf}{\sigma^2}\text{Var}_{\mathcal{Y}|\mathcal{X}}(\hat f(x_0)),\nonumber
\end{eqnarray}\]
with equality if and only if \(f\) is proportional to \(S\). This is the relationship between the conditional squared bias and the conditional variance.
For (c), the analogous relationship follows by integrating the conditional quantities against the joint density of \(x_1,...,x_N\), that is, \(h(x_1)\cdots h(x_N)dx_1\cdots dx_N\). In particular, since \(E_{\mathcal{Y},\mathcal{X}} = E_{\mathcal{X}}E_{\mathcal{Y}|\mathcal{X}}\), the decompositions in (b) and (c) are linked by
\[\begin{equation}
\text{Bias}(\hat f(x_0))^2 + \text{Var}(\hat f(x_0)) = E_{\mathcal{X}}\left[\text{Bias}_{\mathcal{Y}|\mathcal{X}}(\hat f(x_0))^2 + \text{Var}_{\mathcal{Y}|\mathcal{X}}(\hat f(x_0))\right].\nonumber
\end{equation}\]
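This last relationship can be checked by simulation (an illustrative NumPy sketch; the \(k\)-NN smoother, \(f\), and all parameter values are arbitrary choices): average the conditional closed forms \((f(x_0)-S^Tf)^2 + \sigma^2S^TS\) over many draws of \(\mathcal{X}\) and compare with the unconditional squared bias plus variance estimated from fresh \((\mathcal{X},\mathcal{Y})\) draws.
```python
import numpy as np

rng = np.random.default_rng(4)

N, k, sigma, x0 = 50, 7, 0.3, 0.25

def f(t):
    return np.sin(np.pi * t)

def knn_weights(x, x0, k):
    """0-or-1/k weight vector ell(x0; X) of a k-NN smoother."""
    ell = np.zeros(len(x))
    ell[np.argsort(np.abs(x - x0))[:k]] = 1.0 / k
    return ell

reps = 50_000
fhat = np.empty(reps)   # unconditional draws of fhat(x0)
cond = np.empty(reps)   # conditional bias^2 + variance given each drawn X

for r in range(reps):
    x = rng.uniform(-1, 1, size=N)                 # new X each replication
    y = f(x) + rng.normal(scale=sigma, size=N)     # new Y given that X
    ell = knn_weights(x, x0, k)
    fhat[r] = ell @ y
    cond[r] = (f(x0) - ell @ f(x)) ** 2 + sigma**2 * ell @ ell

lhs = (f(x0) - fhat.mean()) ** 2 + fhat.var()      # unconditional bias^2 + var
rhs = cond.mean()                                  # E_X[conditional bias^2 + var]
print(lhs, rhs)                                    # agree up to Monte Carlo error
```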