Ex. 6.8

Suppose that for continuous response \(Y\) and predictor \(X\), we model the joint density of \(X, Y\) using a multivariate Gaussian kernel estimator. Note that the kernel in this case would be the product kernel \(\phi_\lambda(X)\phi_\lambda(Y)\). Show that the conditional mean \(E[Y|X]\) derived from this estimate is a Nadaraya-Watson estimator. Extend this result to classification by providing a suitable kernel for the estimation of the joint distribution of a continuous \(X\) and discrete \(Y\).

Soln. 6.8

By definition we get

\[\begin{equation} E[Y|X] = \int y f(y|x)dy = \frac{\int y f(x,y)dy}{f(x)}.\non \end{equation}\]

The Gaussian kernel density estimates of the joint and marginal densities are (see (6.23) in the text)

\[\begin{eqnarray} \hat f(x,y) &=& \frac{1}{N}\sum_{i=1}^N\phi_\lambda(x-x_i)\phi_\lambda(y-y_i)\non\\ \hat f(x) &=& \frac{1}{N}\sum_{i=1}^N\phi_\lambda(x-x_i).\non \end{eqnarray}\]

Thus we have

\[\begin{eqnarray} E[Y|X] &=& \frac{\int y \frac{1}{N}\sum_{i=1}^N\phi_\lambda(x-x_i)\phi_\lambda(y-y_i) dy}{\frac{1}{N}\sum_{i=1}^N\phi_\lambda(x-x_i)}\non\\ &=&\frac{\sum_{i=1}^N\phi_\lambda(x-x_i)\int y\phi_\lambda(y-y_i)dy}{\sum_{i=1}^N\phi_\lambda(x-x_i)}\non\\ &=&\frac{\sum_{i=1}^N\phi_\lambda(x-x_i)y_i}{\sum_{i=1}^N\phi_\lambda(x-x_i)},\label{eq:6-8a} \end{eqnarray}\]

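where the last equality follows because \(\phi_\lambda(y-y_i)\), viewed as a function of \(y\), is a Gaussian density with mean \(y_i\), so that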

\[\begin{equation} \int y\phi_\lambda(y-y_i)dy = y_i.\non \end{equation}\]
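The identity above can be checked numerically. The following is a minimal sketch, assuming a one-dimensional predictor, a common bandwidth `lam` for both coordinates, and synthetic data; the helper names (`gauss`, `nadaraya_watson`, `cond_mean_from_joint`) are hypothetical and chosen only for illustration. It integrates \(y \hat f(x_0, y)\) on a grid and compares the result with the kernel-weighted average of the \(y_i\).

```python
import numpy as np

def gauss(u, lam):
    """Gaussian density with mean 0 and standard deviation lam."""
    return np.exp(-0.5 * (u / lam) ** 2) / (lam * np.sqrt(2.0 * np.pi))

def nadaraya_watson(x0, x, y, lam):
    """Weighted average of y_i with weights phi_lam(x0 - x_i)."""
    w = gauss(x0 - x, lam)
    return np.sum(w * y) / np.sum(w)

def cond_mean_from_joint(x0, x, y, lam, n_grid=4001):
    """Integrate y * f_hat(x0, y) over a y-grid and divide by f_hat(x0)."""
    # The grid extends 10 bandwidths past the data, so truncation is negligible.
    grid = np.linspace(y.min() - 10 * lam, y.max() + 10 * lam, n_grid)
    dy = grid[1] - grid[0]
    wx = gauss(x0 - x, lam)                       # phi(x0 - x_i), shape (N,)
    ky = gauss(grid[:, None] - y[None, :], lam)   # phi(y - y_i), shape (n_grid, N)
    joint = ky @ wx / len(x)                      # f_hat(x0, y) on the grid
    marginal = np.mean(wx)                        # f_hat(x0)
    return np.sum(grid * joint) * dy / marginal   # Riemann sum for the integral

# Synthetic data (an assumption made purely for this illustration).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 3.0, size=50)
y = np.sin(2.0 * x) + 0.3 * rng.normal(size=50)
lam, x0 = 0.3, 1.2

nw = nadaraya_watson(x0, x, y, lam)
cm = cond_mean_from_joint(x0, x, y, lam)
print(abs(nw - cm))  # the two values should agree up to quadrature error
```

The numerical integral is only an approximation, so the two numbers agree up to the quadrature error of the grid rather than exactly.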

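Now consider the case when \(Y\) is discrete. Assume that \(Y\) takes values in a set \(J\subset \mathbb{Z}=\{\cdots, -1, 0, 1, \cdots\}\). If we estimate the distribution of \(Y\) by the naive class frequencies and the density of \(X\) within each class by a Gaussian kernel estimate, then we have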

\[\begin{eqnarray} \hat f(x,i) &=& \hat f(i) \cdot \hat f(x|i)\non\\ &=&\frac{N_i}{N}\cdot\frac{1}{N_i}\sum_{l\in C(i)}\phi_\lambda(x-x_l)\non\\ &=&\frac{1}{N}\sum_{l\in C(i)}\phi_\lambda(x-x_l)\non\\ \hat f(x) &=& \frac{1}{N}\sum_{l=1}^N\phi_\lambda(x-x_l),\non \end{eqnarray}\]

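where \(C(i)\) is the set of indices \(l\) whose responses \(y_l\) are in category \(i\), and \(N_i\) is the size of \(C(i)\). Equivalently, the joint estimate uses the product kernel \(\phi_\lambda(x-x_l)\,\mathbf{1}\{y=y_l\}\). Summing over the categories gives

\[\begin{equation} E[Y|X] = \sum_{j\in J} j\,\frac{\hat f(x,j)}{\hat f(x)} = \frac{\sum_{j\in J}\sum_{l\in C(j)} j\,\phi_\lambda(x-x_l)}{\sum_{l=1}^N\phi_\lambda(x-x_l)} = \frac{\sum_{l=1}^N\phi_\lambda(x-x_l)y_l}{\sum_{l=1}^N\phi_\lambda(x-x_l)},\non \end{equation}\]

since \(y_l = j\) for every \(l\in C(j)\), so \(\eqref{eq:6-8a}\) holds. In the same way, the estimated class posterior \(\hat f(x,j)/\hat f(x)\) is the Nadaraya-Watson estimator applied to the indicator responses \(\mathbf{1}\{y_l=j\}\).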
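The discrete case can be illustrated with a short sketch. It assumes integer class labels in \(\{0, 1, 2\}\), a one-dimensional predictor, and synthetic data; the function names (`gauss`, `class_posteriors`) are hypothetical. The class posteriors built from \(\hat f(x, j)/\hat f(x)\) are compared with kernel-weighted averages of the class indicators, and the implied conditional mean with the kernel-weighted average of the labels.

```python
import numpy as np

def gauss(u, lam):
    """Gaussian density with mean 0 and standard deviation lam."""
    return np.exp(-0.5 * (u / lam) ** 2) / (lam * np.sqrt(2.0 * np.pi))

def class_posteriors(x0, x, labels, classes, lam):
    """Return f_hat(x0, j) / f_hat(x0) for each class j in `classes`."""
    n = len(x)
    w = gauss(x0 - x, lam)
    # f_hat(x0, j) = (1/N) * sum over samples in class j of phi(x0 - x_l)
    joint = np.array([np.sum(w[labels == j]) / n for j in classes])
    marginal = np.sum(w) / n                      # f_hat(x0)
    return joint / marginal

# Synthetic three-class data (an assumption made purely for this illustration).
rng = np.random.default_rng(1)
classes = np.array([0, 1, 2])
labels = rng.integers(0, 3, size=60)
x = labels + 0.8 * rng.normal(size=60)
lam, x0 = 0.5, 1.1

post = class_posteriors(x0, x, labels, classes, lam)

# Kernel-weighted averages of the indicators 1{y_l = j} and of the labels.
w = gauss(x0 - x, lam)
nw_indicator = np.array([np.sum(w * (labels == j)) / np.sum(w) for j in classes])
cond_mean = np.sum(classes * post)                # sum_j j * P_hat(Y = j | x0)
nw_mean = np.sum(w * labels) / np.sum(w)

print(post, post.sum())
print(cond_mean, nw_mean)
```

Classifying `x0` to the class with the largest entry of `post` is then the usual plug-in rule for these estimated posteriors.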

Such an estimate can be viewed as combining a kernel estimate for the continuous component with a frequency estimate for the discrete component. It can be improved further by also smoothing the estimate over the discrete component \(Y\) using a discrete window weight function (see, e.g., Nonparametric estimation of joint discrete-continuous probability densities with applications).