Ex. 6.5
Show that fitting a locally constant multinomial logit model of the form (6.19) amounts to smoothing the binary response indicators for each class separately using a Nadaraya-Watson kernel smoother with kernel weights \(K_\lambda(x_0, x_i)\).
Soln. 6.5
Suppose we smooth the binary response indicators for each class separately using a Nadaraya-Watson kernel smoother; recall (6.2) in the text. For class \(j\in \{1,\dots,K\}\), the binary response is \(y_{ij}=1\) if and only if \(i\in G_j\), where \(G_j\) is the set of indices of the observations in class \(j\). Then we have
\[
\hat f_j(x_0) = \frac{\sum_{i=1}^N K_\lambda(x_0, x_i)\, y_{ij}}{\sum_{i=1}^N K_\lambda(x_0, x_i)} = \frac{\sum_{i\in G_j} K_\lambda(x_0, x_i)}{\sum_{i=1}^N K_\lambda(x_0, x_i)}.
\]
Since the denominator does not depend on \(j\), this means that we classify \(x_0\) to the class \(j\) that maximizes \(\sum_{i\in G_j} K_\lambda(x_0, x_i)\).
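As a quick illustration of this smooth-then-classify rule, here is a minimal numerical sketch; the Gaussian kernel, the simulated data, and the names (`nw_class_probs`, `f_hat`) are our own choices for illustration and are not part of the text.

```python
import numpy as np

def nw_class_probs(X, g, x0, lam, K):
    """Nadaraya-Watson smoothing of the class indicators y_ij = 1{g_i = j}:
    returns the kernel-weighted proportion of each class around x0."""
    # K_lambda(x0, x_i); a Gaussian kernel is assumed purely for illustration
    w = np.exp(-0.5 * np.sum((X - x0) ** 2, axis=1) / lam ** 2)
    return np.array([w[g == j].sum() for j in range(K)]) / w.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
g = rng.integers(0, 3, size=200)          # class labels coded 0, ..., K-1
f_hat = nw_class_probs(X, g, x0=np.zeros(2), lam=0.5, K=3)
print(f_hat, f_hat.argmax())              # classify x0 to the largest smoothed indicator
```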
On the other hand, consider the locally constant multinomial logit model (6.19) in the text. From (6.20) in the text, the fitted posterior probabilities are
\[
\hat{\Pr}(G=j\mid x_0) = \frac{e^{\hat\beta_{j0}(x_0)}}{1+\sum_{k=1}^{K-1} e^{\hat\beta_{k0}(x_0)}},
\]
so we classify \(x_0\) to the class \(j\) that maximizes \(\hat\beta_{j0}\) (with \(\hat\beta_{K0}\equiv 0\) for the reference class \(K\)). It thus suffices to show that \(\hat\beta_{j0}\) is the same non-decreasing function of \(\sum_{i\in G_j} K_\lambda(x_0, x_i)\) for every class \(j\).
Let \(\beta\) denote the parameter set \(\{\beta_{k0}, k=1,\dots,K-1\}\); since the model is locally constant, the slope terms in (6.19) are dropped, and we set \(\beta_{K0}=0\) for the reference class. Coding the classes from \(1\) to \(K\) and writing \(g_i\) for the class of observation \(i\), the local log-likelihood \(l(\beta, x_0)\) can be rewritten as (see, e.g., Ex. 4.4)
\[
l(\beta, x_0) = \sum_{i=1}^N K_\lambda(x_0, x_i)\left[\beta_{g_i 0} - \log\Big(1+\sum_{k=1}^{K-1} e^{\beta_{k0}}\Big)\right].
\]
To maximize the log-likelihood, we set its derivatives to zero and solve the resulting equations for \(\hat\beta_{k0}\). These equations, for \(j=1,\dots,K-1\), are
\[
\frac{\partial l(\beta, x_0)}{\partial \beta_{j0}} = \sum_{i\in G_j} K_\lambda(x_0, x_i) - \frac{e^{\beta_{j0}}}{1+\sum_{k=1}^{K-1} e^{\beta_{k0}}}\sum_{i=1}^N K_\lambda(x_0, x_i) = 0.
\]
Therefore we know that, for \(j=1,\dots,K-1\),
\[
\frac{e^{\hat\beta_{j0}}}{1+\sum_{k=1}^{K-1} e^{\hat\beta_{k0}}} = \frac{\sum_{i\in G_j} K_\lambda(x_0, x_i)}{\sum_{i=1}^N K_\lambda(x_0, x_i)} = \hat f_j(x_0),
\]
so the fitted posterior probabilities (6.20) coincide exactly with the smoothed indicators \(\hat f_j(x_0)\). Summing these equations over \(j\) gives \(1/\big(1+\sum_{k=1}^{K-1} e^{\hat\beta_{k0}}\big) = \hat f_K(x_0)\), and hence
\[
\hat\beta_{j0} = \log\frac{\sum_{i\in G_j} K_\lambda(x_0, x_i)}{\sum_{i\in G_K} K_\lambda(x_0, x_i)},
\]
which is a non-decreasing function of \(\sum_{i\in G_j} K_\lambda(x_0, x_i)\), with the same denominator for every \(j\).
The proof is now complete.
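The equivalence can also be checked numerically. Below is a minimal sketch that maximizes the kernel-weighted, locally constant log-likelihood over the intercepts \(\beta_{k0}\) and compares the fitted probabilities (6.20) with the smoothed indicators; the Gaussian kernel, the use of scipy's general-purpose optimizer, and the names (`neg_local_loglik`, `p_hat`, `f_hat`) are illustrative assumptions of ours.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
g = rng.integers(0, 3, size=200)        # classes coded 0, ..., K-1; the last class is the reference
x0, K, lam = np.zeros(2), 3, 0.5
w = np.exp(-0.5 * np.sum((X - x0) ** 2, axis=1) / lam ** 2)   # K_lambda(x0, x_i), Gaussian here

def neg_local_loglik(b):
    """Negative locally constant log-likelihood l(beta, x0); b holds beta_10, ..., beta_{K-1,0}."""
    beta0 = np.append(b, 0.0)            # beta_{K0} = 0 for the reference class
    return -(np.sum(w * beta0[g]) - w.sum() * np.log1p(np.sum(np.exp(b))))

beta_hat = minimize(neg_local_loglik, np.zeros(K - 1)).x

p_hat = np.exp(np.append(beta_hat, 0.0))
p_hat /= p_hat.sum()                     # fitted posterior probabilities, as in (6.20)

f_hat = np.array([w[g == j].sum() for j in range(K)]) / w.sum()  # Nadaraya-Watson smoothed indicators
print(np.max(np.abs(p_hat - f_hat)))     # close to zero: the two procedures coincide at x0
```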