Ex. 6.5
Show that fitting a locally constant multinomial logit model of the form (6.19) amounts to smoothing the binary response indicators for each class separately using a Nadaraya-Watson kernel smoother with kernel weights \(K_\lambda(x_0, x_i)\).
Soln. 6.5
Suppose we smooth the binary response indicators for each class separately using a Nadaraya-Watson kernel smoother; recall (6.2) in the text. For class \(j\in \{1,\dots,K\}\), the binary response is \(y_{ij}=1\) if and only if \(i\in G_j\), where \(G_j\) is the set of indices of the observations in class \(j\). Then we have
\[
\hat f_j(x_0) = \frac{\sum_{i=1}^N K_\lambda(x_0, x_i)\, y_{ij}}{\sum_{i=1}^N K_\lambda(x_0, x_i)} = \frac{\sum_{i\in G_j} K_\lambda(x_0, x_i)}{\sum_{i=1}^N K_\lambda(x_0, x_i)}.
\]
Since the denominator does not depend on \(j\), this means that we classify \(x_0\) to the class \(j\) that maximizes \(\sum_{i\in G_j} K_\lambda(x_0, x_i)\).
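As a quick illustration of this smooth-then-classify rule, here is a minimal numerical sketch; the Gaussian kernel, the simulated data, and the names (`nw_class_probs`, `f_hat`) are our own choices for illustration and are not part of the text.

```python
import numpy as np

def nw_class_probs(X, g, x0, lam, K):
    """Nadaraya-Watson smoothing of the class indicators y_ij = 1{g_i = j}:
    returns the kernel-weighted proportion of each class around x0."""
    # K_lambda(x0, x_i); a Gaussian kernel is assumed purely for illustration
    w = np.exp(-0.5 * np.sum((X - x0) ** 2, axis=1) / lam ** 2)
    return np.array([w[g == j].sum() for j in range(K)]) / w.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
g = rng.integers(0, 3, size=200)          # class labels coded 0, ..., K-1
f_hat = nw_class_probs(X, g, x0=np.zeros(2), lam=0.5, K=3)
print(f_hat, f_hat.argmax())              # classify x0 to the largest smoothed indicator
```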
On the other hand, consider the locally constant multinomial logit model (6.19) in the text. From (6.20) in the text, the fitted posterior probabilities are
\[
\hat{\Pr}(G=j\mid x_0) = \frac{e^{\hat\beta_{j0}(x_0)}}{1+\sum_{k=1}^{K-1} e^{\hat\beta_{k0}(x_0)}},
\]
so we classify \(x_0\) to the class \(j\) that maximizes \(\hat\beta_{j0}\) (with \(\hat\beta_{K0}\equiv 0\) for the reference class \(K\)). It thus suffices to show that \(\hat\beta_{j0}\) is the same non-decreasing function of \(\sum_{i\in G_j} K_\lambda(x_0, x_i)\) for every class \(j\).
Let \(\beta\) denote the parameter set \(\{\beta_{k0}, k=1,\dots,K-1\}\); since the model is locally constant, the slope terms in (6.19) are dropped, and we set \(\beta_{K0}=0\) for the reference class. Coding the classes from \(1\) to \(K\) and writing \(g_i\) for the class of observation \(i\), the local log-likelihood \(l(\beta, x_0)\) can be rewritten as (see, e.g., Ex. 4.4)
\[
l(\beta, x_0) = \sum_{i=1}^N K_\lambda(x_0, x_i)\left[\beta_{g_i 0} - \log\Big(1+\sum_{k=1}^{K-1} e^{\beta_{k0}}\Big)\right].
\]
To maximize the log-likelihood, we set its derivatives to zero and solve the resulting equations for \(\hat\beta_{k0}\). These equations, for \(j=1,\dots,K-1\), are
\[
\frac{\partial l(\beta, x_0)}{\partial \beta_{j0}} = \sum_{i\in G_j} K_\lambda(x_0, x_i) - \frac{e^{\beta_{j0}}}{1+\sum_{k=1}^{K-1} e^{\beta_{k0}}}\sum_{i=1}^N K_\lambda(x_0, x_i) = 0.
\]
Therefore we know that, for \(j=1,\dots,K-1\),
\[
\frac{e^{\hat\beta_{j0}}}{1+\sum_{k=1}^{K-1} e^{\hat\beta_{k0}}} = \frac{\sum_{i\in G_j} K_\lambda(x_0, x_i)}{\sum_{i=1}^N K_\lambda(x_0, x_i)} = \hat f_j(x_0),
\]
so the fitted posterior probabilities (6.20) coincide exactly with the smoothed indicators \(\hat f_j(x_0)\). Summing these equations over \(j\) gives \(1/\big(1+\sum_{k=1}^{K-1} e^{\hat\beta_{k0}}\big) = \hat f_K(x_0)\), and hence
\[
\hat\beta_{j0} = \log\frac{\sum_{i\in G_j} K_\lambda(x_0, x_i)}{\sum_{i\in G_K} K_\lambda(x_0, x_i)},
\]
which is a non-decreasing function of \(\sum_{i\in G_j} K_\lambda(x_0, x_i)\), with the same denominator for every \(j\).
The proof is now complete.
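The equivalence can also be checked numerically. Below is a minimal sketch that maximizes the kernel-weighted, locally constant log-likelihood over the intercepts \(\beta_{k0}\) and compares the fitted probabilities (6.20) with the smoothed indicators; the Gaussian kernel, the use of scipy's general-purpose optimizer, and the names (`neg_local_loglik`, `p_hat`, `f_hat`) are illustrative assumptions of ours.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
g = rng.integers(0, 3, size=200)        # classes coded 0, ..., K-1; the last class is the reference
x0, K, lam = np.zeros(2), 3, 0.5
w = np.exp(-0.5 * np.sum((X - x0) ** 2, axis=1) / lam ** 2)   # K_lambda(x0, x_i), Gaussian here

def neg_local_loglik(b):
    """Negative locally constant log-likelihood l(beta, x0); b holds beta_10, ..., beta_{K-1,0}."""
    beta0 = np.append(b, 0.0)            # beta_{K0} = 0 for the reference class
    return -(np.sum(w * beta0[g]) - w.sum() * np.log1p(np.sum(np.exp(b))))

beta_hat = minimize(neg_local_loglik, np.zeros(K - 1)).x

p_hat = np.exp(np.append(beta_hat, 0.0))
p_hat /= p_hat.sum()                     # fitted posterior probabilities, as in (6.20)

f_hat = np.array([w[g == j].sum() for j in range(K)]) / w.sum()  # Nadaraya-Watson smoothed indicators
print(np.max(np.abs(p_hat - f_hat)))     # close to zero: the two procedures coincide at x0
```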