Ex. 10.8

Consider a \(K\)-class problem where the targets \(y_{ik}\) are coded as 1 if observation \(i\) is in class \(k\) and zero otherwise. Suppose we have a current model \(f_k(x), k=1,...,K\), with \(\sum_{k=1}^Kf_k(x)=0\) (see (10.21) in Section 10.6). We wish to update the model for observations in a region \(R\) in predictor space, by adding constants \(f_k(x)+\gamma_k\), with \(\gamma_K=0\).

(a) Write down the multinomial log-likelihood for this problem, and its first and second derivatives.

(b) Using only the diagonal of the Hessian matrix in (a), and starting from \(\gamma_k=0 \forall k\), show that a one-step approximate Newton update for \(\gamma_k\) is

\[\begin{equation} \gamma_k^1 = \frac{\sum_{x_i\in R}(y_{ik}-p_{ik})}{\sum_{x_i\in R}p_{ik}(1-p_{ik})}, k =1,...,K-1,\non \end{equation}\]

where \(p_{ik}=\exp(f_k(x_i))/\sum_{l=1}^K\exp(f_l(x_i))\).

(c) We prefer our update to sum to zero, as the current model does. Using symmetry arguments, show that

\[\begin{equation} \hat{\gamma}_k = \frac{K-1}{K}(\gamma_k^1 - \frac{1}{K}\sum_{l=1}^K\gamma_l^1), k= 1,...,K \non \end{equation}\]

is an appropriate update, where \(\gamma_k^1\) is defined as in (10.57) for all \(k=1,...,K\).

Soln. 10.8

(a) The multinomial log-likelihood for a single training sample is

\[\begin{equation} L(y, p(x), \gamma) = \sum_{k=1}^K\bb{1}(y=G_k)(f_k(x) + \gamma_k) - \log\left(\sum_{j=1}^Ke^{f_j(x)+\gamma_j}\right).\non \end{equation}\]
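
As a quick sanity check, here is a minimal Python sketch (not part of the original solution) that evaluates this per-sample log-likelihood; the arrays `f` and `gamma` and the class index `y` are purely illustrative.

```python
import numpy as np

# Minimal sketch: per-sample multinomial log-likelihood
#   L(y, p(x), gamma) = sum_k 1(y = G_k)(f_k(x) + gamma_k) - log(sum_j exp(f_j(x) + gamma_j))
# `f` and `gamma` are hypothetical length-K arrays of class scores and offsets,
# and `y` is the index of the observed class.
def log_likelihood(y, f, gamma):
    g = f + gamma
    return g[y] - np.log(np.sum(np.exp(g)))

f = np.array([0.5, -0.2, -0.3])   # scores summing to zero, as in (10.21)
gamma = np.zeros(3)               # starting point gamma_k = 0
print(log_likelihood(0, f, gamma))
```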

For \(k=1,...,K-1\), we have

\[\begin{equation} \frac{\partial L(y, p(x),\gamma)}{\partial \gamma_k} = \bb{1}(y=G_k) - \frac{e^{f_k(x)+\gamma_k}}{\sum_{j=1}^K e^{f_j(x)+\gamma_j}},\non \end{equation}\]

and

\[\begin{equation} \frac{\partial^2 L(y, p(x),\gamma)}{\partial \gamma_k\partial \gamma_k} = -\frac{e^{f_k(x)+\gamma_k}}{\sum_{j=1}^K e^{f_j(x)+\gamma_j}} + \frac{e^{2f_k(x)+2\gamma_k}}{(\sum_{j=1}^K e^{f_j(x)+\gamma_j})^2}.\non \end{equation}\]

Note that this is the diagonal of the Hessian matrix; writing \(p_k = e^{f_k(x)+\gamma_k}/\sum_{j=1}^K e^{f_j(x)+\gamma_j}\), it equals \(-p_k(1-p_k)\).

For \(k, k'\in \{1,...,K-1\}\) with \(k\neq k'\), we have

\[\begin{equation} \frac{\partial^2 L(y, p(x),\gamma)}{\partial \gamma_k\partial \gamma_{k'}} = \frac{e^{f_k(x)+\gamma_k}e^{f_{k'}(x)+\gamma_{k'}}}{(\sum_{j=1}^K e^{f_j(x)+\gamma_j})^2}.\non \end{equation}\]
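
To guard against sign mistakes, these derivatives can be checked numerically. The following sketch (my own, with illustrative values of `f`, `gamma`, and `y`) compares finite differences against the analytic expressions \(\bb{1}(y=G_k)-p_k\), \(-p_k(1-p_k)\), and \(p_kp_{k'}\).

```python
import numpy as np

# Sketch: finite-difference check of the first and second derivatives derived above.
def log_likelihood(y, f, gamma):
    g = f + gamma
    return g[y] - np.log(np.sum(np.exp(g)))

K, y, eps = 3, 0, 1e-4
f = np.array([0.5, -0.2, -0.3])
gamma = np.zeros(K)
p = np.exp(f + gamma) / np.sum(np.exp(f + gamma))   # current class probabilities p_k

for k in range(K):
    gp, gm = gamma.copy(), gamma.copy()
    gp[k] += eps
    gm[k] -= eps
    # first derivative: should match 1(y = G_k) - p_k
    num_grad = (log_likelihood(y, f, gp) - log_likelihood(y, f, gm)) / (2 * eps)
    print(num_grad, float(y == k) - p[k])
    # diagonal second derivative: should match -p_k (1 - p_k)
    num_diag = (log_likelihood(y, f, gp) - 2 * log_likelihood(y, f, gamma)
                + log_likelihood(y, f, gm)) / eps**2
    print(num_diag, -p[k] * (1 - p[k]))

# off-diagonal second derivative for k = 0, k' = 1: should match p_0 * p_1
gpp = gamma.copy(); gpp[[0, 1]] += eps
gmm = gamma.copy(); gmm[[0, 1]] -= eps
gpm = gamma.copy(); gpm[0] += eps; gpm[1] -= eps
gmp = gamma.copy(); gmp[0] -= eps; gmp[1] += eps
num_cross = (log_likelihood(y, f, gpp) - log_likelihood(y, f, gpm)
             - log_likelihood(y, f, gmp) + log_likelihood(y, f, gmm)) / (4 * eps**2)
print(num_cross, p[0] * p[1])
```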

(b) By Newton's method we have

\[\begin{equation} \gamma_k^1 = \gamma_k^0 - \frac{\partial L(y, p(x),\gamma_k^0)}{\partial \gamma_k}\Bigg/\frac{\partial^2 L(y, p(x),\gamma_k^0)}{\partial \gamma_k\partial \gamma_k}.\non \end{equation}\]

Note that we need to sum the first and second derivatives obtained in (a) over all \(x_i \in R\). Starting from \(\gamma_k^0=0\), we get

\[\begin{eqnarray} \gamma_k^1 &=& \gamma_k^0 - \sum_{x_i\in R}\frac{\partial L(y_i, p(x_i),\gamma_k^0)}{\partial \gamma_k}\Bigg/\sum_{x_i\in R}\frac{\partial^2 L(y_i, p(x_i),\gamma_k^0)}{\partial \gamma_k\partial \gamma_k}\non\\ &=& 0 - \sum_{x_i\in R}\left(\bb{1}(y_i=G_k) - \frac{e^{f_k(x_i)}}{\sum_{j=1}^K e^{f_j(x_i)}}\right)\Bigg/\sum_{x_i\in R}\left(-\frac{e^{f_k(x_i)}}{\sum_{j=1}^K e^{f_j(x_i)}} + \frac{e^{2f_k(x_i)}}{(\sum_{j=1}^K e^{f_j(x_i)})^2}\right)\non\\ &=&\frac{\sum_{x_i\in R}(y_{ik}-p_{ik})}{\sum_{x_i\in R}p_{ik}(1-p_{ik})},\non \end{eqnarray}\]

where \(p_{ik} = e^{f_k(x_i)}/\sum_{j=1}^Ke^{f_j(x_i)}\).
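
For concreteness, here is a small numpy sketch of the one-step update \(\gamma_k^1\) computed over a region \(R\); the data are randomly generated and purely illustrative.

```python
import numpy as np

# Sketch: the one-step Newton update gamma_k^1 over a region R.
# `F` holds hypothetical current scores f_k(x_i) for the n observations in R,
# and `Y` is the corresponding one-hot target matrix y_ik.
rng = np.random.default_rng(0)
n, K = 8, 3
F = rng.normal(size=(n, K))
F -= F.mean(axis=1, keepdims=True)                 # enforce sum_k f_k(x_i) = 0
Y = np.eye(K)[rng.integers(K, size=n)]             # one-hot targets y_ik

P = np.exp(F) / np.exp(F).sum(axis=1, keepdims=True)          # p_ik
gamma1 = (Y - P).sum(axis=0) / (P * (1 - P)).sum(axis=0)      # update from (b)
print(gamma1)
```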

(c) Given \(\gamma_k^1\) in part (b), we only need to subtract the average \(\frac{1}{K}\sum_{l=1}^K\gamma_l^1\) from each \(\gamma_k^1\) to make them sum to zero. Thus we have

\[\begin{equation} \hat\gamma_k = \gamma_k^1-\frac{1}{K}\sum_{l=1}^K\gamma_l^1.\non \end{equation}\]

We can, of course, rescale \(\hat\gamma_k\) by any constant \(\alpha > 0\) without violating the sum-to-zero constraint; the textbook chooses \(\alpha = \frac{K-1}{K}\):

\[\begin{equation} \hat\gamma_k = \frac{K-1}{K}\left(\gamma_k^1-\frac{1}{K}\sum_{l=1}^K\gamma_l^1\right).\non \end{equation}\]

It's straightforward to verify that \(\sum_{k=1}^K\hat\gamma_k=0\).
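
A quick numerical illustration (with made-up values of \(\gamma_k^1\)) of the centering and \(\frac{K-1}{K}\) rescaling:

```python
import numpy as np

# Sketch: center and rescale some hypothetical one-step updates gamma_k^1
# as in (c), then verify that the result sums to zero.
K = 3
gamma1 = np.array([0.4, -0.1, 0.3])                 # illustrative values
gamma_hat = (K - 1) / K * (gamma1 - gamma1.mean())
print(gamma_hat, gamma_hat.sum())                   # sum is 0 up to rounding
```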