Ex. 10.10
Show that for \(K=2\) class classification, only one tree needs to be grown at each gradient-boosting iteration.
Soln. 10.10
For \(K\)-class classification, the loss function (for a single training sample) is the multinomial deviance
\[\begin{eqnarray}
L(y, p(x)) &=& -\sum_{k=1}^K\bb{1}(y=G_k)\log p_k(x)\non\\
&=&-\sum_{k=1}^K\bb{1}(y=G_k)f_k(x) +\log\left(\sum_{l=1}^Ke^{f_l(x)}\right),\non
\end{eqnarray}\]
where
\[\begin{equation}
p_k(x) = \frac{e^{f_k(x)}}{\sum_{l=1}^Ke^{f_l(x)}}.\non
\end{equation}\]
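The second line of the deviance follows by substituting this expression for \(p_k(x)\) and using the fact that the indicators sum to one, \(\sum_{k=1}^K\bb{1}(y=G_k)=1\):
\[\begin{eqnarray}
-\sum_{k=1}^K\bb{1}(y=G_k)\log p_k(x) &=& -\sum_{k=1}^K\bb{1}(y=G_k)\left(f_k(x) - \log\sum_{l=1}^Ke^{f_l(x)}\right)\non\\
&=& -\sum_{k=1}^K\bb{1}(y=G_k)f_k(x) + \log\left(\sum_{l=1}^Ke^{f_l(x)}\right).\non
\end{eqnarray}\]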
At each iteration, \(K\) least-squares trees are constructed, each fit to its respective negative gradient
\[\begin{equation}
\bb{1}(y=G_k) - p_k(x) \text{ for } k=1,...,K.\non
\end{equation}\]
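This negative gradient is obtained by differentiating the deviance with respect to \(f_k(x)\):
\[\begin{eqnarray}
-\frac{\partial L(y, p(x))}{\partial f_k(x)} &=& \bb{1}(y=G_k) - \frac{e^{f_k(x)}}{\sum_{l=1}^Ke^{f_l(x)}}\non\\
&=& \bb{1}(y=G_k) - p_k(x).\non
\end{eqnarray}\]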
When \(K=2\), we have \(p_1(x) + p_2(x) = 1\) and \(\bb{1}(y=G_1) + \bb{1}(y=G_2) = 1\). The first least-squares tree \(T_1\) is fit to
\[\begin{equation}
\bb{1}(y=G_1) - p_1(x),\non
\end{equation}\]
which is the negative of the target of the second tree \(T_2\):
\[\begin{equation}
\bb{1}(y=G_2) - p_2(x).\non
\end{equation}\]
To see this, note that
\[\begin{equation}
\bb{1}(y=G_1) - p_1(x) = (1-\bb{1}(y=G_2)) - (1-p_2(x)) = -(\bb{1}(y=G_2) - p_2(x)).\non
\end{equation}\]
Therefore, once \(T_1\) has been grown, the least-squares tree fit to the negated targets has the same splits with negated leaf values, so \(T_2\) is obtained from \(T_1\) simply by flipping the sign of its fitted values, and only one tree needs to be grown per iteration.
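As a sanity check, here is a minimal numerical sketch (not part of the original solution) of a single boosting iteration for \(K=2\), using scikit-learn's DecisionTreeRegressor as the least-squares base learner. The data, tree depth, and variable names are illustrative assumptions.

```python
# Minimal sketch: one gradient-boosting iteration for K = 2, showing that the
# class-2 tree is just the class-1 tree with its sign flipped.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)  # 1 codes G_1, 0 codes G_2

# Current scores f_k(x) (initialised to zero) and softmax probabilities p_k(x).
f = np.zeros((200, 2))
p = np.exp(f) / np.exp(f).sum(axis=1, keepdims=True)

# Negative gradients 1(y = G_k) - p_k(x) for k = 1, 2.
r1 = (y == 1).astype(float) - p[:, 0]
r2 = (y == 0).astype(float) - p[:, 1]
assert np.allclose(r2, -r1)  # the two regression targets are exact negatives

# Grow a single least-squares tree T_1 on r1.
T1 = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, r1)

# A tree grown on r2 = -r1 finds the same splits with negated leaf values,
# so its predictions are simply -T1.predict(X); no second tree is needed.
T2 = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, r2)
assert np.allclose(T2.predict(X), -T1.predict(X))
```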