Ex. 8.5

Suggest generalizations of each of the loss functions in Figure 10.4 to more than two classes, and design an appropriate plot to compare them.

Soln. 8.5

Following the idea of multi-class AdaBoost (see also Ex. 10.5), for a \(K\)-class classification problem, consider the coding \(Y=(Y_1,...,Y_K)^T\) with

\[\begin{equation} Y_k = \begin{cases} 1, & \text{ if } G=\mathcal{G}_k\\ -\frac{1}{K-1}, & \text{ otherwise. } \end{cases}\non \end{equation}\]
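This coding is easy to realize programmatically. Here is a minimal NumPy sketch (the helper name `coding` and the 0-based class index are our choices, not from the text):

```python
import numpy as np

def coding(k, K):
    """Code class k (0-based): 1 in position k, -1/(K-1) elsewhere."""
    y = np.full(K, -1.0 / (K - 1))
    y[k] = 1.0
    return y

coding(0, 3)  # array([ 1. , -0.5, -0.5]); note the entries sum to 0
```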

Let \(f=(f_1,...,f_K)^T\) with \(\sum_{k=1}^Kf_k=0\). The exponential loss is defined by

\[\begin{equation} L(Y,f) = \exp\left(-\frac{1}{K}Y^Tf\right).\non \end{equation}\]

Similarly, the multinomial deviance loss is defined by

\[\begin{equation} L(Y,f) = \log\left(1+\exp\left(-\frac{1}{K}Y^Tf\right)\right).\non \end{equation}\]
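Both losses depend on \(f\) only through the generalized margin \(Y^Tf\). A small sketch, reusing the hypothetical `coding` helper above:

```python
import numpy as np

def exp_loss(Y, f, K):
    """Multi-class exponential loss exp(-Y^T f / K)."""
    return np.exp(-(Y @ f) / K)

def deviance_loss(Y, f, K):
    """Multinomial-deviance-style loss log(1 + exp(-Y^T f / K))."""
    return np.log1p(np.exp(-(Y @ f) / K))

K = 3
Y = coding(0, K)
f = coding(0, K)  # a correct, coded prediction, so Y^T f = K/(K-1) = 1.5
exp_loss(Y, f, K), deviance_loss(Y, f, K)  # (~0.607, ~0.475)
```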

For misclassification loss, we can further restrict \(f\) to have the same coding as \(Y\), that is,

\[\begin{equation} f_k = \begin{cases} 1, & \text{ if } G=\mathcal{G}_k\\ -\frac{1}{K-1}, & \text{ otherwise. } \end{cases}\non \end{equation}\]

Under this restriction, \(Y^Tf = \frac{K}{K-1} > 0\) when the prediction is correct, and \(Y^Tf = -\frac{K}{(K-1)^2} < 0\) otherwise; when \(K=2\), the condition \(Y^Tf < 0\) coincides with the usual misclassification condition \(yf < 0\) for the \(\pm 1\) coding. Therefore, we can let the loss be

\[\begin{equation} L(Y,f) = \textbf{1}(Y^Tf < 0).\non \end{equation}\]
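A quick numerical check of the two possible values of \(Y^Tf\) under this restriction (again using the `coding` helper from above; \(K=4\) is arbitrary):

```python
# With f coded like Y, Y^T f takes exactly two values:
#   K/(K-1)    > 0  if the prediction is correct,
#  -K/(K-1)^2  < 0  otherwise,
# so 1(Y^T f < 0) is precisely the misclassification indicator.
K = 4
Y = coding(0, K)
print(Y @ coding(0, K))  #  1.333... =  4/3  (correct class)
print(Y @ coding(2, K))  # -0.444... = -4/9  (wrong class)
```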

Similarly, the squared error would be

\[\begin{equation} L(Y, f) = \|Y-f\|_2^2.\non \end{equation}\]

Note that \(\|Y\|_2^2 = \frac{K}{K-1}\) is constant, so \(\|Y-f\|_2^2 = \frac{K}{K-1} - 2Y^Tf + \|f\|_2^2\): for fixed \(\|f\|_2\), this is again a decreasing function of \(Y^Tf\).

The support vector (hinge) loss is

\[\begin{equation} L(Y, f) = (1-Y^Tf)_+.\non \end{equation}\]

As for the plot, it suffices to change the \(x\)-axis in Figure 10.4 from the binary margin \(yf\) to the generalized margin \(Y^Tf\), and plot each of the losses above as a function of \(Y^Tf\).
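A minimal matplotlib sketch of such a plot, assuming \(K=3\) and moving \(f\) along the line through the codings of the true class and one wrong class (every point on this line keeps \(\sum_k f_k=0\); the `coding` helper is defined above):

```python
import numpy as np
import matplotlib.pyplot as plt

K = 3
Y = coding(0, K)  # true class

# Parametrize f = t * coding(0) + (1 - t) * coding(1); t > 1 extends the
# margin past the correct coding, t < 0 past the wrong one.
ts = np.linspace(-1.5, 2.0, 400)
F = np.outer(ts, coding(0, K)) + np.outer(1 - ts, coding(1, K))
m = F @ Y  # generalized margin Y^T f, increasing in t

losses = {
    "Misclassification": (m < 0).astype(float),
    "Exponential": np.exp(-m / K),
    "Multinomial deviance": np.log1p(np.exp(-m / K)),
    "Squared error": np.sum((Y - F) ** 2, axis=1),
    "Support vector": np.maximum(0.0, 1.0 - m),
}
for name, loss in losses.items():
    plt.plot(m, loss, label=name)
plt.xlabel(r"$Y^T f$")
plt.ylabel("Loss")
plt.legend()
plt.show()
```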