Ex. 10.5

Multiclass exponential loss (multi-class AdaBoost). For a $K$-class classification problem, consider the coding $Y = (Y_1,\ldots,Y_K)^T$ with

$$Y_k = \begin{cases} 1 & \text{if } G = G_k,\\[2pt] -\dfrac{1}{K-1} & \text{otherwise.} \end{cases}$$

Let $f = (f_1,\ldots,f_K)^T$ with $\sum_{k=1}^K f_k = 0$, and define

$$L(Y, f) = \exp\left(-\frac{1}{K}\,Y^T f\right).$$
(a)

Using Lagrange multipliers, derive the population minimizer $f^*$ of $L(Y,f)$, subject to the zero-sum constraint, and relate it to the class probabilities.

(b)

Show that multiclass boosting using this loss function leads to a reweighting algorithm similar to AdaBoost, as in Section 10.4.

Soln. 10.5
(a)

We follow arguments similar to those in Ex. 10.2. We are interested in

$$\operatorname*{arg\,min}_{f(x)}\; \mathbb{E}_{Y|x}\left[\exp\left(-\frac{1}{K}\big(Y_1 f_1(x) + \cdots + Y_K f_K(x)\big)\right)\right]$$

subject to

$$f_1(x) + \cdots + f_K(x) = 0.$$

The Lagrangian, denoted $H(f_1,\ldots,f_K,\lambda)$, becomes

$$\begin{aligned}
H &= P(Y=1|x)\exp\left(-\frac{1}{K}\Big(f_1(x) - \frac{1}{K-1}\big(f_2(x)+\cdots+f_K(x)\big)\Big)\right)\\
&\quad + P(Y=2|x)\exp\left(-\frac{1}{K}\Big(f_2(x) - \frac{1}{K-1}\big(f_1(x)+f_3(x)+\cdots+f_K(x)\big)\Big)\right)\\
&\quad + \cdots + P(Y=K|x)\exp\left(-\frac{1}{K}\Big(f_K(x) - \frac{1}{K-1}\big(f_1(x)+f_2(x)+\cdots+f_{K-1}(x)\big)\Big)\right)\\
&\quad + \lambda\big(f_1(x)+\cdots+f_K(x)\big). \tag{1}
\end{aligned}$$

Note that $f_1(x)+\cdots+f_K(x)=0$, so (1) simplifies to

$$P(Y=1|x)\exp\left(-\frac{f_1(x)}{K-1}\right) + P(Y=2|x)\exp\left(-\frac{f_2(x)}{K-1}\right) + \cdots + P(Y=K|x)\exp\left(-\frac{f_K(x)}{K-1}\right) + \lambda\big(f_1(x)+\cdots+f_K(x)\big). \tag{2}$$

For $k=1,\ldots,K$, setting $\frac{\partial H}{\partial f_k(x)}=0$ and $\frac{\partial H}{\partial \lambda}=0$ yields

$$\frac{1}{K-1}P(Y=k|x)\exp\left(-\frac{f_k(x)}{K-1}\right) - \lambda = 0, \quad k=1,\ldots,K,$$
$$f_1(x)+\cdots+f_K(x)=0.$$

Note that we have $K+1$ equations in $K+1$ unknowns. We first express $f_k(x)$ in terms of $\lambda$ using the first set of equations above:

$$f_k(x) = -(K-1)\log\left(\frac{(K-1)\lambda}{P(Y=k|x)}\right) = (K-1)\log P(Y=k|x) - (K-1)\log\big((K-1)\lambda\big). \tag{3}$$

Then by $\sum_{k=1}^K f_k(x) = 0$ we have

$$\sum_{k=1}^K \Big((K-1)\log P(Y=k|x) - (K-1)\log\big((K-1)\lambda\big)\Big) = 0,$$

so that

$$\lambda = \frac{1}{K-1}\prod_{k=1}^K P(Y=k|x)^{1/K}. \tag{4}$$

Plugging $\lambda$ back into (3), we obtain for $k=1,\ldots,K$,

$$f_k(x) = (K-1)\log P(Y=k|x) - \frac{K-1}{K}\sum_{\tilde{k}=1}^K \log P(Y=\tilde{k}|x).$$

Once $f_k(x)$ are estimated for $k=1,\ldots,K$, we can recover $P(Y=k|x)$ in terms of $f_k(x)$. From (3) and (4), we obtain for $k=1,\ldots,K$,

$$P(Y=k|x) = \exp\left(\frac{f_k(x)}{K-1}\right)\prod_{j=1}^K P(Y=j|x)^{1/K}. \tag{5}$$

Summing $P(Y=k|x)$ over $k=1,\ldots,K$ we obtain

$$1 = \left(\prod_{j=1}^K P(Y=j|x)^{1/K}\right)\left(\sum_{k=1}^K \exp\left(\frac{f_k(x)}{K-1}\right)\right),$$

thus

$$\prod_{j=1}^K P(Y=j|x)^{1/K} = \left(\sum_{k=1}^K \exp\left(\frac{f_k(x)}{K-1}\right)\right)^{-1}. \tag{6}$$

Plugging (6) back into (5) we get

$$P(Y=k|x) = \frac{\exp\left(\frac{f_k(x)}{K-1}\right)}{\sum_{l=1}^K \exp\left(\frac{f_l(x)}{K-1}\right)}.$$
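The closed form above is easy to sanity-check numerically: starting from arbitrary class probabilities, the minimizer $f_k(x) = (K-1)\big(\log P(Y=k|x) - \frac{1}{K}\sum_{\tilde k}\log P(Y=\tilde k|x)\big)$ should satisfy the zero-sum constraint, and the softmax-type expression should recover the probabilities. A minimal sketch ($K$ and the probabilities are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4
p = rng.random(K)
p /= p.sum()                    # arbitrary class probabilities P(Y=k|x)

# Population minimizer from the derivation above:
# f_k = (K-1) * (log p_k - (1/K) * sum_l log p_l)
f = (K - 1) * (np.log(p) - np.log(p).mean())

assert abs(f.sum()) < 1e-12     # zero-sum constraint: sum_k f_k = 0

# Invert via the softmax-type formula:
# p_k = exp(f_k/(K-1)) / sum_l exp(f_l/(K-1))
p_back = np.exp(f / (K - 1))
p_back /= p_back.sum()
assert np.allclose(p_back, p)   # recovers P(Y=k|x)
```

Note that $f$ is invariant to a common rescaling of the probabilities, consistent with (3) determining $f_k$ only up to the normalization fixed by the constraint.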
(b)

Following the idea of two-class AdaBoost, we start with Stagewise Additive Modeling using a Multi-class Exponential loss function (SAMME).

SAMME

1. Initialize the observation weights $w_i = 1/N$, $i=1,\ldots,N$.
2. For $m = 1$ to $M$:
   (a) Fit a classifier $T^{(m)}(x)$ to the training data using weights $w_i$.
   (b) Compute $\text{err}_m = \sum_{i=1}^N w_i I\big(c_i \ne T^{(m)}(x_i)\big) \big/ \sum_{i=1}^N w_i$.
   (c) Compute $\alpha^{(m)} = \log\dfrac{1-\text{err}_m}{\text{err}_m} + \log(K-1)$.
   (d) Set $w_i \leftarrow w_i \cdot \exp\big(\alpha^{(m)} I(c_i \ne T^{(m)}(x_i))\big)$, $i=1,\ldots,N$.
   (e) Re-normalize $w_i$.
3. Output $C(x) = \operatorname*{arg\,max}_k \sum_{m=1}^M \alpha^{(m)} I\big(T^{(m)}(x) = k\big)$.

Note that SAMME shares the same simple modular structure as AdaBoost, with a subtle difference in step (c): the extra term $\log(K-1)$. When $K=2$, this term vanishes and SAMME reduces to two-class AdaBoost.

The link between the exponential loss function and SAMME follows the same arguments as in Section 10.4. Specifically, analogous to (10.12), we have

$$\beta_m = \frac{(K-1)^2}{K}\left(\log\frac{1-\text{err}_m}{\text{err}_m} + \log(K-1)\right).$$
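To make the reweighting concrete, here is a self-contained sketch of the SAMME loop with exhaustive decision stumps as base learners on a toy three-class problem (the stump learner, the toy data, and the round count are illustrative choices, not part of the exercise):

```python
import numpy as np

def fit_stump(X, y, w, K):
    # Exhaustive weighted decision stump: for each feature/threshold,
    # predict the weighted-majority class on each side of the split.
    best_err, best = np.inf, None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            if left.all():
                continue
            c_l = np.bincount(y[left], weights=w[left], minlength=K).argmax()
            c_r = np.bincount(y[~left], weights=w[~left], minlength=K).argmax()
            pred = np.where(left, c_l, c_r)
            err = w[pred != y].sum()
            if err < best_err:
                best_err, best = err, (j, t, c_l, c_r)
    return best, best_err

def stump_predict(stump, X):
    j, t, c_l, c_r = stump
    return np.where(X[:, j] <= t, c_l, c_r)

def samme(X, y, K, M=20):
    n = len(y)
    w = np.full(n, 1.0 / n)                    # step 1: uniform weights
    ensemble = []
    for _ in range(M):
        stump, err = fit_stump(X, y, w, K)     # steps 2(a)-(b)
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = np.log((1 - err) / err) + np.log(K - 1)  # step 2(c)
        if alpha <= 0:                         # no better than chance; stop
            break
        miss = stump_predict(stump, X) != y
        w = w * np.exp(alpha * miss)           # step 2(d): upweight mistakes
        w /= w.sum()                           # step 2(e): renormalize
        ensemble.append((alpha, stump))
    return ensemble

def samme_predict(ensemble, X, K):
    votes = np.zeros((len(X), K))
    for alpha, stump in ensemble:
        votes[np.arange(len(X)), stump_predict(stump, X)] += alpha
    return votes.argmax(axis=1)

# Toy data: three classes given by intervals of the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 2))
y = np.digitize(X[:, 0], [-0.4, 0.4])          # classes 0, 1, 2
model = samme(X, y, K=3)
acc = (samme_predict(model, X, K=3) == y).mean()
```

The $\log(K-1)$ term in step 2(c) means $\alpha^{(m)} > 0$ whenever $\text{err}_m < (K-1)/K$, i.e. whenever the weak learner beats random guessing among $K$ classes, rather than the $1/2$ threshold of two-class AdaBoost.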