Ex. 4.2

Suppose we have features $x \in \mathbb{R}^p$, a two-class response, with class sizes $N_1, N_2$, and the target coded as $-N/N_1$, $N/N_2$.

(a)

Show that the LDA rule classifies to class 2 if

$$x^T\hat{\Sigma}^{-1}(\hat{\mu}_2 - \hat{\mu}_1) > \frac{1}{2}(\hat{\mu}_2 + \hat{\mu}_1)^T\hat{\Sigma}^{-1}(\hat{\mu}_2 - \hat{\mu}_1) - \log(N_2/N_1),\tag{1}$$

and class 1 otherwise.

(b)

Consider minimization of the least squares criterion

$$\sum_{i=1}^N \left(y_i - \beta_0 - x_i^T\beta\right)^2.$$

Show that the solution β^ satisfies

$$\left[(N-2)\hat{\Sigma} + N\hat{\Sigma}_B\right]\beta = N(\hat{\mu}_2 - \hat{\mu}_1)$$

(after simplification), where $\hat{\Sigma}_B = \frac{N_1N_2}{N^2}(\hat{\mu}_2 - \hat{\mu}_1)(\hat{\mu}_2 - \hat{\mu}_1)^T$.

(c)

Hence show that $\hat{\Sigma}_B\beta$ is in the direction $(\hat{\mu}_2 - \hat{\mu}_1)$ and thus

$$\hat{\beta} \propto \hat{\Sigma}^{-1}(\hat{\mu}_2 - \hat{\mu}_1).$$

Therefore the least-squares regression coefficient is identical to the LDA coefficient, up to a scalar multiple.

(d)

Show that this result holds for any (distinct) coding of the two classes.

(e)

Find the solution $\hat{\beta}_0$ (up to the same scalar multiple as in (c)), and hence the predicted value $\hat{f}(x) = \hat{\beta}_0 + x^T\hat{\beta}$. Consider the following rule: classify to class 2 if $\hat{f}(x) > 0$ and class 1 otherwise. Show this is not the same as the LDA rule unless the classes have equal numbers of observations.

(The use of multiple measurements in taxonomic problems, Pattern Recognition and Neural Networks)

Soln. 4.2
(a)

We have $\hat{\pi}_1 = N_1/N$ and $\hat{\pi}_2 = N_2/N$. By (4.9) in the textbook, we classify to class 2 when the estimated log posterior odds of class 2 versus class 1 are positive, that is, when

$$\log\frac{\hat{\pi}_2}{\hat{\pi}_1} - \frac{1}{2}(\hat{\mu}_2 + \hat{\mu}_1)^T\hat{\Sigma}^{-1}(\hat{\mu}_2 - \hat{\mu}_1) + x^T\hat{\Sigma}^{-1}(\hat{\mu}_2 - \hat{\mu}_1) > 0.$$

Since $\log(\hat{\pi}_2/\hat{\pi}_1) = \log(N_2/N_1)$, rearranging gives (1).
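As a quick sanity check (not part of the original solution), a small numpy sketch can confirm that rule (1) agrees with a direct comparison of the two linear discriminant functions; the simulated data, class sizes, and helper names `lda_rule_1` and `lda_by_discriminants` below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative two-class data; sizes and means are arbitrary choices.
N1, N2, p = 30, 70, 2
X1 = rng.normal(loc=0.0, size=(N1, p))
X2 = rng.normal(loc=1.5, size=(N2, p))
N = N1 + N2

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
# Pooled within-class covariance with the (N - 2) divisor used in this solution.
Sigma = ((X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)) / (N - 2)
Sigma_inv = np.linalg.inv(Sigma)

def lda_rule_1(x):
    """Classify to class 2 iff inequality (1) holds."""
    lhs = x @ Sigma_inv @ (mu2 - mu1)
    rhs = 0.5 * (mu2 + mu1) @ Sigma_inv @ (mu2 - mu1) - np.log(N2 / N1)
    return 2 if lhs > rhs else 1

def lda_by_discriminants(x):
    """Classify by comparing the two linear discriminant functions directly."""
    def delta(mu, pi):
        return x @ Sigma_inv @ mu - 0.5 * mu @ Sigma_inv @ mu + np.log(pi)
    return 2 if delta(mu2, N2 / N) > delta(mu1, N1 / N) else 1

X = np.vstack([X1, X2])
assert all(lda_rule_1(x) == lda_by_discriminants(x) for x in X)
```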

(b)

We start by introducing notation used in Chapter 3.

Let $x_i^T = (x_{i1}, \ldots, x_{ip}) \in \mathbb{R}^{1\times p}$, $\mathbf{1}^T = (1, \ldots, 1) \in \mathbb{R}^{1\times N}$, $Y^T = (y_1, \ldots, y_N) \in \mathbb{R}^{1\times N}$, $\beta^T = (\beta_1, \ldots, \beta_p) \in \mathbb{R}^{1\times p}$.

$$X = \begin{pmatrix}
1 & x_{11} & \cdots & x_{1p}\\
1 & x_{21} & \cdots & x_{2p}\\
\vdots & \vdots & & \vdots\\
1 & x_{N1} & \cdots & x_{Np}
\end{pmatrix}
= \begin{pmatrix}
1 & x_1^T\\
1 & x_2^T\\
\vdots & \vdots\\
1 & x_N^T
\end{pmatrix}\in\mathbb{R}^{N\times(p+1)}$$

and

$$X^T = \begin{pmatrix}
1 & 1 & \cdots & 1\\
x_{11} & x_{21} & \cdots & x_{N1}\\
\vdots & \vdots & & \vdots\\
x_{1p} & x_{2p} & \cdots & x_{Np}
\end{pmatrix}
= \begin{pmatrix}
1 & 1 & \cdots & 1\\
x_1 & x_2 & \cdots & x_N
\end{pmatrix}\in\mathbb{R}^{(p+1)\times N}.$$

So that we have

$$X^TX = \begin{pmatrix}
N & \sum_{i=1}^N x_i^T\\
\sum_{i=1}^N x_i & \sum_{i=1}^N x_i x_i^T
\end{pmatrix}\in\mathbb{R}^{(p+1)\times(p+1)}.$$

From the normal equations for linear regression, e.g., (3.6) in the textbook, we have

$$X^TX\begin{pmatrix}\beta_0\\ \beta\end{pmatrix}
= \begin{pmatrix}
N & \sum_{i=1}^N x_i^T\\
\sum_{i=1}^N x_i & \sum_{i=1}^N x_i x_i^T
\end{pmatrix}\begin{pmatrix}\beta_0\\ \beta\end{pmatrix}
= X^TY
= \begin{pmatrix}\sum_{i=1}^N y_i\\ \sum_{i=1}^N y_i x_i\end{pmatrix}\in\mathbb{R}^{(p+1)\times 1}.$$

Therefore we have

$$N\beta_0 + \left(\sum_{i=1}^N x_i^T\right)\beta = \sum_{i=1}^N y_i,$$
$$\beta_0\sum_{i=1}^N x_i + \left(\sum_{i=1}^N x_i x_i^T\right)\beta = \sum_{i=1}^N y_i x_i.\tag{2}$$

From the first equation above, we get

$$\beta_0 = \frac{1}{N}\left(\sum_{i=1}^N y_i - \left(\sum_{i=1}^N x_i^T\right)\beta\right),\tag{3}$$

Plugging this into (2), we obtain

$$\left[\sum_{i=1}^N x_i x_i^T - \frac{1}{N}\left(\sum_{i=1}^N x_i\right)\left(\sum_{i=1}^N x_i^T\right)\right]\beta
= \sum_{i=1}^N y_i x_i - \frac{1}{N}\left(\sum_{i=1}^N y_i\right)\left(\sum_{i=1}^N x_i\right).\tag{4}$$
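As an aside (not in the original solution), equation (4) depends only on the normal equations, so it can be checked numerically against an off-the-shelf least-squares fit; the simulated design and response below are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(1)

# Arbitrary design and response; (4) follows from the normal equations alone.
N, p = 100, 3
X = rng.normal(size=(N, p))
y = rng.normal(size=N)

# Least-squares fit with an intercept column.
Xd = np.column_stack([np.ones(N), X])
coef, *_ = np.linalg.lstsq(Xd, y, rcond=None)
beta0, beta = coef[0], coef[1:]

# Check equation (4): centered Gram matrix times beta = centered cross moment.
x_sum = X.sum(axis=0)
lhs = (X.T @ X - np.outer(x_sum, x_sum) / N) @ beta
rhs = X.T @ y - y.sum() * x_sum / N
assert np.allclose(lhs, rhs)
```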

We pause here and turn to $\hat{\mu}_1$, $\hat{\mu}_2$ and $\hat{\Sigma}$. Let us encode $y_i$ as $t_1$ for class 1 and as $t_2$ for class 2. Note that for this particular problem (b), $t_1 = -N/N_1$ and $t_2 = N/N_2$. By definition, we have

$$\begin{aligned}
\hat{\mu}_1 &= \frac{1}{N_1}\sum_{i:\,g_i=1} x_i, &\qquad \hat{\mu}_2 &= \frac{1}{N_2}\sum_{i:\,g_i=2} x_i,\\
\sum_{i=1}^N x_i &= N_1\hat{\mu}_1 + N_2\hat{\mu}_2, &\qquad \sum_{i=1}^N x_i^T &= N_1\hat{\mu}_1^T + N_2\hat{\mu}_2^T,\\
\sum_{i=1}^N y_i &= N_1 t_1 + N_2 t_2, &\qquad \sum_{i=1}^N y_i x_i &= t_1 N_1\hat{\mu}_1 + t_2 N_2\hat{\mu}_2,
\end{aligned}$$

and

$$\hat{\Sigma} = \frac{1}{N-2}\left(\sum_{i:\,g_i=1}(x_i - \hat{\mu}_1)(x_i - \hat{\mu}_1)^T + \sum_{i:\,g_i=2}(x_i - \hat{\mu}_2)(x_i - \hat{\mu}_2)^T\right).$$

From the equations above, we can rewrite

$$\sum_{i=1}^N x_i x_i^T = (N-2)\hat{\Sigma} + N_1\hat{\mu}_1\hat{\mu}_1^T + N_2\hat{\mu}_2\hat{\mu}_2^T.$$
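This decomposition is easy to verify numerically; the short sketch below (not part of the original derivation, with arbitrary simulated data) checks it.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative two-class data; class sizes and dimension are arbitrary.
N1, N2, p = 40, 60, 3
X1 = rng.normal(size=(N1, p))
X2 = rng.normal(loc=1.0, size=(N2, p))
X = np.vstack([X1, X2])
N = N1 + N2

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
Sigma = ((X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)) / (N - 2)

# sum_i x_i x_i^T = (N - 2) Sigma + N1 mu1 mu1^T + N2 mu2 mu2^T
lhs = X.T @ X
rhs = (N - 2) * Sigma + N1 * np.outer(mu1, mu1) + N2 * np.outer(mu2, mu2)
assert np.allclose(lhs, rhs)
```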

Now we turn back to (4). For the LHS,

$$\begin{aligned}
\sum_{i=1}^N x_i x_i^T - \frac{1}{N}\left(\sum_{i=1}^N x_i\right)\left(\sum_{i=1}^N x_i^T\right)
&= \sum_{i=1}^N x_i x_i^T - \frac{1}{N}\left(N_1\hat{\mu}_1 + N_2\hat{\mu}_2\right)\left(N_1\hat{\mu}_1^T + N_2\hat{\mu}_2^T\right)\\
&= (N-2)\hat{\Sigma} + N_1\hat{\mu}_1\hat{\mu}_1^T + N_2\hat{\mu}_2\hat{\mu}_2^T - \frac{1}{N}\left(N_1\hat{\mu}_1 + N_2\hat{\mu}_2\right)\left(N_1\hat{\mu}_1^T + N_2\hat{\mu}_2^T\right)\\
&= (N-2)\hat{\Sigma} + \frac{N_1N_2}{N}(\hat{\mu}_2 - \hat{\mu}_1)(\hat{\mu}_2 - \hat{\mu}_1)^T\\
&= (N-2)\hat{\Sigma} + N\hat{\Sigma}_B,
\end{aligned}\tag{5}$$

where $\hat{\Sigma}_B = \frac{N_1N_2}{N^2}(\hat{\mu}_2 - \hat{\mu}_1)(\hat{\mu}_2 - \hat{\mu}_1)^T$.

For the RHS of (4),

$$\begin{aligned}
\sum_{i=1}^N y_i x_i - \frac{1}{N}\left(\sum_{i=1}^N y_i\right)\left(\sum_{i=1}^N x_i\right)
&= t_1N_1\hat{\mu}_1 + t_2N_2\hat{\mu}_2 - \frac{1}{N}\left(N_1t_1 + N_2t_2\right)\left(N_1\hat{\mu}_1 + N_2\hat{\mu}_2\right)\\
&= \frac{N_1N_2}{N}(t_2 - t_1)(\hat{\mu}_2 - \hat{\mu}_1).
\end{aligned}\tag{6}$$

Combining (4), (5) and (6) we get

$$\left[(N-2)\hat{\Sigma} + N\hat{\Sigma}_B\right]\beta = \frac{N_1N_2}{N}(t_2 - t_1)(\hat{\mu}_2 - \hat{\mu}_1).\tag{7}$$

Note that $t_1 = -N/N_1$, $t_2 = N/N_2$ and $N = N_1 + N_2$, so that $t_2 - t_1 = \frac{N}{N_2} + \frac{N}{N_1} = \frac{N^2}{N_1N_2}$.

Thus, (6) reduces to $\frac{N_1N_2}{N}\cdot\frac{N^2}{N_1N_2}(\hat{\mu}_2 - \hat{\mu}_1) = N(\hat{\mu}_2 - \hat{\mu}_1)$, and we finish the proof.
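To double-check the result of (b), the sketch below (not part of the original solution; the simulated data is illustrative) fits the regression with the $-N/N_1$, $N/N_2$ coding and verifies that $\left[(N-2)\hat{\Sigma} + N\hat{\Sigma}_B\right]\hat{\beta} = N(\hat{\mu}_2 - \hat{\mu}_1)$.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative two-class data with unequal class sizes.
N1, N2, p = 30, 70, 3
X1 = rng.normal(size=(N1, p))
X2 = rng.normal(loc=1.0, size=(N2, p))
X = np.vstack([X1, X2])
N = N1 + N2

# Target coding -N/N1 for class 1 and N/N2 for class 2.
y = np.concatenate([np.full(N1, -N / N1), np.full(N2, N / N2)])

# Least-squares fit with an intercept column.
Xd = np.column_stack([np.ones(N), X])
coef, *_ = np.linalg.lstsq(Xd, y, rcond=None)
beta = coef[1:]

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
Sigma = ((X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)) / (N - 2)
Sigma_B = (N1 * N2 / N**2) * np.outer(mu2 - mu1, mu2 - mu1)

# Check [(N - 2) Sigma + N Sigma_B] beta = N (mu2 - mu1).
lhs = ((N - 2) * Sigma + N * Sigma_B) @ beta
assert np.allclose(lhs, N * (mu2 - mu1))
```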

(c)

We have

$$\hat{\Sigma}_B\beta = \frac{N_1N_2}{N^2}(\hat{\mu}_2 - \hat{\mu}_1)(\hat{\mu}_2 - \hat{\mu}_1)^T\beta,$$

where $(\hat{\mu}_2 - \hat{\mu}_1)^T\beta \in \mathbb{R}$ is a scalar, thus $\hat{\Sigma}_B\beta$ is in the direction of $(\hat{\mu}_2 - \hat{\mu}_1)$. By (7) from part (b), $(N-2)\hat{\Sigma}\hat{\beta} = N(\hat{\mu}_2 - \hat{\mu}_1) - N\hat{\Sigma}_B\hat{\beta}$ is therefore also a multiple of $(\hat{\mu}_2 - \hat{\mu}_1)$, and hence

$$\hat{\beta} \propto \hat{\Sigma}^{-1}(\hat{\mu}_2 - \hat{\mu}_1).$$
(d)

This follows directly from (7): for any distinct coding $t_1 \neq t_2$, the right-hand side of (7) is a nonzero multiple of $(\hat{\mu}_2 - \hat{\mu}_1)$, so the argument of part (c) goes through unchanged and $\hat{\beta} \propto \hat{\Sigma}^{-1}(\hat{\mu}_2 - \hat{\mu}_1)$.
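The proportionality in (c), and its independence of the coding in (d), can also be checked numerically; the sketch below (not part of the original solution) uses an arbitrary distinct coding $t_1 \neq t_2$.

```python
import numpy as np

rng = np.random.default_rng(4)

# Arbitrary distinct target codes; the fitted direction should not depend on them.
t1, t2 = -3.0, 7.5
N1, N2, p = 35, 65, 3
X1 = rng.normal(size=(N1, p))
X2 = rng.normal(loc=1.0, size=(N2, p))
X = np.vstack([X1, X2])
N = N1 + N2
y = np.concatenate([np.full(N1, t1), np.full(N2, t2)])

# Least-squares fit with an intercept column; keep only the slope part.
Xd = np.column_stack([np.ones(N), X])
beta = np.linalg.lstsq(Xd, y, rcond=None)[0][1:]

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
Sigma = ((X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)) / (N - 2)
lda_dir = np.linalg.solve(Sigma, mu2 - mu1)

# beta should be collinear with Sigma^{-1} (mu2 - mu1).
cosine = beta @ lda_dir / (np.linalg.norm(beta) * np.linalg.norm(lda_dir))
assert np.isclose(abs(cosine), 1.0)
```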

(e)

Assuming the coding $t_1 = -N/N_1$ and $t_2 = N/N_2$, we have $\sum_{i=1}^N y_i = N_1 t_1 + N_2 t_2 = -N + N = 0$, so by (3) we have

$$\begin{aligned}
\hat{\beta}_0 &= \frac{1}{N}\left(\sum_{i=1}^N y_i - \left(\sum_{i=1}^N x_i^T\right)\hat{\beta}\right)\\
&= -\frac{1}{N}\left(\sum_{i=1}^N x_i^T\right)\hat{\beta}\\
&= -\frac{1}{N}\left(N_1\hat{\mu}_1^T + N_2\hat{\mu}_2^T\right)\hat{\beta},
\end{aligned}\tag{8}$$

so that

$$\hat{f}(x) = \hat{\beta}_0 + x^T\hat{\beta} = \left[x^T - \frac{1}{N}\left(N_1\hat{\mu}_1^T + N_2\hat{\mu}_2^T\right)\right]\hat{\beta}.$$

Since $\hat{\beta} \propto \hat{\Sigma}^{-1}(\hat{\mu}_2 - \hat{\mu}_1)$, we can write $\hat{\beta} = \lambda\hat{\Sigma}^{-1}(\hat{\mu}_2 - \hat{\mu}_1)$ for some $\lambda > 0$ (if $\lambda < 0$, the classification rule simply flips its sign). Therefore, $\hat{f}(x) > 0$ is equivalent to

$$\left[x^T - \frac{1}{N}\left(N_1\hat{\mu}_1^T + N_2\hat{\mu}_2^T\right)\right]\hat{\Sigma}^{-1}(\hat{\mu}_2 - \hat{\mu}_1) > 0,$$

which is equivalent to the LDA rule (1) when $N_1 = N_2$: in that case $\frac{1}{N}(N_1\hat{\mu}_1 + N_2\hat{\mu}_2) = \frac{1}{2}(\hat{\mu}_1 + \hat{\mu}_2)$ and $\log(N_2/N_1) = 0$. When $N_1 \neq N_2$, $\log(N_2/N_1) \neq 0$ in (1) and the cut point above is the weighted mean $\frac{1}{N}(N_1\hat{\mu}_1 + N_2\hat{\mu}_2)$ rather than the midpoint $\frac{1}{2}(\hat{\mu}_1 + \hat{\mu}_2)$, so the two rules are not the same.
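A small simulation (not part of the original solution; the data and the helper `rules_agree` are illustrative) makes the conclusion of (e) concrete: the two rules coincide when $N_1 = N_2$ and generally disagree on some points otherwise.

```python
import numpy as np

rng = np.random.default_rng(5)

def rules_agree(N1, N2, p=2, n_test=500):
    """Compare 'classify to 2 iff f_hat(x) > 0' against the LDA rule (1)."""
    X1 = rng.normal(loc=0.0, size=(N1, p))
    X2 = rng.normal(loc=1.0, size=(N2, p))
    X = np.vstack([X1, X2])
    N = N1 + N2
    y = np.concatenate([np.full(N1, -N / N1), np.full(N2, N / N2)])

    # Least-squares fit with an intercept column.
    Xd = np.column_stack([np.ones(N), X])
    coef, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    beta0, beta = coef[0], coef[1:]

    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    Sigma = ((X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)) / (N - 2)
    Sigma_inv = np.linalg.inv(Sigma)

    # Evaluate both rules on test points drawn near the decision boundary.
    x_test = rng.normal(loc=0.5, size=(n_test, p))
    reg = (beta0 + x_test @ beta) > 0
    lda = (x_test @ Sigma_inv @ (mu2 - mu1)
           > 0.5 * (mu2 + mu1) @ Sigma_inv @ (mu2 - mu1) - np.log(N2 / N1))
    return np.array_equal(reg, lda)

print(rules_agree(50, 50))   # equal class sizes: the rules agree (True)
print(rules_agree(20, 80))   # unequal class sizes: typically False
```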