Ex. 12.11

The MDA procedure models each class as a mixture of Gaussians. Hence each mixture center belongs to one and only one class. A more general model allows each mixture center to be shared by all classes. We take the joint density of labels and features to be

$$P(G, X) = \sum_{r=1}^{R} \pi_r P_r(G, X),$$

a mixture of joint densities. Furthermore we assume

$$P_r(G, X) = P_r(G)\,\phi(X; \mu_r, \Sigma).$$

This model consists of regions centered at $\mu_r$, and for each there is a class profile $P_r(G)$. The posterior class distribution is given by

$$P(G = k \mid X = x) = \frac{\sum_{r=1}^{R} \pi_r P_r(G = k)\,\phi(x; \mu_r, \Sigma)}{\sum_{r=1}^{R} \pi_r \phi(x; \mu_r, \Sigma)},$$

where the denominator is the marginal distribution $P(X = x)$.

(a) Show that this model (called MDA2) can be viewed as a generalization of MDA since

$$P(X \mid G = k) = \frac{\sum_{r=1}^{R} \pi_r P_r(G = k)\,\phi(x; \mu_r, \Sigma)}{\sum_{r=1}^{R} \pi_r P_r(G = k)},$$

where $\pi_{kr} = \pi_r P_r(G = k) / \sum_{r=1}^{R} \pi_r P_r(G = k)$ corresponds to the mixing proportions for the $k$th class.

(b) Derive the EM algorithm for MDA2.

(c) Show that if the initial weight matrix is constructed as in MDA, involving separate k-means clustering in each class, then the algorithm for MDA2 is identical to the original MDA procedure.

Soln. 12.11

(a) We have

$$P(X = x \mid G = k) = \frac{P(G = k, X = x)}{P(G = k)} = \frac{\sum_{r=1}^{R} \pi_r P_r(G = k)\,\phi(x; \mu_r, \Sigma)}{\sum_{r=1}^{R} \pi_r P_r(G = k)} = \sum_{r=1}^{R} \pi_{kr}\,\phi(x; \mu_r, \Sigma),$$

where $\pi_{kr} = \pi_r P_r(G = k) / \sum_{r=1}^{R} \pi_r P_r(G = k)$.

Intuitively, we can view MDA in this framework as having a total of $R = \sum_{k=1}^{K} R_k$ mixture centers shared by all $K$ classes, with the additional constraint that the joint density of class $k$ and any center $r \notin \{R_k\}$ is zero, where $\{R_k\}$ denotes the set of mixture centers assigned to class $k$.
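
As a sanity check on part (a), the identities above can be verified numerically. In the NumPy sketch below, all parameter values ($\pi_r$, $P_r(G)$, $\mu_r$, $\Sigma$) are made-up illustrations; the check is that each row of $\pi_{kr}$ sums to one and that Bayes' rule applied to $P(X \mid G = k)$ recovers the posterior formula stated earlier.

```python
import numpy as np

rng = np.random.default_rng(0)
R, K, p = 4, 2, 2                     # centers, classes, feature dimension

# Hypothetical parameters: mixing proportions, class profiles, centers.
pi = rng.dirichlet(np.ones(R))        # pi_r, sums to 1
Pg = rng.dirichlet(np.ones(K), R)     # Pg[r, k] = P_r(G = k), rows sum to 1
mu = rng.normal(size=(R, p))          # mu_r
Sigma = np.eye(p)

def phi(x, m, S):
    """Multivariate Gaussian density at x with mean m, covariance S."""
    d = x - m
    return np.exp(-0.5 * d @ np.linalg.solve(S, d)) / np.sqrt(
        (2 * np.pi) ** len(x) * np.linalg.det(S))

x = rng.normal(size=p)
dens = np.array([phi(x, mu[r], Sigma) for r in range(R)])

# Class-k mixing proportions pi_{kr} from part (a): each row sums to 1.
pi_kr = pi * Pg.T / (pi * Pg.T).sum(axis=1, keepdims=True)

# P(X | G = k) as the class-k mixture; P(G = k) as the class prior.
px_given_k = pi_kr @ dens
pk = pi @ Pg

# Bayes' rule recovers the stated posterior formula for P(G = k | X = x).
posterior_bayes = px_given_k * pk / (pi @ dens)
posterior_formula = (pi[:, None] * Pg * dens[:, None]).sum(0) / (pi @ dens)
assert np.allclose(posterior_bayes, posterior_formula)
```

The agreement holds for any parameter values, since $\pi_{kr} \cdot P(G = k) = \pi_r P_r(G = k)$ by construction.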

(b) The EM algorithm is similar to that of MDA. Specifically, we have

E-step: Given current parameters $(\pi_r, P_r(G = k), \mu_r, \Sigma)$, compute the responsibility $W(c_{kr} \mid x_i, g_i)$ for each class-$k$ observation $x_i$:

$$W(c_{kr} \mid x_i, g_i) = \frac{\pi_r P_r(G = k)\,\phi(x_i; \mu_r, \Sigma)}{\sum_{l=1}^{R} \pi_l P_l(G = k)\,\phi(x_i; \mu_l, \Sigma)}.$$

M-step: Compute the weighted MLEs for the parameters of the component Gaussians, using the responsibilities $W_{ir} = W(c_{g_i r} \mid x_i, g_i)$ from the E-step:

$$\pi_r = \frac{1}{N}\sum_{i=1}^{N} W_{ir}, \qquad P_r(G = k) = \frac{\sum_{i:\, g_i = k} W_{ir}}{\sum_{i=1}^{N} W_{ir}}, \qquad \mu_r = \frac{\sum_{i=1}^{N} W_{ir}\, x_i}{\sum_{i=1}^{N} W_{ir}}, \qquad \Sigma = \frac{1}{N}\sum_{r=1}^{R}\sum_{i=1}^{N} W_{ir}\, (x_i - \mu_r)(x_i - \mu_r)^T.$$
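
The two steps can be sketched end to end in NumPy. This is a minimal illustration, not a definitive implementation: the toy data, the random initialization of the centers, and the fixed iteration count are all hypothetical choices not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy labeled data: two classes in 2-D (hypothetical).
N, K, R, p = 200, 2, 4, 2
X = np.vstack([rng.normal(loc=c, size=(N // 4, p))
               for c in ([-2, 0], [2, 0], [0, -2], [0, 2])])
g = np.repeat([0, 0, 1, 1], N // 4)

# Initialization (random here; MDA would use per-class k-means).
pi = np.full(R, 1 / R)
Pg = np.full((R, K), 1 / K)             # Pg[r, k] = P_r(G = k)
mu = X[rng.choice(N, R, replace=False)]
Sigma = np.cov(X.T)

def phi_all(X, mu, Sigma):
    """N x R matrix of Gaussian densities phi(x_i; mu_r, Sigma)."""
    Sinv = np.linalg.inv(Sigma)
    out = np.empty((len(X), len(mu)))
    for r, m in enumerate(mu):
        d = X - m
        out[:, r] = np.exp(-0.5 * np.einsum('ij,jk,ik->i', d, Sinv, d))
    return out / np.sqrt((2 * np.pi) ** X.shape[1] * np.linalg.det(Sigma))

for _ in range(50):
    # E-step: W[i, r] proportional to pi_r P_r(G = g_i) phi(x_i; mu_r, Sigma).
    W = pi * Pg[:, g].T * phi_all(X, mu, Sigma)
    W /= W.sum(axis=1, keepdims=True)
    # M-step: weighted MLEs from the formulas above.
    Nr = W.sum(axis=0)
    pi = Nr / N
    Pg = np.vstack([W[g == k].sum(axis=0) / Nr for k in range(K)]).T
    mu = (W.T @ X) / Nr[:, None]
    Sigma = sum(
        (W[:, r, None] * (X - mu[r])).T @ (X - mu[r]) for r in range(R)) / N
```

Note that, unlike MDA, the same $R$ centers serve all classes; a class enters only through its profile column $P_r(G = k)$ in the E-step weighting.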

(c) For MDA, the k-means clustering initialization partitions the observations of each class $k$ into $R_k$ disjoint groups, thus constructing an initial weight matrix consisting of zeros and ones.

Note from (a) that MDA2 is a generalization of MDA. Once the per-class k-means initialization is done for MDA2, the $R$ mixture centers are partitioned among the classes as $R = \sum_{k=1}^{K} R_k$, and each class-$k$ observation has nonzero weight only on centers in its own class's group. The first M-step then yields $P_r(G = k) = 1$ for the class owning center $r$ and zero for all other classes, so in the next E-step the responsibility of a class-$k$ observation for any center outside its own group is again zero. This degenerate structure is preserved at every iteration, and the subsequent steps of MDA2 coincide exactly with the original MDA procedure.
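
This reduction can be checked numerically. In the sketch below, a hypothetical hard assignment stands in for an actual per-class k-means run; the point is that the 0/1 initial weight matrix makes the M-step produce degenerate class profiles $P_r(G = k) \in \{0, 1\}$, which is what collapses the MDA2 iterations to MDA.

```python
import numpy as np

N, K = 12, 2
Rk = [2, 2]                    # R_k centers per class; R = sum(Rk)
R = sum(Rk)
g = np.repeat([0, 1], N // 2)  # class labels
offset = [0, Rk[0]]            # class k owns centers offset[k], ..., offset[k] + Rk[k] - 1

# 0/1 initial weight matrix: each class-k point is hard-assigned to one of
# its own class's centers (alternating assignment stands in for k-means).
W = np.zeros((N, R))
for i in range(N):
    W[i, offset[g[i]] + i % Rk[g[i]]] = 1.0

# M-step for the class profiles: P_r(G = k) = sum_{i: g_i = k} W_ir / sum_i W_ir.
Nr = W.sum(axis=0)
Pg = np.vstack([W[g == k].sum(axis=0) / Nr for k in range(K)]).T

# Each center's profile puts all its mass on the class that owns it, so the
# E-step numerator pi_r * P_r(G = g_i) * phi(x_i; mu_r, Sigma) vanishes for
# any center r outside class g_i's group -- exactly the MDA responsibilities.
assert np.allclose(Pg, [[1, 0], [1, 0], [0, 1], [0, 1]])
```

With the cross-class terms identically zero, each class's centers are updated only from that class's own observations, as in MDA.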