Ex. 12.11

The MDA procedure models each class as a mixture of Gaussians. Hence each mixture center belongs to one and only one class. A more general model allows each mixture center to be shared by all classes. We take the joint density of labels and features to be

\[\begin{equation} P(G, X) = \sum_{r=1}^R\pi_rP_r(G,X),\non \end{equation}\]

a mixture of joint densities. Furthermore we assume

\[\begin{equation} P_r(G,X) = P_r(G)\phi(X;\mu_r,\Sigma).\non \end{equation}\]

This model consists of regions centered at \(\mu_r\), and for each there is a class profile \(P_r(G)\). The posterior class distribution is given by

\[\begin{equation} P(G=k|X=x) = \frac{\sum_{r=1}^R\pi_rP_r(G=k)\phi(x;\mu_r,\Sigma)}{\sum_{r=1}^R\pi_r\phi(x;\mu_r,\Sigma)},\non \end{equation}\]

where the denominator is the marginal distribution \(P(X)\).
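
As a concrete illustration of this formula, here is a minimal NumPy sketch that evaluates the posterior for given parameters (the function name `mda2_posterior` and the array layout are illustrative choices, not part of the exercise):

```python
import numpy as np
from scipy.stats import multivariate_normal

def mda2_posterior(x, pi, profiles, mus, Sigma):
    """Posterior P(G=k | X=x) under MDA2.

    pi:       (R,)   mixing proportions pi_r
    profiles: (R, K) class profiles, row r is P_r(G=.)
    mus:      (R, p) component centers mu_r
    Sigma:    (p, p) shared covariance matrix
    """
    # Gaussian density phi(x; mu_r, Sigma) for every component r
    phi = np.array([multivariate_normal.pdf(x, mean=mu, cov=Sigma) for mu in mus])
    joint = (pi * phi)[:, None] * profiles        # pi_r P_r(G=k) phi(x; mu_r, Sigma), shape (R, K)
    return joint.sum(axis=0) / (pi * phi).sum()   # numerator over the marginal P(X=x)
```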

(a) Show that this model (called MDA2) can be viewed as a generalization of MDA since

\[\begin{equation} P(X=x|G=k) = \frac{\sum_{r=1}^R\pi_rP_r(G=k)\phi(x;\mu_r,\Sigma)}{\sum_{r=1}^R\pi_rP_r(G=k)},\non \end{equation}\]

where \(\pi_{rk} = \pi_rP_r(G=k)/\sum_{r=1}^R\pi_rP_r(G=k)\) corresponds to the mixing proportions for the \(k\)th class.

(b) Derive the EM algorithm for MDA2.

(c) Show that if the initial weight matrix is constructed as in MDA, involving separate \(k\)-means clustering in each class, then the algorithm for MDA2 is identical to the original MDA procedure.

Soln. 12.11

(a) We have

\[\begin{eqnarray} P(X=x|G=k) &=& \frac{P(G=k, X=x)}{P(G=k)}\non\\ &=&\frac{\sum_{r=1}^R\pi_rP_r(G=k)\phi(x;\mu_r,\Sigma)}{\sum_{r=1}^R\pi_rP_r(G=k)}\non\\ &=&\sum_{r=1}^R\pi_{rk}\phi(x;\mu_r, \Sigma),\non \end{eqnarray}\]

where \(P(G=k)=\int P(G=k,X=x)\,dx=\sum_{r=1}^R\pi_rP_r(G=k)\), since each Gaussian density integrates to one, and \(\pi_{rk} = \pi_rP_r(G=k)/\sum_{r=1}^R\pi_rP_r(G=k)\) as defined in the problem statement.

Intuitively, MDA itself can be viewed in this form: there is a total of \(R=\sum_{k=1}^KR_k\) mixture centers shared by all \(K\) classes, with the additional constraint that \(P_r(G=k)=0\) whenever center \(r\) is not one of the \(R_k\) centers assigned to class \(k\), so the joint density of class \(k\) and any center outside its own set is zero.
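
A quick numerical sanity check of this identity (all parameter values below are arbitrary and purely illustrative): the per-class mixing proportions \(\pi_{rk}\) sum to one, and \(P(G=k)P(X=x|G=k)\) recovers the joint \(\sum_r\pi_rP_r(G=k)\phi(x;\mu_r,\Sigma)\).

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
R, K, p = 4, 3, 2
pi = rng.dirichlet(np.ones(R))                 # pi_r
profiles = rng.dirichlet(np.ones(K), size=R)   # row r is P_r(G=.)
mus = rng.normal(size=(R, p))
Sigma = np.eye(p)
x = rng.normal(size=p)

phi = np.array([multivariate_normal.pdf(x, mean=mu, cov=Sigma) for mu in mus])
P_G = profiles.T @ pi                          # P(G=k) = sum_r pi_r P_r(G=k)
pi_rk = (pi[:, None] * profiles) / P_G         # column k holds the class-k mixing proportions
P_X_given_G = phi @ pi_rk                      # sum_r pi_rk phi(x; mu_r, Sigma)

assert np.allclose(pi_rk.sum(axis=0), 1.0)
assert np.allclose(P_G * P_X_given_G, profiles.T @ (pi * phi))  # P(G=k)P(x|G=k) = P(G=k, x)
```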

(b) The EM algorithm is similar to that of MDA. Specifically, we have

E-step: Given the current parameters \((\pi_r, P_r(G=k), \mu_r, \Sigma)\), compute the responsibility \(W(c_{kr}|x_i, g_i)\) of center \(r\) for each class-\(k\) observation \(x_i\):

\[\begin{equation} W(c_{kr}|x_i, g_i) = \frac{\pi_r P_r(G=k)\phi(x_i; \mu_r, \Sigma)}{\sum_{l=1}^R\pi_l P_l(G=k)\phi(x_i;\mu_l, \Sigma)}.\non \end{equation}\]

M-step: Compute the weighted MLEs using the responsibilities from the E-step. Because the component Gaussians are now shared by all classes, each update pools observations from every class. Writing \(W_{ir}=W(c_{g_ir}|x_i,g_i)\) for brevity,

\[\begin{eqnarray} \pi_r &=& \frac{1}{N}\sum_{i=1}^N W_{ir},\non\\ P_r(G=k) &=& \frac{\sum_{i:g_i=k}W_{ir}}{\sum_{i=1}^N W_{ir}},\non\\ \mu_r &=& \frac{\sum_{i=1}^N W_{ir}x_i}{\sum_{i=1}^N W_{ir}},\non\\ \Sigma &=& \frac{1}{N}\sum_{i=1}^N\sum_{r=1}^R W_{ir}(x_i-\mu_r)(x_i-\mu_r)^T.\non \end{eqnarray}\]

A code sketch of one full iteration is given below.
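
Assuming the responsibilities are stored as an \(N\times R\) matrix, a minimal NumPy sketch of one such iteration might look as follows (the function name `mda2_em_step` and all variable names are mine, not from the text):

```python
import numpy as np
from scipy.stats import multivariate_normal

def mda2_em_step(X, g, pi, profiles, mus, Sigma):
    """One EM iteration for MDA2.

    X: (N, p) features, g: (N,) integer class labels in 0..K-1
    pi: (R,), profiles: (R, K) with rows P_r(G=.), mus: (R, p), Sigma: (p, p)
    """
    N, p = X.shape
    R, K = profiles.shape

    # E-step: responsibilities W[i, r] = P(component r | x_i, g_i)
    phi = np.array([multivariate_normal.pdf(X, mean=mu, cov=Sigma) for mu in mus]).T  # (N, R)
    num = phi * pi * profiles[:, g].T          # pi_r P_r(G=g_i) phi(x_i; mu_r, Sigma), shape (N, R)
    W = num / num.sum(axis=1, keepdims=True)

    # M-step: weighted MLEs of the shared components and the class profiles
    Nr = W.sum(axis=0)                         # effective count per component
    pi_new = Nr / N
    mus_new = (W.T @ X) / Nr[:, None]
    profiles_new = np.array([W[g == k].sum(axis=0) for k in range(K)]).T / Nr[:, None]
    diff = X[:, None, :] - mus_new[None, :, :]                     # (N, R, p)
    Sigma_new = np.einsum('nr,nri,nrj->ij', W, diff, diff) / N
    return pi_new, profiles_new, mus_new, Sigma_new
```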

(c) For MDA, the \(k\)-means initialization partitions the observations of each class \(k\) into \(R_k\) disjoint groups, so the initial weight matrix consists entirely of zeros and ones.

Recall from (a) that MDA2 generalizes MDA, the \(R=\sum_{k=1}^KR_k\) centers being partitioned among the classes. If MDA2 is initialized with the same per-class \(k\)-means step, each class-\(k\) observation receives weight one on a single center created from class \(k\), and the implied class profiles are indicators: \(P_r(G=k)=1\) if center \(r\) was built from class \(k\) and \(0\) otherwise. In every subsequent E-step, \(P_r(G=g_i)=0\) forces \(W(c_{kr}|x_i,g_i)=0\) whenever center \(r\) does not belong to class \(g_i\), and in the M-step these zero weights keep the class profiles at their indicator values. Each center is therefore always estimated from the observations of its own class only, so the MDA2 iterations coincide with the original MDA procedure.
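
A sketch of that initialization, using SciPy's `kmeans2` for the within-class clustering (the helper name `mda_style_init` is my own choice for illustration): the returned responsibilities are zeros and ones, and the class profiles are 0/1 indicators, so the EM iterations above only ever mix class-\(k\) observations with class-\(k\) centers, exactly as in MDA.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def mda_style_init(X, g, Rk):
    """MDA-style initialization of MDA2: separate k-means within each class.

    X: (N, p) features, g: (N,) labels in 0..K-1, Rk: list of subclass counts per class.
    Returns 0/1 responsibilities W (N, R), centers (R, p), and indicator profiles (R, K).
    """
    K, R = len(Rk), sum(Rk)
    W = np.zeros((X.shape[0], R))
    mus = np.zeros((R, X.shape[1]))
    profiles = np.zeros((R, K))
    offset = 0
    for k in range(K):
        idx = np.where(g == k)[0]
        centers, labels = kmeans2(X[idx], Rk[k], minit='++')   # k-means within class k
        W[idx, offset + labels] = 1.0              # each class-k point assigned to one class-k center
        mus[offset:offset + Rk[k]] = centers
        profiles[offset:offset + Rk[k], k] = 1.0   # P_r(G=k) = 1 iff center r belongs to class k
        offset += Rk[k]
    return W, mus, profiles
```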