Ex. 14.2
Consider a mixture model density in \(p\)-dimensional feature space,
\[
	g(x) = \sum_{k=1}^K \pi_k g_k(x),
\]
where \(g_k=N(\mu_k, \bb{I}\cdot \sigma^2)\) and \(\pi_k\ge 0 \ \forall k\) with \(\sum_{k}\pi_k = 1\). Here \(\{\mu_k, \pi_k\}, k=1,...,K\) and \(\sigma^2\) are unknown parameters.
Suppose we have data \(x_1, x_2, ..., x_N\sim g(x)\) and we wish to fit the mixture model.
- Write down the log-likelihood of the data.
- Derive an EM algorithm for computing the maximum likelihood estimates (see Section 8.1).
- Show that if \(\sigma\) has a known value in the mixture model and we take \(\sigma\ra 0\), then in a sense this EM algorithm coincides with \(K\)-means clustering.
Soln. 14.2
This is similar to Ex. 13.1.
(1) The log-likelihood of the data is
\[
	\ell(\mu, \pi, \sigma^2) = \sum_{i=1}^N \log\left(\sum_{k=1}^K \pi_k g_k(x_i)\right),
\]
where \(g_k(x_i) = (2\pi\sigma^2)^{-p/2}\exp\left(-\frac{\|x_i-\mu_k\|^2}{2\sigma^2}\right)\).
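As a quick numerical check of this expression, here is a minimal NumPy sketch; the function name `mixture_log_likelihood` and the array shapes are my own choices, not part of the exercise.

```python
import numpy as np

def mixture_log_likelihood(X, mu, pi, sigma2):
    """Log-likelihood of a spherical Gaussian mixture (illustrative sketch).

    X: (N, p) data, mu: (K, p) means, pi: (K,) mixing weights, sigma2: scalar variance.
    """
    N, p = X.shape
    # Squared distances ||x_i - mu_k||^2, shape (N, K)
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    # Spherical Gaussian densities g_k(x_i), shape (N, K)
    g = (2 * np.pi * sigma2) ** (-p / 2) * np.exp(-d2 / (2 * sigma2))
    # sum over observations of log sum_k pi_k g_k(x_i)
    return np.log(g @ pi).sum()
```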
(2) The EM algorithm is an iterate-until-convergence process, and each iteration consists of two steps, Expectation and Maximization. For the Gaussian mixture model, the Expectation step computes the responsibility of class \(k\) for observation \(x_i\) as
\[
	r_i^{(k)} = \frac{\hat\pi_k\, \hat g_k(x_i)}{\sum_{j=1}^K \hat\pi_j\, \hat g_j(x_i)},
\]
where \(\hat g_j(x) = N(\hat \mu_j, \bb{I}\cdot \hat\sigma^2)\).
The Maximization step then computes the weighted means and variance, together with the mixing proportions, as
\[
	\hat\mu_k = \frac{\sum_{i=1}^N r_i^{(k)} x_i}{\sum_{i=1}^N r_i^{(k)}},\qquad
	\hat\sigma^2 = \frac{1}{Np}\sum_{i=1}^N\sum_{k=1}^K r_i^{(k)}\,\|x_i-\hat\mu_k\|^2,\qquad
	\hat\pi_k = \frac{1}{N}\sum_{i=1}^N r_i^{(k)}.
\]
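The two steps above can be sketched in a few lines of NumPy. This is an illustrative implementation for the spherical model only; all names (`em_step`, `r`, `Nk`) are mine.

```python
import numpy as np

def em_step(X, mu, pi, sigma2):
    """One EM iteration for the spherical Gaussian mixture (sketch, not reference code)."""
    N, p = X.shape

    # E-step: responsibilities r[i, k] proportional to pi_k * g_k(x_i)
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)          # (N, K)
    log_g = -0.5 * p * np.log(2 * np.pi * sigma2) - d2 / (2 * sigma2)
    log_w = np.log(pi) + log_g
    log_w -= log_w.max(axis=1, keepdims=True)                         # numerical stabilization
    r = np.exp(log_w)
    r /= r.sum(axis=1, keepdims=True)                                 # (N, K)

    # M-step: weighted means, common variance, mixing proportions
    Nk = r.sum(axis=0)                                                # (K,)
    mu_new = (r.T @ X) / Nk[:, None]
    d2_new = ((X[:, None, :] - mu_new[None, :, :]) ** 2).sum(axis=2)
    sigma2_new = (r * d2_new).sum() / (N * p)
    pi_new = Nk / N
    return mu_new, pi_new, sigma2_new, r
```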
Note that for the general case, where the covariance \(\Sigma_k\) may vary with \(k\), the estimate above becomes
\[
	\hat\Sigma_k = \frac{\sum_{i=1}^N r_i^{(k)}\,(x_i-\hat\mu_k)(x_i-\hat\mu_k)^T}{\sum_{i=1}^N r_i^{(k)}}.
\]
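For this general-covariance variant, the corresponding M-step update can be sketched as follows (an illustrative helper, with names of my choosing):

```python
import numpy as np

def update_covariances(X, r, mu):
    """Sigma_k = sum_i r_ik (x_i - mu_k)(x_i - mu_k)^T / sum_i r_ik  (sketch)."""
    Nk = r.sum(axis=0)                       # (K,)
    diff = X[:, None, :] - mu[None, :, :]    # (N, K, p)
    # Responsibility-weighted outer products, accumulated over observations
    Sigma = np.einsum('nk,nkp,nkq->kpq', r, diff, diff) / Nk[:, None, None]
    return Sigma                             # (K, p, p)
```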
(3) It is easy to verify that, as \(\sigma\ra 0\), the ratio \(\hat g_j(x_i)/\hat g_k(x_i) = \exp\left(-\frac{\|x_i-\hat\mu_j\|^2 - \|x_i-\hat\mu_k\|^2}{2\sigma^2}\right)\ra 0\) whenever cluster \(k\) is strictly closer to \(x_i\) than cluster \(j\). Hence \(r_i^{(k)}\ra 1\) for the cluster \(k\) whose mean is closest to \(x_i\), and \(r_i^{(k)}\ra 0\) for all other clusters. The E-step thus reduces to a hard assignment of each point to its nearest mean, and the M-step reduces to recomputing the cluster means, so the \(K\)-means algorithm is recovered.
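A small self-contained sketch of this limit (the toy data and parameter values are mine); since the normalizing constant \((2\pi\sigma^2)^{-p/2}\) is shared by all components, it cancels in the responsibilities and only the squared distances matter:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated toy clusters in the plane
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
mu = np.array([[0.5, 0.5], [2.5, 2.5]])
pi = np.array([0.5, 0.5])
d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (N, K)

for sigma2 in [1.0, 0.1, 1e-3]:
    log_w = np.log(pi) - d2 / (2 * sigma2)                 # shared constants cancel
    log_w -= log_w.max(axis=1, keepdims=True)
    r = np.exp(log_w)
    r /= r.sum(axis=1, keepdims=True)
    # Average responsibility that each point places on its closest mean
    print(sigma2, r.max(axis=1).mean().round(4))
# As sigma2 -> 0 this tends to 1: the E-step becomes the K-means assignment step.
```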