Ex. 4.8
Consider the multivariate Gaussian model \(X|G=k \sim N(\mu_k, \bm{\Sigma})\), with the additional restriction that \(\text{rank}\{\mu_k\}_1^K = L < \max(K-1,p)\). Derive the constrained MLEs for the \(\mu_k\) and \(\bm{\Sigma}\). Show that the Bayes classification rule is equivalent to classifying in the reduced subspace computed by LDA (Hastie and Tibshirani, "Discriminant adaptive nearest neighbor classification", 1996).
Soln. 4.8
We are going to maximize the log-likelihood (dropping additive constants)
\begin{equation}
\ell(\{\mu_k\},\bm{\Sigma}) = -\frac{N}{2}\log|\bm{\Sigma}| - \frac{1}{2}\sum_{k=1}^K\sum_{g_i=k}(x_i-\mu_k)^T\bm{\Sigma}^{-1}(x_i-\mu_k)
\label{eq:4.8loglikelihood}
\end{equation}
subject to \(\text{rank}\{\mu_k\}_{k=1}^K = L < \max(K-1,p)\).
Let \(B\) be the between-class covariance matrix, \(W\) the pooled within-class covariance matrix, and, for fixed \(\bm{\Sigma}\), let \(V\) denote the matrix of the leading \(L\) eigenvectors of \(\bm{\Sigma}^{-1}B\).
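As a purely illustrative numerical sketch (not part of the original solution), the quantities used below can be computed as follows; the simulated data and all variable names (`X`, `y`, `L`, and so on) are assumptions made here for concreteness.

```python
# Illustrative sketch: simulate Gaussian data whose centered class means have
# rank L, then compute W, B, the leading discriminant matrix V and its
# complement V_perp (all quantities as used in the solution above).
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
p, K, L, n_per = 5, 4, 2, 500

# Class means lying in an L-dimensional subspace.
A = rng.normal(size=(p, L))
mu = rng.normal(size=(K, L)) @ A.T
X = np.vstack([mu[k] + rng.normal(size=(n_per, p)) for k in range(K)])
y = np.repeat(np.arange(K), n_per)

N = len(y)
xbar = X.mean(axis=0)
xbar_k = np.array([X[y == k].mean(axis=0) for k in range(K)])
N_k = np.array([(y == k).sum() for k in range(K)])

# Pooled within-class covariance W and between-class covariance B.
W = sum((X[y == k] - xbar_k[k]).T @ (X[y == k] - xbar_k[k]) for k in range(K)) / N
B = (N_k[:, None] * (xbar_k - xbar)).T @ (xbar_k - xbar) / N

# Generalized eigenproblem B v = gamma W v; eigh normalizes v so that v^T W v = 1.
gamma, vecs = eigh(B, W)
order = np.argsort(gamma)[::-1]                 # decreasing eigenvalues
V, V_perp = vecs[:, order[:L]], vecs[:, order[L:]]
D_L, D_rest = np.diag(gamma[order[:L]]), np.diag(gamma[order[L:]])

assert np.allclose(V.T @ W @ V, np.eye(L))              # V^T W V = I_L
assert np.allclose(V_perp.T @ W @ V, 0, atol=1e-8)      # V_perp^T W V = 0
```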
- (a) \(\bm{\Sigma}\) Known, \(\mu_j\) Unknown.
In Section 12.5.2 of Multivariate Analysis, a solution to \(\eqref{eq:4.8loglikelihood}\) is given assuming that \(\bm{\Sigma}\) is known. This has the form of the usual LDA solution, except with \(W\) replaced by \(\bm{\Sigma}\). We can write the estimated means as
\begin{equation}
\hat{\mu}_k = \bar{x} + \bm{\Sigma}VV^T(\bar{x}_k - \bar{x}),
\label{eq:4.8hatmuj}
\end{equation}
where \(\bar{x}_k\) is the sample mean of class \(k\), \(\bar{x}\) is the overall mean, and \(V\) is normalized so that \(V^T\bm{\Sigma}V=I_L\), and thus the estimated (rank \(L\)) between matrix as
\begin{equation}
\hat{B} = \sum_{k=1}^K\frac{N_k}{N}(\hat{\mu}_k - \bar{x})(\hat{\mu}_k - \bar{x})^T = \bm{\Sigma}VV^TBVV^T\bm{\Sigma}.
\end{equation}
- (b) \(\mu_j\) Known, \(\bm{\Sigma}\) Unknown.
Although the case \(\mu_j\) known, \(\bm{\Sigma}\) unknown is not explicitly stated in Multivariate Analysis, we deduce (and easily check) from their equation (4.2.7) on p. 104 that
\begin{equation}
\hat{\bm{\Sigma}} = \frac{1}{N}\sum_{k=1}^K\sum_{g_i=k}(x_i - \mu_k)(x_i - \mu_k)^T.
\label{eq:4.8hatsigam}
\end{equation}
These are each obtained by solving the partial score equations for \(\mu_j\) or \(\bm{\Sigma}\), assuming that the other is known. The full maximum likelihood solution requires their simultaneous solution and suggests iteration. However, the solution turns out to be simpler. We plug the estimated means \(\eqref{eq:4.8hatmuj}\) (using \(W\) for \(\bm{\Sigma}\)) into equation \(\eqref{eq:4.8hatsigam}\), which gives
\begin{equation}
\hat{\bm{\Sigma}} = W + WV_{\perp}V_{\perp}^TBV_{\perp}V_{\perp}^TW,
\end{equation}
where \(V_{\perp}\) holds the remaining \(p-L\) eigenvectors, normalized so that \(V_{\perp}^TWV=0\) and \(V_{\perp}^TWV_{\perp}=I_{p-L}\), and spans the complementary \((p-L)\)-dimensional subspace of \(\mathbb{R}^p\). To complete the proof, we show that the same \(V\) is optimal using the new metric \(\hat{\bm{\Sigma}}\).
First note that
(1) \(V^T\hat{\bm{\Sigma}} V =V^TWV + 0 = I_L\) and
(2) \(BV = WVD_L=\hat{\bm{\Sigma}} VD_L\), where the first equality is the definition of \(V\), the second follows because \(\hat{\bm{\Sigma}}V = WV\) (since \(V_{\perp}^TWV=0\)), and \(D_L=\text{diag}(\gamma_1,\dots,\gamma_L)\) holds the leading eigenvalues.
We have established that \(V\) is also an eigenmatrix of \(B\) with respect to \(\hat{\bm{\Sigma}}\); it remains to show that it still corresponds to the \(L\) largest eigenvalues. Note that
\begin{equation}
BV_{\perp} = WV_{\perp}D_{p-L} = \hat{\bm{\Sigma}}V_{\perp}(I_{p-L}+D_{p-L})^{-1}D_{p-L},
\end{equation}
where \(D_{p-L}\) contains the eigenvalues of \(B\) with respect to \(W\) corresponding to \(V_{\perp}\), and the second equality uses \(\hat{\bm{\Sigma}}V_{\perp} = WV_{\perp}(I_{p-L}+D_{p-L})\). So, in the metric \(\hat{\bm{\Sigma}}\), the columns of \(V\) are orthonormal eigenvectors of \(B\), and the columns of \(V_{\perp}\) are mutually orthogonal and orthogonal to \(V\). Thus the columns of \(V_{\perp}\) remain eigenvectors of \(B\) with respect to \(\hat{\bm{\Sigma}}\), with eigenvalues \((I_{p-L}+D_{p-L})^{-1}D_{p-L}\le D_{p-L}\), so the ordering of the eigenvalues does not change and \(V\) still spans the leading \(L\)-dimensional discriminant subspace.
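Continuing the illustrative sketch above (same hypothetical variables), one can check numerically the relations just derived: the plug-in covariance \(\hat{\bm{\Sigma}} = W + WV_{\perp}V_{\perp}^TBV_{\perp}V_{\perp}^TW\), facts (1) and (2), and the shrunken eigenvalues attached to \(V_{\perp}\).

```python
# Continue the sketch: form Sigma_hat and verify facts (1), (2) and the
# shrunken eigenvalues associated with V_perp.
Sigma_hat = W + W @ V_perp @ V_perp.T @ B @ V_perp @ V_perp.T @ W

# (1) V^T Sigma_hat V = I_L and (2) B V = Sigma_hat V D_L.
assert np.allclose(V.T @ Sigma_hat @ V, np.eye(L))
assert np.allclose(B @ V, Sigma_hat @ V @ D_L)

# B V_perp = Sigma_hat V_perp (I + D_rest)^{-1} D_rest: the trailing eigenvalues
# shrink, so V still carries the L largest eigenvalues of B w.r.t. Sigma_hat.
shrunk = np.linalg.inv(np.eye(p - L) + D_rest) @ D_rest
assert np.allclose(B @ V_perp, Sigma_hat @ V_perp @ shrunk)

# Recompute the leading-L eigenvectors of B w.r.t. Sigma_hat directly and check
# they span the same subspace as V (both bases are Sigma_hat-orthonormal, so
# equal spans give equal outer products V V^T).
g2, v2 = eigh(B, Sigma_hat)
V2 = v2[:, np.argsort(g2)[::-1][:L]]
assert np.allclose(V2 @ V2.T, V @ V.T, atol=1e-8)
```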
This shows that the constrained maximum likelihood estimated means coincide with the rank \(L\) LDA means. Using the fact that
\begin{equation}
\hat{\bm{\Sigma}}^{-1} = VV^T + V_{\perp}(I_{p-L}+D_{p-L})^{-1}V_{\perp}^T,
\end{equation}
it is not difficult to show that the discriminant distance
\begin{equation}
(x-\hat{\mu}_k)^T\hat{\bm{\Sigma}}^{-1}(x-\hat{\mu}_k) = \|V^Tx - V^T\bar{x}_k\|^2 + (\text{terms not depending on } k)
\end{equation}
coincides with relative Euclidean distances in the reduced LDA space, since \(V_{\perp}^T\hat{\mu}_k\) does not depend on \(k\) and \(V^T\hat{\mu}_k = V^T\bar{x}_k\); hence classification based on the fitted constrained Gaussian model and rank-\(L\) LDA coincide.
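As a final sanity check, again continuing the same illustrative sketch (equal class priors assumed, and using the projected means in the form written above), one can verify that classification with the constrained Gaussian fit agrees with nearest-centroid classification in the reduced LDA coordinates \(V^Tx\).

```python
# Continue the sketch: constrained-MLE means (class means projected onto the
# leading discriminant subspace in the W metric) vs. reduced-rank LDA.
mu_hat = xbar + (xbar_k - xbar) @ V @ V.T @ W   # hat{mu}_k = xbar + W V V^T (xbar_k - xbar)
Sigma_hat_inv = np.linalg.inv(Sigma_hat)

# Rule 1: Gaussian discriminant with the constrained MLEs (equal priors).
d_full = np.array([np.einsum('ij,jk,ik->i', X - mu_hat[k], Sigma_hat_inv, X - mu_hat[k])
                   for k in range(K)])

# Rule 2: Euclidean nearest centroid in the L-dimensional LDA coordinates V^T x.
Z, Zbar_k = X @ V, xbar_k @ V
d_lda = np.array([((Z - Zbar_k[k]) ** 2).sum(axis=1) for k in range(K)])

# The two rules assign every point to the same class.
assert np.array_equal(d_full.argmin(axis=0), d_lda.argmin(axis=0))
```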