Ex. 4.8
Consider the multivariate Gaussian model \(X|G=k \sim N(\mu_k, \bm{\Sigma})\), with the additional restriction that \(\text{rank}\{\mu_k\}_1^K = L < \max(K-1,p)\). Derive the constrained MLEs for the \(\mu_k\) and \(\bm{\Sigma}\). Show that the Bayes classification rule is equivalent to classifying in the reduced subspace computed by LDA (Hastie and Tibshirani, "Discriminant adaptive nearest neighbor classification", 1996).
Soln. 4.8
We are going to maximize the log-likelihood (dropping additive constants)
\begin{equation}
\ell(\{\mu_k\},\bm{\Sigma}) = -\frac{N}{2}\log|\bm{\Sigma}| - \frac{1}{2}\sum_{k=1}^K\sum_{g_i=k}(x_i-\mu_k)^T\bm{\Sigma}^{-1}(x_i-\mu_k)
\label{eq:4.8loglikelihood}
\end{equation}
subject to \(\text{rank}\{\mu_k\}_{k=1}^K = L < \max(K-1,p)\).
Let \(B\) be the between-class covariance matrix, \(W\) the pooled within-class covariance matrix, and, for fixed \(\bm{\Sigma}\), let \(V\) denote the matrix of the leading \(L\) eigenvectors of \(\bm{\Sigma}^{-1}B\).
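As a purely illustrative numerical sketch (not part of the original solution), the quantities used below can be computed as follows; the simulated data and all variable names (`X`, `y`, `L`, and so on) are assumptions made here for concreteness.

```python
# Illustrative sketch: simulate Gaussian data whose centered class means have
# rank L, then compute W, B, the leading discriminant matrix V and its
# complement V_perp (all quantities as used in the solution above).
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
p, K, L, n_per = 5, 4, 2, 500

# Class means lying in an L-dimensional subspace.
A = rng.normal(size=(p, L))
mu = rng.normal(size=(K, L)) @ A.T
X = np.vstack([mu[k] + rng.normal(size=(n_per, p)) for k in range(K)])
y = np.repeat(np.arange(K), n_per)

N = len(y)
xbar = X.mean(axis=0)
xbar_k = np.array([X[y == k].mean(axis=0) for k in range(K)])
N_k = np.array([(y == k).sum() for k in range(K)])

# Pooled within-class covariance W and between-class covariance B.
W = sum((X[y == k] - xbar_k[k]).T @ (X[y == k] - xbar_k[k]) for k in range(K)) / N
B = (N_k[:, None] * (xbar_k - xbar)).T @ (xbar_k - xbar) / N

# Generalized eigenproblem B v = gamma W v; eigh normalizes v so that v^T W v = 1.
gamma, vecs = eigh(B, W)
order = np.argsort(gamma)[::-1]                 # decreasing eigenvalues
V, V_perp = vecs[:, order[:L]], vecs[:, order[L:]]
D_L, D_rest = np.diag(gamma[order[:L]]), np.diag(gamma[order[L:]])

assert np.allclose(V.T @ W @ V, np.eye(L))              # V^T W V = I_L
assert np.allclose(V_perp.T @ W @ V, 0, atol=1e-8)      # V_perp^T W V = 0
```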
- (a) \(\bm{\Sigma}\) Known, \(\mu_j\) Unknown.
In Section 12.5.2 of Multivariate Analysis, a solution to \(\eqref{eq:4.8loglikelihood}\) is given assuming that \(\bm{\Sigma}\) is known. This has the form of the usual LDA solution, except with \(W\) replaced by \(\bm{\Sigma}\). We can write the estimated means as
\begin{equation}
\hat{\mu}_k = \bar{x} + \bm{\Sigma}VV^T(\bar{x}_k - \bar{x}),
\label{eq:4.8hatmuj}
\end{equation}
where \(\bar{x}_k\) is the sample mean of class \(k\), \(\bar{x}\) is the overall mean, and \(V\) is normalized so that \(V^T\bm{\Sigma}V=I_L\), and thus the estimated (rank \(L\)) between matrix as
\begin{equation}
\hat{B} = \sum_{k=1}^K\frac{N_k}{N}(\hat{\mu}_k - \bar{x})(\hat{\mu}_k - \bar{x})^T = \bm{\Sigma}VV^TBVV^T\bm{\Sigma}.
\end{equation}
- (b) \(\mu_j\) Known, \(\bm{\Sigma}\) Unknown.
Although the case \(\mu_j\) known, \(\bm{\Sigma}\) unknown is not explicitly stated in Multivariate Analysis, we deduce (and easily check) from their equation (4.2.7) on p. 104 that
\begin{equation}
\hat{\bm{\Sigma}} = \frac{1}{N}\sum_{k=1}^K\sum_{g_i=k}(x_i - \mu_k)(x_i - \mu_k)^T.
\label{eq:4.8hatsigam}
\end{equation}
These are each obtained by solving the partial score equations for \(\mu_j\) or \(\bm{\Sigma}\), assuming that the other is known. The full maximum likelihood solution requires their simultaneous solution and suggests iteration. However, the solution turns out to be simpler. We plug the estimated means \(\eqref{eq:4.8hatmuj}\) (using \(W\) for \(\bm{\Sigma}\)) into equation \(\eqref{eq:4.8hatsigam}\), which gives
\begin{equation}
\hat{\bm{\Sigma}} = W + WV_{\perp}V_{\perp}^TBV_{\perp}V_{\perp}^TW,
\end{equation}
where \(V_{\perp}\) holds the remaining \(p-L\) eigenvectors, normalized so that \(V_{\perp}^TWV=0\) and \(V_{\perp}^TWV_{\perp}=I_{p-L}\), and spans the complementary \((p-L)\)-dimensional subspace of \(\mathbb{R}^p\). To complete the proof, we show that the same \(V\) is optimal using the new metric \(\hat{\bm{\Sigma}}\).
First note that
(1) \(V^T\hat{\bm{\Sigma}} V =V^TWV + 0 = I_L\) and
(2) \(BV = WVD_L=\hat{\bm{\Sigma}} VD_L\), where the first equality is the definition of \(V\), the second follows because \(\hat{\bm{\Sigma}}V = WV\) (since \(V_{\perp}^TWV=0\)), and \(D_L=\text{diag}(\gamma_1,\dots,\gamma_L)\) holds the leading eigenvalues.
We have established that \(V\) is also an eigenmatrix of \(B\) with respect to \(\hat{\bm{\Sigma}}\); it remains to show that it still corresponds to the \(L\) largest eigenvalues. Note that
\begin{equation}
BV_{\perp} = WV_{\perp}D_{p-L} = \hat{\bm{\Sigma}}V_{\perp}(I_{p-L}+D_{p-L})^{-1}D_{p-L},
\end{equation}
where \(D_{p-L}\) contains the eigenvalues of \(B\) with respect to \(W\) corresponding to \(V_{\perp}\), and the second equality uses \(\hat{\bm{\Sigma}}V_{\perp} = WV_{\perp}(I_{p-L}+D_{p-L})\). So, in the metric \(\hat{\bm{\Sigma}}\), the columns of \(V\) are orthonormal eigenvectors of \(B\), and the columns of \(V_{\perp}\) are mutually orthogonal and orthogonal to \(V\). Thus the columns of \(V_{\perp}\) remain eigenvectors of \(B\) with respect to \(\hat{\bm{\Sigma}}\), with eigenvalues \((I_{p-L}+D_{p-L})^{-1}D_{p-L}\le D_{p-L}\), so the ordering of the eigenvalues does not change and \(V\) still spans the leading \(L\)-dimensional discriminant subspace.
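Continuing the illustrative sketch above (same hypothetical variables), one can check numerically the relations just derived: the plug-in covariance \(\hat{\bm{\Sigma}} = W + WV_{\perp}V_{\perp}^TBV_{\perp}V_{\perp}^TW\), facts (1) and (2), and the shrunken eigenvalues attached to \(V_{\perp}\).

```python
# Continue the sketch: form Sigma_hat and verify facts (1), (2) and the
# shrunken eigenvalues associated with V_perp.
Sigma_hat = W + W @ V_perp @ V_perp.T @ B @ V_perp @ V_perp.T @ W

# (1) V^T Sigma_hat V = I_L and (2) B V = Sigma_hat V D_L.
assert np.allclose(V.T @ Sigma_hat @ V, np.eye(L))
assert np.allclose(B @ V, Sigma_hat @ V @ D_L)

# B V_perp = Sigma_hat V_perp (I + D_rest)^{-1} D_rest: the trailing eigenvalues
# shrink, so V still carries the L largest eigenvalues of B w.r.t. Sigma_hat.
shrunk = np.linalg.inv(np.eye(p - L) + D_rest) @ D_rest
assert np.allclose(B @ V_perp, Sigma_hat @ V_perp @ shrunk)

# Recompute the leading-L eigenvectors of B w.r.t. Sigma_hat directly and check
# they span the same subspace as V (both bases are Sigma_hat-orthonormal, so
# equal spans give equal outer products V V^T).
g2, v2 = eigh(B, Sigma_hat)
V2 = v2[:, np.argsort(g2)[::-1][:L]]
assert np.allclose(V2 @ V2.T, V @ V.T, atol=1e-8)
```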
This shows that the constrained maximum likelihood estimated means coincide with the rank \(L\) LDA means. Using the fact that
\begin{equation}
\hat{\bm{\Sigma}}^{-1} = VV^T + V_{\perp}(I_{p-L}+D_{p-L})^{-1}V_{\perp}^T,
\end{equation}
it is not difficult to show that the discriminant distance
\begin{equation}
(x-\hat{\mu}_k)^T\hat{\bm{\Sigma}}^{-1}(x-\hat{\mu}_k) = \|V^Tx - V^T\bar{x}_k\|^2 + (\text{terms not depending on } k)
\end{equation}
coincides with relative Euclidean distances in the reduced LDA space, since \(V_{\perp}^T\hat{\mu}_k\) does not depend on \(k\) and \(V^T\hat{\mu}_k = V^T\bar{x}_k\); hence classification based on the fitted constrained Gaussian model and rank-\(L\) LDA coincide.
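As a final sanity check, again continuing the same illustrative sketch (equal class priors assumed, and using the projected means in the form written above), one can verify that classification with the constrained Gaussian fit agrees with nearest-centroid classification in the reduced LDA coordinates \(V^Tx\).

```python
# Continue the sketch: constrained-MLE means (class means projected onto the
# leading discriminant subspace in the W metric) vs. reduced-rank LDA.
mu_hat = xbar + (xbar_k - xbar) @ V @ V.T @ W   # hat{mu}_k = xbar + W V V^T (xbar_k - xbar)
Sigma_hat_inv = np.linalg.inv(Sigma_hat)

# Rule 1: Gaussian discriminant with the constrained MLEs (equal priors).
d_full = np.array([np.einsum('ij,jk,ik->i', X - mu_hat[k], Sigma_hat_inv, X - mu_hat[k])
                   for k in range(K)])

# Rule 2: Euclidean nearest centroid in the L-dimensional LDA coordinates V^T x.
Z, Zbar_k = X @ V, xbar_k @ V
d_lda = np.array([((Z - Zbar_k[k]) ** 2).sum(axis=1) for k in range(K)])

# The two rules assign every point to the same class.
assert np.array_equal(d_full.argmin(axis=0), d_lda.argmin(axis=0))
```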