Soln. 3.21
The problem can be rewritten as
\[\begin{eqnarray}
\underset{\text{rank}(\bb{B})=m}{\operatorname{argmin}} \text{trace}\left[(\bY-\bX\bb{B})\bm{\Sigma}^{-1}(\bY-\bX\bb{B})^T\right].\non
\end{eqnarray}\]
By properties of trace operator and the fact that \(\bm{\Sigma}\) is definite positive and symmetric, we have
\[\begin{eqnarray}
&&\text{trace}\left[(\bY-\bX\bb{B})\bm{\Sigma}^{-1}(\bY-\bX\bb{B})^T\right]\non\\
&=&\text{trace}\left[(\bY\bm{\Sigma}^{-\frac{1}{2}}-\bX\bb{B}\bm{\Sigma}^{-\frac{1}{2}})(\bY\bm{\Sigma}^{-\frac{1}{2}}-\bX\bb{B}\bm{\Sigma}^{-\frac{1}{2}})^T\right]\non\\
&=&\text{trace}\left[(\bY\bm{\Sigma}^{-\frac{1}{2}}-\bX\bb{B}\bm{\Sigma}^{-\frac{1}{2}})(\bm{\Sigma}^{-\frac{1}{2}}\bY^T-\bm{\Sigma}^{-\frac{1}{2}}\bB^T\bX^T)\right]\non\\
&=&\text{trace}\left[\bY\bm{\Sigma}^{-1}\bY^T + \bX\bB\bm{\Sigma}^{-\frac{1}{2}}\bm{\Sigma}^{-\frac{1}{2}}\bB^T\bX^T - 2\bm{\Sigma}^{-\frac{1}{2}}\bY^T\bX\bB\bm{\Sigma}^{-\frac{1}{2}} \right]\non\\
&=&\text{trace}\left[\bY\bm{\Sigma}^{-1}\bY^T + \bm{\Sigma}^{-\frac{1}{2}}\bB^T\bX^T\bX\bB\bm{\Sigma}^{-\frac{1}{2}} - 2\bm{\Sigma}^{-\frac{1}{2}}\bY^T\bX\bB\bm{\Sigma}^{-\frac{1}{2}} \right]\non\\
&=&\text{trace}\left[\bY\bm{\Sigma}^{-1}\bY^T+\bm{\Sigma}^{-1}\bY^T\bX(\bX^T\bX)^{-1}\bX^T\bY\right] + \text{trace}\left[\bb{C}\bb{C}^T\right]\label{eq:3-21a}
\end{eqnarray}\]
where
\[\begin{equation}
\label{eq:3-21b}
\bb{C} = \bm{\Sigma}^{-\frac{1}{2}}\bB^T(\bX^T\bX)^{\frac{1}{2}}-\bm{\Sigma}^{-\frac{1}{2}}(\bY^T\bX)(\bX^T\bX)^{-\frac{1}{2}}.\non
\end{equation}\]
Note that the first summand in \(\eqref{eq:3-21a}\) is independent of \(\bB\), therefore the original problem reduces to the standard low-rank approximation problem
\[\begin{equation}
\label{eq:3-21c}
\underset{\text{rank}(\bb{B})=m}{\operatorname{argmin}}\|\bm{\Sigma}^{-\frac{1}{2}}\bB^T(\bX^T\bX)^{\frac{1}{2}}-\bm{\Sigma}^{-\frac{1}{2}}(\bY^T\bX)(\bX^T\bX)^{-\frac{1}{2}}\|_F^2,
\end{equation}\]
where \(\|M\|_F^2 = \text{trace}(M^TM)\) is the squared Frobenius norm for matrix \(M\) (see, e.g., page 540 in the text)
By standard results in low-rank approximation (e.g., Eckart–Young–Mirsky theorem), we know the optimal \(\hat\bB(m)\) with rank \(m\) that solves problem
\(\eqref{eq:3-21c}\) satisfies
\[\begin{equation}
\label{eq:3-21d}
\bm{\Sigma}^{-\frac{1}{2}}\hat\bB(m)^T(\bX^T\bX)^{\frac{1}{2}} = \sum_{i=1}^m d_i\bb{u}_i\bb{v}_i^T
\end{equation}\]
where \(\bb{u}_i\), \(\bb{v}_i\) and \(d_i\) are obtained from the following SVD
\[\begin{equation}
\label{eq:3-21e}
\bm{\Sigma}^{-\frac{1}{2}}(\bY^T\bX)(\bX^T\bX)^{-\frac{1}{2}} = \bb{U}^\ast\bb{D}^\ast\bb{V}^{\ast T}.
\end{equation}\]
Since \(\bb{U}^\ast\) has orthogonal columns, \(\eqref{eq:3-21e}\) yields
\[\begin{equation}
\bb{U}^{\ast ^T}\bm{\Sigma}^{-\frac{1}{2}}(\bY^T\bX)(\bX^T\bX)^{-\frac{1}{2}} = \bb{D}^\ast\bb{V}^{\ast ^T}\non
\end{equation}\]
and thus by matrix transpose
\[\begin{equation}
\bb{V}^\ast\bb{D}^\ast = (\bX^T\bX)^{-\frac{1}{2}}\bX^T\bY\bm{\Sigma}^{-\frac{1}{2}}\bb{U}^\ast.\non
\end{equation}\]
Then, by \(\eqref{eq:3-21d}\), we know
\[\begin{eqnarray}
\label{eq:3-21f}
\hat \bB(m) &=& (\bX^T\bX)^{-\frac{1}{2}}\left(\sum_{i=1}^m (\bb{v}_i d_i)\cdot \bb{u}_i^T\right)\bm{\Sigma}^{\frac{1}{2}}\non\\
&=&(\bX^T\bX)^{-\frac{1}{2}}\left((\bX^T\bX)^{-\frac{1}{2}}\bX^T\bY\bm{\Sigma}^{-\frac{1}{2}}\bb{U}_m^\ast \cdot \bb{U}_m^{\ast ^T}\right)\bm{\Sigma}^{\frac{1}{2}}\non\\
&=&(\bX^T\bX)^{-1}\bX^T\bY\bb{U}_m\bb{U}_m^-\non\\
&=&\hat \bB\bb{U}_m\bb{U}_m^-,
\end{eqnarray}\]
where \(\bb{U}_m=\bm{\Sigma}^{-\frac{1}{2}}\bb{U}_m^\ast\) and \(\bb{U}_m^- = \bb{U}_m^{\ast T}\bm{\Sigma}^{\frac{1}{2}}\).
Note that by far we haven't used the assumption that \(\bm{\Sigma}=\bY^T\bY/N\). By \(\eqref{eq:3-21c}\), it's easy to see that the solution remains unchanged if we multiply a constant to \(\bm{\Sigma}\). So without loss of generality, we assume \(\bm{\Sigma} = \bY^T\bY\), then \(\eqref{eq:3-21e}\) becomes exactly the same as (3.87) in the text.