Ex. 4.3

Suppose we transform the original predictors \(\bb{X}\) to \(\boldsymbol{\hat Y}\) via linear regression. In detail, let \(\boldsymbol{\hat Y} = \bb{X}(\bb{X}^T\bb{X})^{-1}\bb{X}^T\bb{Y} = \bb{X}\hat B\) where \(\bb{Y}\) is the indicator response matrix. Similarly for any input \(x\in\mathbb{R}^p\), we get a transformed vector \(\hat y = \hat B^Tx\in \mathbb{R}^K\). Show that LDA using \(\boldsymbol{\hat Y}\) is identical to LDA in the original space.

Soln. 4.3

We start by recalling the notation used in Chapter 3.

Let \(x_i^T = (x_{i1}, ..., x_{ip})\in \mathbb{R}^{1\times p}\), \(\bb{1}^T = (1,...,1)\in \mathbb{R}^{1\times N}\), \(Y^T = (y_1, ..., y_N)\in \mathbb{R}^{1\times N}\), \(\beta^T = (\beta_{1}, ..., \beta_{p})\in \mathbb{R}^{1\times p}\). Let

\[\begin{equation} \bb{X} = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{N1} & \cdots & x_{Np} \end{pmatrix} = \begin{pmatrix} 1 & x_1^T\\ 1 & x_2^T\\ \vdots & \vdots\\ 1 & x_N^T \end{pmatrix}\in \mathbb{R}^{N\times (p+1)}\non \end{equation}\]

and

\[\begin{equation} \bb{X}^T = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_{11} & x_{21} & \cdots & x_{N1} \\ \vdots & \vdots & \ddots & \vdots \\ x_{1p} & x_{2p} & \cdots & x_{Np} \end{pmatrix} = \begin{pmatrix} 1 & 1 & \cdots & 1\\ x_1 & x_2 & \cdots & x_N \end{pmatrix}\in \mathbb{R}^{(p+1)\times N}.\non \end{equation}\]

We have

\[\begin{equation} \boldsymbol{\hat Y} = \bb{X}(\bb{X}^T\bb{X})^{-1}\bb{X}^T\bb{Y} = \bb{X}\hat B\in \mathbb{R}^{N\times K},\non \end{equation}\]
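To make the setup concrete, here is a minimal numerical sketch of the transformation on synthetic data (the sample sizes, variable names such as `X0`, `g`, `B_hat`, and the data itself are illustrative assumptions, not part of the exercise): we build the indicator response matrix \(\bb{Y}\), solve the least-squares problem for \(\hat B\), and form \(\boldsymbol{\hat Y} = \bb{X}\hat B\).

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, K = 150, 4, 3                      # sample size, predictors, classes

# Hypothetical predictors and class labels, for illustration only.
X0 = rng.normal(size=(N, p))
g = rng.integers(0, K, size=N)

# Design matrix with the intercept column, as defined above.
X = np.hstack([np.ones((N, 1)), X0])     # N x (p+1)

# Indicator response matrix Y.
Y = np.zeros((N, K))
Y[np.arange(N), g] = 1.0                 # N x K

# Least-squares coefficients and fitted values.
B_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # (p+1) x K
Y_hat = X @ B_hat                           # N x K
print(B_hat.shape, Y_hat.shape)             # (5, 3) (150, 3)
```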

and \(\hat y = \hat B^Tx\) for a single training sample \(x\), where \(x\) is augmented with a leading \(1\) so that it matches the rows of \(\bb{X}\). We estimate the parameters of the Gaussian distributions from the transformed data, denoted by \(\pi_k^{\text{new}}\), \(\hat\mu_k^{\text{new}}\) and \(\hat\Sigma^{\text{new}}\), and relate them back to \(\pi_k\), \(\hat\mu_k\) and \(\hat\Sigma\) estimated from the original training data.

First, \(\pi_k^{\text{new}} = \pi_k\) for \(k=1,...,K\), since the class memberships of the training samples do not change.

Second, by the definition of \(\hat\mu_k^{\text{new}}\) (again, the class memberships do not change), we have

\[\begin{eqnarray} \hat\mu_k^{\text{new}} &=& \sum_{g_i=k}\frac{\hat B^Tx_i}{N_k}\non\\ &=&\hat B^T\sum_{g_i=k}\frac{x_i}{N_k}\non\\ &=&\hat B^T \hat\mu_k.\non \end{eqnarray}\]
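Continuing the sketch above, the identity \(\hat\mu_k^{\text{new}} = \hat B^T\hat\mu_k\) can be checked numerically on the same synthetic data:

```python
# Continuing the sketch above: class means of the (augmented) inputs and of
# the transformed data Y_hat, computed class by class.
mu = np.array([X[g == k].mean(axis=0) for k in range(K)])          # K x (p+1)
mu_new = np.array([Y_hat[g == k].mean(axis=0) for k in range(K)])  # K x K

# Row k of mu @ B_hat is B_hat^T mu_k, which should match mu_new_k.
print(np.allclose(mu_new, mu @ B_hat))   # True
```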

Third, by the definition of \(\hat\Sigma^{\text{new}}\) and the result above, we have

\[\begin{eqnarray} \hat\Sigma^{\text{new}} &=& \sum_{k=1}^K\sum_{g_i=k}(\hat B^Tx_i-\hat\mu_k^{\text{new}})(\hat B^Tx_i-\hat\mu_k^{\text{new}})^T/(N-K)\non\\ &=&\frac{1}{N-K}\sum_{k=1}^K\sum_{g_i=k}\hat B^T(x_i-\hat\mu_k)(x_i - \hat \mu_k)^T\hat B\non\\ &=&\hat B^T\left[\frac{1}{N-K}\sum_{k=1}^K\sum_{g_i=k}(x_i-\hat\mu_k)(x_i - \hat \mu_k)^T\right]\hat B\non\\ &=&\hat B^T\hat\Sigma\hat B.\non \end{eqnarray}\]
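Likewise, the pooled covariance identity \(\hat\Sigma^{\text{new}} = \hat B^T\hat\Sigma\hat B\) can be checked on the same synthetic data; the helper `pooled_cov` below is an illustrative function, not part of the exercise.

```python
# Continuing the sketch: pooled within-class covariance with denominator N - K.
def pooled_cov(Z, labels, means, K):
    S = sum((Z[labels == k] - means[k]).T @ (Z[labels == k] - means[k])
            for k in range(K))
    return S / (len(Z) - K)

Sigma = pooled_cov(X, g, mu, K)              # (p+1) x (p+1)
Sigma_new = pooled_cov(Y_hat, g, mu_new, K)  # K x K

# Sigma_new should equal B_hat^T Sigma B_hat.
print(np.allclose(Sigma_new, B_hat.T @ Sigma @ B_hat))   # True
```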

Therefore, assuming \(\hat B\) is square and invertible (so that \((\hat\Sigma^{\text{new}})^{-1} = (\hat B^T\hat\Sigma\hat B)^{-1} = \hat B^{-1}\hat\Sigma^{-1}(\hat B^T)^{-1}\)), the new linear discriminant function is

\[\begin{eqnarray} \delta_k^{\text{new}}(x) &=& (\hat B^Tx)^T(\hat\Sigma^{\text{new}})^{-1}\hat\mu_k^{\text{new}} - \frac{1}{2}(\hat\mu_k^{\text{new}})^T(\hat\Sigma^{\text{new}})^{-1}\hat\mu_k^{\text{new}} + \log \pi_k^{\text{new}}\non\\ &=&x^T\hat B (\hat B)^{-1}(\hat\Sigma)^{-1}(\hat B^T)^{-1}\hat B^T \hat\mu_k -\frac{1}{2}\hat\mu_k^T\hat B (\hat B)^{-1}(\hat\Sigma)^{-1}(\hat B^T)^{-1}\hat B^T\hat\mu_k + \log \pi_k\non\\ &=&x^T\hat\Sigma^{-1}\hat\mu_k-\frac{1}{2}\hat\mu_k^T(\hat\Sigma)^{-1}\hat\mu_k + \log \pi_k, \non \end{eqnarray}\]

which is identical to the discriminant function used in the original space.
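As a standalone sanity check of this last algebraic step, the sketch below verifies numerically that \(\delta_k^{\text{new}}(\hat B^Tx) = \delta_k(x)\) when \(\hat B\) is square and invertible and \(\hat\Sigma\) is nonsingular. The random matrices merely stand in for the estimated quantities and are not derived from any data; they are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
B = rng.normal(size=(d, d))           # stands in for B_hat; generically invertible
A = rng.normal(size=(d, d))
Sigma = A @ A.T + d * np.eye(d)       # stands in for Sigma_hat; positive definite
mu_k = rng.normal(size=d)             # stands in for mu_hat_k
x = rng.normal(size=d)
log_pi_k = np.log(0.3)

def delta(x, Sigma, mu, log_pi):
    """Linear discriminant: x^T Sigma^{-1} mu - mu^T Sigma^{-1} mu / 2 + log pi."""
    Sinv = np.linalg.inv(Sigma)
    return x @ Sinv @ mu - 0.5 * mu @ Sinv @ mu + log_pi

# delta_k^new evaluated at B^T x with the transformed parameters...
lhs = delta(B.T @ x, B.T @ Sigma @ B, B.T @ mu_k, log_pi_k)
# ...equals delta_k evaluated at x with the original parameters.
rhs = delta(x, Sigma, mu_k, log_pi_k)
print(np.isclose(lhs, rhs))   # True
```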