Ex. 4.2

Suppose we have features \(x\in \mathbb{R}^p\), a two-class response, with class sizes \(N_1\), \(N_2\), and the target coded as \(-N/N_1\), \(N/N_2\).

(a)

Show that the LDA rule classifies to class 2 if

\[\begin{equation} \label{eq:ex43LDA} x^T\hat\Sigma^{-1}(\hat\mu_2 - \hat\mu_1) > \frac{1}{2}(\hat\mu_2 + \hat\mu_1)^T\hat\Sigma^{-1}(\hat\mu_2 - \hat\mu_1) - \log(N_2/N_1), \end{equation}\]

and class 1 otherwise.

(b)

Consider minimization of the least squares criterion

\[\begin{equation} \sum_{i=1}^N(y_i-\beta_0-x_i^T\beta)^2.\nonumber \end{equation}\]

Show that the solution \(\hat\beta\) satisfies

\[\begin{equation} \left[(N-2)\hat\Sigma + N\hat\Sigma_B\right]\hat\beta = N(\hat\mu_2-\hat\mu_1)\nonumber \end{equation}\]

(after simplification), where \(\hat\Sigma_B=\frac{N_1N_2}{N^2}(\hat\mu_2-\hat\mu_1)(\hat\mu_2-\hat\mu_1)^T\).

(c)

Hence show that \(\hat\Sigma_B\beta\) is in the direction \((\hat\mu_2-\hat\mu_1)\) and thus

\[\begin{equation} \hat\beta \propto \hat\Sigma^{-1}(\hat\mu_2 - \hat\mu_1).\nonumber \end{equation}\]

Therefore the least-squares regression coefficient is identical to the LDA coefficient, up to a scalar multiple.

(d)

Show that this result holds for any (distinct) coding of the two classes.

(e)

Find the solution \(\hat\beta_0\) (up to the same scalar multiple as in (c)), and hence the predicted value \(\hat f(x) = \hat\beta_0 + x^T\hat\beta\). Consider the following rule: classify to class 2 if \(\hat f(x) > 0\) and class 1 otherwise. Show this is not the same as the LDA rule unless the classes have equal numbers of observations.

(The use of multiple measurements in taxonomic problems, Pattern Recognition and Neural Networks)

Soln. 4.2

(a)

We have \(\pi_1=N_1/N\) and \(\pi_2 = N_2/N\). By (4.9) in the textbook, the LDA rule classifies to class 2 when \(\hat\delta_2(x)>\hat\delta_1(x)\), i.e., when \(x^T\hat\Sigma^{-1}\hat\mu_2 - \frac{1}{2}\hat\mu_2^T\hat\Sigma^{-1}\hat\mu_2 + \log\pi_2 > x^T\hat\Sigma^{-1}\hat\mu_1 - \frac{1}{2}\hat\mu_1^T\hat\Sigma^{-1}\hat\mu_1 + \log\pi_1\). Rearranging, using the symmetry of \(\hat\Sigma^{-1}\) to write \(\frac{1}{2}\hat\mu_2^T\hat\Sigma^{-1}\hat\mu_2 - \frac{1}{2}\hat\mu_1^T\hat\Sigma^{-1}\hat\mu_1 = \frac{1}{2}(\hat\mu_2+\hat\mu_1)^T\hat\Sigma^{-1}(\hat\mu_2-\hat\mu_1)\) and noting \(\log\pi_2-\log\pi_1 = \log(N_2/N_1)\), we obtain \(\eqref{eq:ex43LDA}\).
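
As a quick numerical sanity check (not part of the derivation), we can compare the rule \(\eqref{eq:ex43LDA}\) against a direct comparison of the two estimated discriminant functions. The sketch below does this on synthetic Gaussian data; the data, class sizes, and helper names (`delta`, `rule_a`) are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N1, N2, p = 30, 70, 3
X1 = rng.normal(0.0, 1.0, size=(N1, p))   # class 1 sample
X2 = rng.normal(1.0, 1.0, size=(N2, p))   # class 2 sample
N = N1 + N2

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
# pooled within-class covariance estimate with denominator N - 2
Sigma = ((X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)) / (N - 2)
Sinv = np.linalg.inv(Sigma)

def delta(x, mu, pi):
    """LDA discriminant function, (4.9) in the textbook, with estimates plugged in."""
    return x @ Sinv @ mu - 0.5 * mu @ Sinv @ mu + np.log(pi)

def rule_a(x):
    """The inequality derived in part (a)."""
    lhs = x @ Sinv @ (mu2 - mu1)
    rhs = 0.5 * (mu2 + mu1) @ Sinv @ (mu2 - mu1) - np.log(N2 / N1)
    return lhs > rhs

X = np.vstack([X1, X2])
print(all(rule_a(x) == (delta(x, mu2, N2 / N) > delta(x, mu1, N1 / N)) for x in X))
# expected: True
```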

(b)

We start by recalling the notation used in Chapter 3.

Let \(x_i^T = (x_{i1}, ..., x_{ip})\in \mathbb{R}^{1\times p}\), \(\textbf{1}^T = (1,...,1)\in \mathbb{R}^{1\times N}\), \(Y^T = (y_1, ..., y_N)\in \mathbb{R}^{1\times N}\), \(\beta^T = (\beta_{1}, ..., \beta_{p})\in \mathbb{R}^{1\times p}\).

\[\begin{equation} \textbf{X} = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{N1} & \cdots & x_{Np} \end{pmatrix} = \begin{pmatrix} 1 & x_1^T\\ 1 & x_2^T\\ \vdots & \vdots\\ 1 & x_N^T \end{pmatrix}\in \mathbb{R}^{N\times (p+1)}\nonumber \end{equation}\]

and

\[\begin{equation} \textbf{X}^T = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_{11} & x_{21} & \cdots & x_{N1} \\ \vdots & \vdots & \ddots & \vdots \\ x_{1p} & x_{2p} & \cdots & x_{Np} \end{pmatrix} = \begin{pmatrix} 1 & 1 & \cdots & 1\\ x_1 & x_2 & \cdots & x_N \end{pmatrix}\in \mathbb{R}^{(p+1)\times N}.\nonumber \end{equation}\]

It follows that

\[\begin{equation} \textbf{X}^T\textbf{X} = \begin{pmatrix} N & \sum_{i=1}^Nx_i^T\\ \sum_{i=1}^Nx_i & \sum_{i=1}^Nx_ix_i^T \end{pmatrix} \in \mathbb{R}^{(p+1)\times (p+1)}.\nonumber \end{equation}\]

From the normal equations for linear regression, e.g., (3.6) in the textbook, we have

\[\begin{equation} \label{eq:ex42normal} \textbf{X}^T\textbf{X}\begin{pmatrix} \beta_0\\ \beta \end{pmatrix} = \begin{pmatrix} N & \sum_{i=1}^Nx_i^T\\ \sum_{i=1}^Nx_i & \sum_{i=1}^Nx_ix_i^T \end{pmatrix} \begin{pmatrix} \beta_0\\ \beta \end{pmatrix}= \textbf{X}^TY= \begin{pmatrix} \sum_{i=1}^Ny_i\\ \sum_{i=1}^Ny_ix_i \end{pmatrix}\in \mathbb{R}^{(p+1)\times 1}.\nonumber \end{equation}\]

Therefore we have

\[\begin{eqnarray} N\beta_0 + \left(\sum_{i=1}^Nx_i^T\right)\beta &=& \sum_{i=1}^Ny_i\nonumber\\ \beta_0\sum_{i=1}^Nx_i + \left(\sum_{i=1}^Nx_ix_i^T\right)\beta&=&\sum_{i=1}^Ny_ix_i.\label{eq:ex43beta0} \end{eqnarray}\]

From the first equation above, we get

\[\begin{equation} \label{eq:ex43beta0solution} \beta_0 = \frac{1}{N}\left(\sum_{i=1}^Ny_i-\left(\sum_{i=1}^Nx_i^T\right)\beta\right), \end{equation}\]

plugging this into \(\eqref{eq:ex43beta0}\) and collecting the terms in \(\beta\), we obtain

\[\begin{equation} \left[\sum_{i=1}^Nx_ix_i^T - \frac{1}{N}\left(\sum_{i=1}^Nx_i\right)\left(\sum_{i=1}^Nx_i^T\right)\right]\beta = \sum_{i=1}^Ny_ix_i-\frac{1}{N}\left(\sum_{i=1}^Ny_i\right)\left(\sum_{i=1}^Nx_i\right).\label{eq:ex43beta} \end{equation}\]

We pause here and turn to \(\hat\mu_1, \hat\mu_2\) and \(\hat\Sigma\). Encode \(y_i\) as \(t_1\) for class 1 and as \(t_2\) for class 2. Note that for this particular part (b), \(t_1=-N/N_1\) and \(t_2=N/N_2\). By definition, we have

\[\begin{eqnarray} \hat\mu_1 = \frac{1}{N_1}\sum_{i:g_i=1}x_i, && \hat\mu_2 = \frac{1}{N_2}\sum_{i:g_i=2}x_i\nonumber\\ \sum_{i=1}^Nx_i = N_1\hat\mu_1 + N_2\hat\mu_2, && \sum_{i=1}^Nx_i^T = N_1\hat\mu_1^T + N_2\hat\mu_2^T\nonumber\\ \sum_{i=1}^Ny_i = N_1t_1 + N_2t_2, && \sum_{i=1}^Ny_ix_i = t_1N_1\hat\mu_1 + t_2N_2\hat\mu_2 \nonumber \end{eqnarray}\]

and

\[\begin{equation} \hat\Sigma = \frac{1}{N-2}\left(\sum_{i:g_i=1}(x_i-\hat\mu_1)(x_i-\hat\mu_1)^T + \sum_{i:g_i=2}(x_i-\hat\mu_2)(x_i-\hat\mu_2)^T\right).\nonumber \end{equation}\]

From the equations above we can rewrite

\[\begin{equation} \sum_{i=1}^Nx_ix_i^T = (N-2)\hat\Sigma + N_1\hat\mu_1\hat\mu_1^T + N_2\hat\mu_2\hat\mu_2^T.\nonumber \end{equation}\]
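
This identity is easy to check numerically; the minimal sketch below does so on synthetic two-class data (the data and variable names are illustrative, only the identity itself comes from the derivation).

```python
import numpy as np

rng = np.random.default_rng(1)
N1, N2, p = 40, 60, 4
X1 = rng.normal(0.0, 1.0, size=(N1, p))
X2 = rng.normal(2.0, 1.0, size=(N2, p))
N = N1 + N2

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
# pooled within-class covariance with denominator N - 2
Sigma = ((X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)) / (N - 2)

X = np.vstack([X1, X2])
lhs = X.T @ X                                               # sum_i x_i x_i^T
rhs = (N - 2) * Sigma + N1 * np.outer(mu1, mu1) + N2 * np.outer(mu2, mu2)
print(np.allclose(lhs, rhs))                                # expected: True
```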

Now we turn back to \(\eqref{eq:ex43beta}\). For the LHS,

\[\begin{eqnarray} &&\sum_{i=1}^Nx_ix_i^T - \frac{1}{N}\left(\sum_{i=1}^Nx_i\right)\left(\sum_{i=1}^Nx_i^T\right)\nonumber\\ &=&\sum_{i=1}^Nx_ix_i^T - \frac{1}{N}\Big(N_1\hat\mu_1 + N_2\hat\mu_2\Big)\Big(N_1\hat\mu_1^T + N_2\hat\mu_2^T\Big)\nonumber\\ &=&(N-2)\hat\Sigma + N_1\hat\mu_1\hat\mu_1^T + N_2\hat\mu_2\hat\mu_2^T - \frac{1}{N}\Big(N_1\hat\mu_1 + N_2\hat\mu_2\Big)\Big(N_1\hat\mu_1^T + N_2\hat\mu_2^T\Big)\nonumber\\ &=&(N-2)\hat\Sigma + \frac{N_1N_2}{N}(\hat\mu_2-\hat\mu_1)(\hat\mu_2-\hat\mu_1)^T\nonumber\\ &=&(N-2)\hat\Sigma + N\hat\Sigma_B, \label{eq:ex43LHS} \end{eqnarray}\]

where \(\hat\Sigma_B = \frac{N_1N_2}{N^2}(\hat\mu_2-\hat\mu_1)(\hat\mu_2-\hat\mu_1)^T\).

For the RHS of \(\eqref{eq:ex43beta}\),

\[\begin{eqnarray} &&\sum_{i=1}^Ny_ix_i-\frac{1}{N}\left(\sum_{i=1}^Ny_i\right)\left(\sum_{i=1}^Nx_i\right)\nonumber\\ &=&t_1N_1\hat\mu_1 + t_2N_2\hat\mu_2-\frac{1}{N}\Big(N_1t_1 + N_2t_2\Big)\Big(N_1\hat\mu_1 + N_2\hat\mu_2\Big)\nonumber\\ &=&\frac{N_1N_2}{N}(t_2-t_1)(\hat\mu_2-\hat\mu_1).\label{eq:ex43rhs} \end{eqnarray}\]

Combining \(\eqref{eq:ex43beta}\), \(\eqref{eq:ex43LHS}\) and \(\eqref{eq:ex43rhs}\) we get

\[\begin{equation} \label{eq:ex43b} \left[(N-2)\hat\Sigma + N\hat\Sigma_B\right]\beta = \frac{N_1N_2}{N}(t_2-t_1)(\hat\mu_2-\hat\mu_1). \end{equation}\]

Note that \(t_1=-N/N_1\), \(t_2 = N/N_2\) and \(N=N_1+N_2\), so that \(t_2-t_1=\frac{N^2}{N_1N_2}\).

Thus the right-hand side of \(\eqref{eq:ex43b}\) reduces to \(\frac{N_1N_2}{N}\cdot\frac{N^2}{N_1N_2}(\hat\mu_2-\hat\mu_1) = N(\hat\mu_2-\hat\mu_1)\), which completes the proof.
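
As a sanity check on part (b), the sketch below fits least squares with the \(-N/N_1\), \(N/N_2\) coding on synthetic data and verifies \(\eqref{eq:ex43b}\) with its right-hand side reduced to \(N(\hat\mu_2-\hat\mu_1)\); the data and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
N1, N2, p = 30, 50, 3
X1 = rng.normal(0.0, 1.0, size=(N1, p))
X2 = rng.normal(1.0, 1.0, size=(N2, p))
N = N1 + N2
X = np.vstack([X1, X2])

# least-squares fit with intercept, using the coding t1 = -N/N1, t2 = N/N2
y = np.concatenate([np.full(N1, -N / N1), np.full(N2, N / N2)])
Xd = np.column_stack([np.ones(N), X])
beta = np.linalg.lstsq(Xd, y, rcond=None)[0][1:]            # drop the intercept

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
Sigma = ((X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)) / (N - 2)
Sigma_B = (N1 * N2 / N**2) * np.outer(mu2 - mu1, mu2 - mu1)

lhs = ((N - 2) * Sigma + N * Sigma_B) @ beta
print(np.allclose(lhs, N * (mu2 - mu1)))                    # expected: True
```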

(c)

We have

\[\begin{equation} \hat\Sigma_B\beta = \frac{N_1N_2}{N^2}(\hat\mu_2-\hat\mu_1)(\hat\mu_2-\hat\mu_1)^T\beta,\nonumber \end{equation}\]

where \((\hat\mu_2-\hat\mu_1)^T\beta\in \mathbb{R}\) is a scalar; thus \(\hat\Sigma_B\beta\) is in the direction of \((\hat\mu_2-\hat\mu_1)\). By part (b), \((N-2)\hat\Sigma\hat\beta = N(\hat\mu_2-\hat\mu_1) - N\hat\Sigma_B\hat\beta\) is therefore also a multiple of \((\hat\mu_2-\hat\mu_1)\), so

\[\begin{equation} \hat\beta\propto \hat\Sigma^{-1}(\hat\mu_2 - \hat\mu_1).\nonumber \end{equation}\]

(d)

This follows directly from \(\eqref{eq:ex43b}\): for any coding with \(t_1\neq t_2\), the right-hand side \(\frac{N_1N_2}{N}(t_2-t_1)(\hat\mu_2-\hat\mu_1)\) is still a nonzero multiple of \((\hat\mu_2-\hat\mu_1)\), so the argument in (c) goes through unchanged and \(\hat\beta\propto\hat\Sigma^{-1}(\hat\mu_2-\hat\mu_1)\).
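
Parts (c) and (d) can also be confirmed numerically: for several distinct codings \((t_1, t_2)\), the least-squares slope should be a constant multiple of \(\hat\Sigma^{-1}(\hat\mu_2-\hat\mu_1)\). A minimal sketch, again on synthetic data with illustrative names:

```python
import numpy as np

rng = np.random.default_rng(3)
N1, N2, p = 25, 75, 3
X1 = rng.normal(0.0, 1.0, size=(N1, p))
X2 = rng.normal(1.0, 1.0, size=(N2, p))
N = N1 + N2
X = np.vstack([X1, X2])
Xd = np.column_stack([np.ones(N), X])

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
Sigma = ((X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)) / (N - 2)
lda_dir = np.linalg.solve(Sigma, mu2 - mu1)                 # Sigma^{-1} (mu2 - mu1)

# three distinct codings, including the one used in part (b)
for t1, t2 in [(-N / N1, N / N2), (0.0, 1.0), (-1.0, 1.0)]:
    y = np.concatenate([np.full(N1, t1), np.full(N2, t2)])
    beta = np.linalg.lstsq(Xd, y, rcond=None)[0][1:]
    ratio = beta / lda_dir                                  # constant vector if proportional
    print(np.allclose(ratio, ratio[0]))                     # expected: True for each coding
```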

(e)

Assuming the coding \(-N/N_1\) and \(N/N_2\), we have \(\sum_{i=1}^Ny_i = N_1(-N/N_1) + N_2(N/N_2) = 0\), so by \(\eqref{eq:ex43beta0solution}\) we have

\[\begin{eqnarray} \hat\beta_0 &=& \frac{1}{N}\left(\sum_{i=1}^Ny_i-\left(\sum_{i=1}^Nx_i^T\right)\hat\beta\right)\nonumber\\ &=&-\frac{1}{N}\left(\sum_{i=1}^Nx_i^T\right)\hat\beta\nonumber\\ &=&-\frac{1}{N}(N_1\hat\mu_1^T + N_2\hat\mu_2^T)\hat\beta\nonumber \end{eqnarray}\]

so that

\[\begin{equation} \hat f(x) = \hat\beta_0 + x^T\hat\beta = \left[x^T-\frac{1}{N}(N_1\hat\mu_1^T + N_2\hat\mu_2^T)\right]\hat\beta.\nonumber \end{equation}\]

Since \(\hat\beta\propto \hat\Sigma^{-1}(\hat\mu_2-\hat\mu_1)\), there exists a nonzero scalar \(\lambda\) such that \(\hat\beta = \lambda \hat\Sigma^{-1}(\hat\mu_2-\hat\mu_1)\); without loss of generality take \(\lambda > 0\) (if \(\lambda < 0\) we simply flip the sign of the classification rule). Therefore, \(\hat f(x) > 0\) is equivalent to

\[\begin{equation} \left[x^T-\frac{1}{N}(N_1\hat\mu_1^T + N_2\hat\mu_2^T)\right]\hat\Sigma^{-1}(\hat\mu_2-\hat\mu_1) > 0,\nonumber \end{equation}\]

which is equivalent to the LDA rule \(\eqref{eq:ex43LDA}\) when \(N_1 = N_2\): in that case \(\frac{1}{N}(N_1\hat\mu_1 + N_2\hat\mu_2) = \frac{1}{2}(\hat\mu_1+\hat\mu_2)\) and \(\log(N_2/N_1) = 0\), so the two thresholds coincide. When \(N_1\neq N_2\), the thresholds differ (in particular \(\log (N_2/N_1) \neq 0\) in \(\eqref{eq:ex43LDA}\)), so the two rules are not the same.
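
Finally, a small numerical illustration of (e): on synthetic data, the least-squares rule \(\hat f(x) > 0\) and the LDA rule \(\eqref{eq:ex43LDA}\) make identical decisions when \(N_1 = N_2\), and typically disagree on some points near the boundary when \(N_1 \neq N_2\). The data, class sizes, and the function name `agreement` below are illustrative assumptions.

```python
import numpy as np

def agreement(N1, N2, p=2, seed=4):
    """Fraction of training points on which the two rules give the same class."""
    rng = np.random.default_rng(seed)
    X1 = rng.normal(0.0, 1.0, size=(N1, p))
    X2 = rng.normal(1.0, 1.0, size=(N2, p))
    N = N1 + N2
    X = np.vstack([X1, X2])

    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    Sigma = ((X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)) / (N - 2)
    Sinv = np.linalg.inv(Sigma)

    # least squares with the -N/N1, N/N2 coding: classify to class 2 if f_hat(x) > 0
    y = np.concatenate([np.full(N1, -N / N1), np.full(N2, N / N2)])
    Xd = np.column_stack([np.ones(N), X])
    coef = np.linalg.lstsq(Xd, y, rcond=None)[0]
    ls_rule = Xd @ coef > 0

    # LDA rule from part (a)
    lda_rule = X @ Sinv @ (mu2 - mu1) > (
        0.5 * (mu2 + mu1) @ Sinv @ (mu2 - mu1) - np.log(N2 / N1)
    )
    return np.mean(ls_rule == lda_rule)

print(agreement(50, 50))   # expected: 1.0 (the rules coincide)
print(agreement(20, 80))   # typically < 1.0 (the rules differ near the boundary)
```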