Ex. 4.2
Suppose we have features \(x\in \mathbb{R}^p\), a two-class response, with class sizes \(N_1\), \(N_2\), and the target coded as \(-N/N_1\), \(N/N_2\).
(a)
Show that the LDA rule classifies to class 2 if
\begin{equation}
x^T\hat\Sigma^{-1}(\hat\mu_2-\hat\mu_1) > \frac{1}{2}\hat\mu_2^T\hat\Sigma^{-1}\hat\mu_2 - \frac{1}{2}\hat\mu_1^T\hat\Sigma^{-1}\hat\mu_1 + \log\frac{N_1}{N} - \log\frac{N_2}{N},\label{eq:ex43LDA}
\end{equation}
and class 1 otherwise.
(b)
Consider minimization of the least squares criterion
\begin{equation}
\sum_{i=1}^N\left(y_i - \beta_0 - \beta^Tx_i\right)^2.
\end{equation}
Show that the solution \(\hat\beta\) satisfies
\begin{equation}
\left[(N-2)\hat\Sigma + N\hat\Sigma_B\right]\beta = N(\hat\mu_2-\hat\mu_1)
\end{equation}
(after simplification), where \(\hat\Sigma_B=\frac{N_1N_2}{N^2}(\hat\mu_2-\hat\mu_1)(\hat\mu_2-\hat\mu_1)^T\).
(c)
Hence show that \(\hat\Sigma_B\beta\) is in the direction \((\hat\mu_2-\hat\mu_1)\), and thus
\begin{equation}
\hat\beta\propto\hat\Sigma^{-1}(\hat\mu_2-\hat\mu_1).
\end{equation}
Therefore the least-squares regression coefficient is identical to the LDA coefficient, up to a scalar multiple.
(d)
Show that this result holds for any (distinct) coding of the two classes.
(e)
Find the solution \(\hat\beta_0\) (up to the same scalar multiple as in (c)), and hence the predicted value \(\hat f(x) = \hat\beta_0 + x^T\hat\beta\). Consider the following rule: classify to class 2 if \(\hat f(x) > 0\) and class 1 otherwise. Show this is not the same as the LDA rule unless the classes have equal numbers of observations.
(Fisher, 1936, The use of multiple measurements in taxonomic problems; Ripley, 1996, Pattern Recognition and Neural Networks)
Soln. 4.2
(a)
We have \(\pi_1=N_1/N\) and \(\pi_2 = N_2/N\). The conclusion follows directly from (4.9) in the textbook.
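Spelling this out: by (4.9), with the estimates plugged in, the log posterior odds of class 2 versus class 1 are
\begin{equation*}
\log\frac{\hat{\text{Pr}}(G=2\mid X=x)}{\hat{\text{Pr}}(G=1\mid X=x)} = \log\frac{\pi_2}{\pi_1} - \frac{1}{2}(\hat\mu_2+\hat\mu_1)^T\hat\Sigma^{-1}(\hat\mu_2-\hat\mu_1) + x^T\hat\Sigma^{-1}(\hat\mu_2-\hat\mu_1).
\end{equation*}
Classifying to class 2 exactly when this quantity is positive, and substituting \(\pi_1=N_1/N\) and \(\pi_2=N_2/N\), rearranges to the inequality \(\eqref{eq:ex43LDA}\) above.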
(b)
We start by introducing notation used in Chapter 3.
Let \(x_i^T = (x_{i1}, ..., x_{ip})\in \mathbb{R}^{1\times p}\), \(\textbf{1}^T = (1,...,1)\in \mathbb{R}^{1\times N}\), \(Y^T = (y_1, ..., y_N)\in \mathbb{R}^{1\times N}\), \(\beta^T = (\beta_{1}, ..., \beta_{p})\in \mathbb{R}^{1\times p}\),
and
\begin{equation}
\textbf{X} = \begin{pmatrix}x_1^T\\ \vdots\\ x_N^T\end{pmatrix}\in\mathbb{R}^{N\times p},
\end{equation}
so that the least squares criterion can be written in matrix form as
\begin{equation}
\text{RSS}(\beta_0,\beta) = \sum_{i=1}^N\left(y_i - \beta_0 - x_i^T\beta\right)^2 = (Y - \beta_0\textbf{1} - \textbf{X}\beta)^T(Y - \beta_0\textbf{1} - \textbf{X}\beta).
\end{equation}
From knowledge of linear regression, e.g., (3.6) in the textbook, the minimizers satisfy
\begin{equation}
\begin{pmatrix}\textbf{1}^T\\ \textbf{X}^T\end{pmatrix}\left(Y - \beta_0\textbf{1} - \textbf{X}\beta\right) = 0.
\end{equation}
Therefore we have
\begin{equation}
\begin{aligned}
N\beta_0 + \textbf{1}^T\textbf{X}\beta &= \textbf{1}^TY,\\
\textbf{X}^T\textbf{1}\beta_0 + \textbf{X}^T\textbf{X}\beta &= \textbf{X}^TY.
\end{aligned}
\label{eq:ex43beta0}
\end{equation}
From the first equation above, we get
\begin{equation}
\beta_0 = \frac{1}{N}\textbf{1}^TY - \frac{1}{N}\textbf{1}^T\textbf{X}\beta.\label{eq:ex43beta0solution}
\end{equation}
Plugging this into \(\eqref{eq:ex43beta0}\) and solving for \(\beta\), we obtain
\begin{equation}
\left(\textbf{X}^T\textbf{X} - \frac{1}{N}\textbf{X}^T\textbf{1}\textbf{1}^T\textbf{X}\right)\beta = \textbf{X}^TY - \frac{1}{N}\textbf{X}^T\textbf{1}\textbf{1}^TY.\label{eq:ex43beta}
\end{equation}
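In other words, \(\eqref{eq:ex43beta}\) is just the normal equation for the centered data: writing \(\tilde{\textbf{X}} = \textbf{X} - \frac{1}{N}\textbf{1}\textbf{1}^T\textbf{X}\) and \(\tilde{Y} = Y - \frac{1}{N}\textbf{1}\textbf{1}^TY\) (each column of \(\textbf{X}\) and the response with its mean removed), it reads
\begin{equation*}
\tilde{\textbf{X}}^T\tilde{\textbf{X}}\,\beta = \tilde{\textbf{X}}^T\tilde{Y}.
\end{equation*}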
We pause here and turn to \(\hat\mu_1\), \(\hat\mu_2\) and \(\hat\Sigma\). Let us encode \(y_i\) for class 1 as \(t_1\) and \(y_i\) for class 2 as \(t_2\); for this particular problem (b), \(t_1=-N/N_1\) and \(t_2=N/N_2\). By definition, we have
\begin{equation}
\hat\mu_1 = \frac{1}{N_1}\sum_{i:\,y_i=t_1}x_i,\qquad \hat\mu_2 = \frac{1}{N_2}\sum_{i:\,y_i=t_2}x_i,
\end{equation}
and
\begin{equation}
\hat\Sigma = \frac{1}{N-2}\left(\sum_{i:\,y_i=t_1}(x_i-\hat\mu_1)(x_i-\hat\mu_1)^T + \sum_{i:\,y_i=t_2}(x_i-\hat\mu_2)(x_i-\hat\mu_2)^T\right).
\end{equation}
From the equations above we can rewrite
\begin{align*}
\textbf{1}^TY &= N_1t_1 + N_2t_2,\\
\textbf{1}^T\textbf{X} &= (N_1\hat\mu_1 + N_2\hat\mu_2)^T,\\
\textbf{X}^TY &= N_1t_1\hat\mu_1 + N_2t_2\hat\mu_2,\\
\textbf{X}^T\textbf{X} &= (N-2)\hat\Sigma + N_1\hat\mu_1\hat\mu_1^T + N_2\hat\mu_2\hat\mu_2^T.
\end{align*}
Now we turn back to \(\eqref{eq:ex43beta}\). For the LHS,
\begin{equation}
\begin{aligned}
\textbf{X}^T\textbf{X} - \frac{1}{N}\textbf{X}^T\textbf{1}\textbf{1}^T\textbf{X}
&= (N-2)\hat\Sigma + N_1\hat\mu_1\hat\mu_1^T + N_2\hat\mu_2\hat\mu_2^T - \frac{1}{N}(N_1\hat\mu_1+N_2\hat\mu_2)(N_1\hat\mu_1+N_2\hat\mu_2)^T\\
&= (N-2)\hat\Sigma + \frac{N_1N_2}{N}(\hat\mu_2-\hat\mu_1)(\hat\mu_2-\hat\mu_1)^T\\
&= (N-2)\hat\Sigma + N\hat\Sigma_B,
\end{aligned}
\label{eq:ex43LHS}
\end{equation}
where \(\hat\Sigma_B = \frac{N_1N_2}{N^2}(\hat\mu_2-\hat\mu_1)(\hat\mu_2-\hat\mu_1)^T\).
For the RHS of \(\eqref{eq:ex43beta}\),
\begin{equation}
\begin{aligned}
\textbf{X}^TY - \frac{1}{N}\textbf{X}^T\textbf{1}\textbf{1}^TY
&= N_1t_1\hat\mu_1 + N_2t_2\hat\mu_2 - \frac{N_1t_1+N_2t_2}{N}(N_1\hat\mu_1+N_2\hat\mu_2)\\
&= \frac{N_1N_2}{N}(t_2-t_1)(\hat\mu_2-\hat\mu_1).
\end{aligned}
\label{eq:ex43rhs}
\end{equation}
Combining \(\eqref{eq:ex43beta}\), \(\eqref{eq:ex43LHS}\) and \(\eqref{eq:ex43rhs}\) we get
\begin{equation}
\left[(N-2)\hat\Sigma + N\hat\Sigma_B\right]\beta = \frac{N_1N_2}{N}(t_2-t_1)(\hat\mu_2-\hat\mu_1).\label{eq:ex43b}
\end{equation}
Note that \(t_1=-N/N_1\), \(t_2 = N/N_2\) and \(N=N_1+N_2\), so that \(t_2-t_1=\frac{N^2}{N_1N_2}\).
Thus \(\eqref{eq:ex43rhs}\) reduces to \(N(\hat\mu_2-\hat\mu_1)\), which completes the proof.
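Explicitly,
\begin{equation*}
\frac{N_1N_2}{N}(t_2-t_1)(\hat\mu_2-\hat\mu_1) = \frac{N_1N_2}{N}\cdot\frac{N^2}{N_1N_2}\,(\hat\mu_2-\hat\mu_1) = N(\hat\mu_2-\hat\mu_1),
\end{equation*}
so \(\eqref{eq:ex43b}\) matches the identity stated in part (b).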
(c)
We have
\begin{equation}
\hat\Sigma_B\beta = \frac{N_1N_2}{N^2}(\hat\mu_2-\hat\mu_1)(\hat\mu_2-\hat\mu_1)^T\beta = \frac{N_1N_2}{N^2}\left[(\hat\mu_2-\hat\mu_1)^T\beta\right](\hat\mu_2-\hat\mu_1),
\end{equation}
where \((\hat\mu_2-\hat\mu_1)^T\beta\in \mathbb{R}\) is a scalar; thus \(\hat\Sigma_B\beta\) is in the direction of \((\hat\mu_2-\hat\mu_1)\). By part (b) we then have
\begin{equation}
(N-2)\hat\Sigma\hat\beta = N(\hat\mu_2-\hat\mu_1) - N\hat\Sigma_B\hat\beta = c\,(\hat\mu_2-\hat\mu_1)
\end{equation}
for some scalar \(c\), and therefore \(\hat\beta\propto\hat\Sigma^{-1}(\hat\mu_2-\hat\mu_1)\).
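In fact, the proportionality constant is explicit: writing \(\hat\beta = \lambda\,\hat\Sigma^{-1}(\hat\mu_2-\hat\mu_1)\) and substituting into the identity from (b) gives (assuming \(\hat\mu_2\neq\hat\mu_1\))
\begin{equation*}
\lambda\left[(N-2) + \frac{N_1N_2}{N}(\hat\mu_2-\hat\mu_1)^T\hat\Sigma^{-1}(\hat\mu_2-\hat\mu_1)\right] = N,
\end{equation*}
so \(\lambda>0\) whenever \(\hat\Sigma\) is positive definite; this positivity is used again in (e).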
(d)
This follows directly from \(\eqref{eq:ex43b}\): for any coding with \(t_1\neq t_2\), the right-hand side is a nonzero multiple of \((\hat\mu_2-\hat\mu_1)\), so the argument in (c) again gives \(\hat\beta\propto\hat\Sigma^{-1}(\hat\mu_2-\hat\mu_1)\).
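For example, with the common coding \(t_1=-1\), \(t_2=1\) we have \(t_2-t_1=2\), and \(\eqref{eq:ex43b}\) becomes
\begin{equation*}
\left[(N-2)\hat\Sigma + N\hat\Sigma_B\right]\beta = \frac{2N_1N_2}{N}(\hat\mu_2-\hat\mu_1);
\end{equation*}
only the length of \(\hat\beta\) changes, not its direction.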
(e)
Assuming the encoding of \(-N/N_1\) and \(N/N_2\), we have \(\textbf{1}^TY = N_1t_1 + N_2t_2 = 0\), so by \(\eqref{eq:ex43beta0solution}\)
\begin{equation}
\hat\beta_0 = -\frac{1}{N}\textbf{1}^T\textbf{X}\hat\beta = -\frac{1}{N}(N_1\hat\mu_1 + N_2\hat\mu_2)^T\hat\beta,
\end{equation}
so that
\begin{equation}
\hat f(x) = \hat\beta_0 + x^T\hat\beta = \left(x - \frac{N_1\hat\mu_1 + N_2\hat\mu_2}{N}\right)^T\hat\beta.
\end{equation}
Since \(\hat\beta\propto \hat\Sigma^{-1}(\hat\mu_2-\hat\mu_1)\), there exists \(\lambda > 0\) such that \(\hat\beta = \lambda \hat\Sigma^{-1}(\hat\mu_2-\hat\mu_1)\) (if the scalar were negative we could equivalently flip the sign of the classification rule). Therefore, \(\hat f(x) > 0\) is equivalent to
\begin{equation}
x^T\hat\Sigma^{-1}(\hat\mu_2-\hat\mu_1) > \frac{1}{N}(N_1\hat\mu_1 + N_2\hat\mu_2)^T\hat\Sigma^{-1}(\hat\mu_2-\hat\mu_1).
\end{equation}
When \(N_1 = N_2\), the right-hand side equals \(\frac{1}{2}(\hat\mu_1+\hat\mu_2)^T\hat\Sigma^{-1}(\hat\mu_2-\hat\mu_1) = \frac{1}{2}\hat\mu_2^T\hat\Sigma^{-1}\hat\mu_2 - \frac{1}{2}\hat\mu_1^T\hat\Sigma^{-1}\hat\mu_1\) and \(\log(N_1/N)-\log(N_2/N)=0\), so this is exactly the LDA rule \(\eqref{eq:ex43LDA}\). When \(N_1\neq N_2\), the prior term \(\log(N_1/N)-\log(N_2/N) = -\log(N_2/N_1)\) in \(\eqref{eq:ex43LDA}\) is nonzero and the least-squares cut-point uses the weighted mean \((N_1\hat\mu_1+N_2\hat\mu_2)/N\) instead of the simple average, so the two rules are not the same.
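As a quick numerical sanity check of (b), (c) and (e), one can simulate two Gaussian classes and verify the identities directly. The following sketch uses numpy; the class sizes, class means and variable names are illustrative only and not part of the exercise.

```python
# Numerical sanity check of parts (b), (c) and (e) on simulated data.
import numpy as np

rng = np.random.default_rng(0)
N1, N2, p = 30, 70, 3                      # deliberately unequal class sizes
N = N1 + N2

# Two Gaussian classes with different means and common covariance.
X = np.vstack([rng.normal(0.0, 1.0, (N1, p)),
               rng.normal(1.0, 1.0, (N2, p))])
y = np.concatenate([np.full(N1, -N / N1), np.full(N2, N / N2)])

# Least squares fit of y on (1, x).
coef = np.linalg.lstsq(np.column_stack([np.ones(N), X]), y, rcond=None)[0]
beta0_hat, beta_hat = coef[0], coef[1:]

# Class means, pooled covariance and Sigma_B as defined above.
mu1_hat, mu2_hat = X[:N1].mean(axis=0), X[N1:].mean(axis=0)
d = mu2_hat - mu1_hat
Sigma_hat = ((X[:N1] - mu1_hat).T @ (X[:N1] - mu1_hat)
             + (X[N1:] - mu2_hat).T @ (X[N1:] - mu2_hat)) / (N - 2)
Sigma_B = (N1 * N2 / N**2) * np.outer(d, d)

# (b): [(N-2) Sigma + N Sigma_B] beta = N (mu2 - mu1).
print(np.allclose(((N - 2) * Sigma_hat + N * Sigma_B) @ beta_hat, N * d))

# (c): beta is a positive multiple of Sigma^{-1} (mu2 - mu1).
lda_dir = np.linalg.solve(Sigma_hat, d)
lam = beta_hat / lda_dir
print(np.allclose(lam, lam[0]), lam[0] > 0)

# (e): cut-points for x^T Sigma^{-1} (mu2 - mu1) differ when N1 != N2.
ls_cut = (N1 * mu1_hat + N2 * mu2_hat) @ lda_dir / N
lda_cut = 0.5 * d @ np.linalg.solve(Sigma_hat, mu2_hat + mu1_hat) + np.log(N1 / N2)
print(ls_cut, lda_cut)
```

The first two checks should print `True` for any data set, since they are algebraic identities of the least-squares solution; the two cut-points printed at the end differ here because \(N_1\neq N_2\).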