Ex. 4.6

Suppose we have \(N\) points \(x_i\) in \(\mathbb{R}^p\) in general position, with class labels \(y_i\in\{-1,1\}\). Prove that the perceptron learning algorithm converges to a separating hyperplane in a finite number of steps:

  • (a) Denote a hyperplane by \(f(x) = \beta_1^Tx + \beta_0 = 0\) or in a more compact notation \(\beta^Tx^\ast=0\), where \(x^\ast = (x,1)\) and \(\beta=(\beta_1, \beta_0)\). Let \(z_i = x_i^\ast/\|x_i^\ast\|\). Show that separability implies the existence of a \(\beta_{\text{sep}}\) such that \(y_i\beta^T_{\text{sep}}z_i\ge 1\ \ \forall i\).
  • (b) Given a current \(\beta_{\text{old}}\), the perceptron algorithm identifies a point \(z_i\) that is misclassified, and produces the update \(\beta_{\text{new}}\leftarrow \beta_{\text{old}} + y_iz_i\). Show that \(\|\beta_{\text{new}}-\beta_{\text{sep}}\|^2\le \|\beta_{\text{old}}-\beta_{\text{sep}}\|^2 - 1\), and hence that the algorithm converges to a separating hyperplane in no more than \(\|\beta_{\text{start}}-\beta_{\text{sep}}\|^2\) steps (Pattern Recognition and Neural Networks).
Soln. 4.6
  • (a) By definition of separability, there exists a \(\beta\) such that
\[\begin{eqnarray} \beta^Tx_i^\ast &>& 0 \quad \text{for}\ \ y_i = 1,\non\\ \beta^Tx_i^\ast &<& 0 \quad \text{for}\ \ y_i = -1.\non \end{eqnarray}\]

Thus \(y_i\beta^Tx_i^\ast > 0\) for all \(i\), and since each \(z_i = x_i^\ast/\|x_i^\ast\|\) is a positive multiple of \(x_i^\ast\), also \(y_i\beta^Tz_i > 0\) for all \(i\).

Define

\[\begin{equation} m := \min_{i} y_i\beta^Tz_i.\non \end{equation}\]

Since the minimum is taken over finitely many strictly positive quantities, \(m > 0\). Dividing by \(m\) gives \(y_i\left(\frac{1}{m}\beta\right)^Tz_i\ge 1\) for all \(i\), so \(\beta_{\text{sep}} := \frac{1}{m}\beta\) satisfies \(y_i\beta^T_{\text{sep}}z_i\ge 1\ \ \forall i\).
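
As a sanity check on this rescaling argument, here is a minimal NumPy sketch; the toy data, the choice of \(\beta\), and all variable names are illustrative, not from the text:

```python
import numpy as np

# Numerical check of part (a): rescaling a separating beta by 1/m
# gives beta_sep with margin y_i * beta_sep^T z_i >= 1 on every point.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 0.5, (10, 2)),    # class +1
               rng.normal(-2.0, 0.5, (10, 2))])  # class -1
y = np.array([1] * 10 + [-1] * 10)

X_star = np.hstack([X, np.ones((20, 1))])                   # x* = (x, 1)
Z = X_star / np.linalg.norm(X_star, axis=1, keepdims=True)  # z_i = x*_i / ||x*_i||

beta = np.array([1.0, 1.0, 0.0])    # a separating beta for this toy data
margins = y * (Z @ beta)
assert np.all(margins > 0)          # separability: y_i * beta^T z_i > 0

m = margins.min()
beta_sep = beta / m
assert np.all(y * (Z @ beta_sep) >= 1 - 1e-12)  # y_i * beta_sep^T z_i >= 1
```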

  • (b) We have
\[\begin{eqnarray}
\|\beta_{\text{new}} - \beta_{\text{sep}}\|^2 &=& \|\beta_{\text{old}} - \beta_{\text{sep}} + y_iz_i\|^2\non\\
&=& \|\beta_{\text{old}} - \beta_{\text{sep}}\|^2 + \|y_iz_i\|^2 + 2y_i(\beta_{\text{old}} - \beta_{\text{sep}})^Tz_i\non\\
&=& \|\beta_{\text{old}} - \beta_{\text{sep}}\|^2 + 1 + 2y_i\beta_{\text{old}}^Tz_i - 2y_i\beta_{\text{sep}}^Tz_i\non\\
&\le& \|\beta_{\text{old}} - \beta_{\text{sep}}\|^2 + 1 + 2\cdot 0 - 2\cdot 1\non\\
&=& \|\beta_{\text{old}} - \beta_{\text{sep}}\|^2 - 1.\non
\end{eqnarray}\]

Here \(\|y_iz_i\|^2 = 1\) because \(y_i\in\{-1,1\}\) and \(\|z_i\| = 1\); the inequality uses \(y_i\beta_{\text{old}}^Tz_i \le 0\) (the point \(z_i\) is misclassified by \(\beta_{\text{old}}\)) and \(y_i\beta_{\text{sep}}^Tz_i \ge 1\) from part (a). Since \(\|\beta - \beta_{\text{sep}}\|^2 \ge 0\) and each update decreases it by at least 1, the algorithm can perform at most \(\|\beta_{\text{start}} - \beta_{\text{sep}}\|^2\) updates before no misclassified point remains, i.e. it converges to a separating hyperplane in a finite number of steps.
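
To see the bound in action, the following sketch runs the perceptron loop from part (b) on the same kind of toy data as above, asserting that every update shrinks \(\|\beta - \beta_{\text{sep}}\|^2\) by at least 1; all data and names are illustrative assumptions, not part of the exercise:

```python
import numpy as np

# Perceptron updates from part (b) on the normalized points z_i, with
# beta_sep built as in part (a). Toy data mirrors the earlier snippet.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 0.5, (10, 2)),
               rng.normal(-2.0, 0.5, (10, 2))])
y = np.array([1] * 10 + [-1] * 10)
X_star = np.hstack([X, np.ones((20, 1))])
Z = X_star / np.linalg.norm(X_star, axis=1, keepdims=True)

b = np.array([1.0, 1.0, 0.0])       # a separating beta for this toy data
beta_sep = b / (y * (Z @ b)).min()  # rescaled so y_i * beta_sep^T z_i >= 1

beta = beta_start = np.zeros(3)
bound = np.linalg.norm(beta_start - beta_sep) ** 2
steps = 0
while True:
    misclassified = np.where(y * (Z @ beta) <= 0)[0]
    if misclassified.size == 0:
        break                                  # beta now separates the data
    i = misclassified[0]
    old_sq_dist = np.linalg.norm(beta - beta_sep) ** 2
    beta = beta + y[i] * Z[i]                  # beta_new <- beta_old + y_i z_i
    steps += 1
    new_sq_dist = np.linalg.norm(beta - beta_sep) ** 2
    assert new_sq_dist <= old_sq_dist - 1 + 1e-9  # drop of at least 1 per step

print(f"converged after {steps} updates; bound = {bound:.1f}")
```

With \(\beta_{\text{start}} = 0\) the bound specializes to \(\|\beta_{\text{sep}}\|^2\), which is what the final print statement reports.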