Ex. 4.5
Consider a two-class logistic regression problem with \(x\in\mathbb{R}\). Characterize the maximum-likelihood estimates of the slope and intercept parameter if the sample \(x_i\) for two classes are separated by a point \(x_0\in\mathbb{R}\). Generalize this result to (a) \(x\in \mathbb{R}^p\) (see Figure 4.16 in the textbook), and (b) more than two classes.
Soln. 4.5
Naturally we label \(y_i=1\) for those \(x_i > x_0\) and \(y_i=0\) for those \(x_i < x_0\). The log-likelihood for \(N\) observations is
\[
\ell(\beta_0,\beta_1)=\sum_{i=1}^{N}\Big[y_i(\beta_0+\beta_1 x_i)-\log\big(1+e^{\beta_0+\beta_1 x_i}\big)\Big].
\]
By choosing \(\beta_0 = -\beta_1 x_0\), the equation above simplifies to
\[
\ell(\beta_1)=\sum_{i=1}^{N}\Big[y_i\beta_1(x_i-x_0)-\log\big(1+e^{\beta_1(x_i-x_0)}\big)\Big]
=-\sum_{y_i=1}\log\big(1+e^{-\beta_1(x_i-x_0)}\big)-\sum_{y_i=0}\log\big(1+e^{\beta_1(x_i-x_0)}\big).
\]
It is easy to verify that every observation's contribution to the log-likelihood is negative and tends to \(0\) as \(\beta_1\to\infty\), so \(\ell(\beta_1)\) increases monotonically toward its supremum \(0\) but never attains it. Hence the maximum of the log-likelihood is never achieved: the fit drives \(\hat\beta_1\to\infty\), and \(\hat\beta_0=-\hat\beta_1 x_0\) diverges with it (unless \(x_0=0\)).
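This non-existence can be checked numerically. Below is a minimal sketch (the sample, the threshold \(x_0=0\), and the helper `log_lik` are all invented for illustration): as the slope grows, the log-likelihood climbs toward \(0\) yet stays below it.

```python
import math

def log_lik(beta1, xs, ys, x0):
    """Logistic log-likelihood with the intercept tied to beta0 = -beta1 * x0."""
    ll = 0.0
    for x, y in zip(xs, ys):
        z = beta1 * (x - x0)  # linear predictor beta0 + beta1 * x
        ll += y * z - math.log(1.0 + math.exp(z))
    return ll

# Toy sample separated by x0 = 0: class 0 on the left, class 1 on the right.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]

# Increasing the slope only increases the log-likelihood toward its supremum 0.
lls = [log_lik(b, xs, ys, x0=0.0) for b in (1.0, 5.0, 25.0)]
```

Any gradient-based fitting routine applied to such data will therefore report ever-larger coefficients rather than converge.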
- (a) When \(x\in\mathbb{R}^p\) with \(p > 1\), suppose again that the sample is linearly separable: there exist two subsets \(S_1\) and \(S_2\) of \(\{x_1,...,x_N\}\), with \(S_1\cup S_2=\{x_1,...,x_N\}\) and \(S_1\cap S_2 = \emptyset\), and a hyperplane \(\hat\beta^T x = 0\) with \(\hat\beta\in\mathbb{R}^{p+1}\) (each \(x_i\) augmented with a leading constant \(1\)) such that
\[
\hat\beta^T x_i > 0 \ \text{ for } x_i\in S_1, \qquad \hat\beta^T x_i < 0 \ \text{ for } x_i\in S_2.
\]
We label \(y_i=1\) for \(x_i\in S_1\) and \(y_i=0\) for \(x_i\in S_2\). The log-likelihood becomes
\[
\ell(\beta)=\sum_{i=1}^{N}\Big[y_i\,\beta^T x_i-\log\big(1+e^{\beta^T x_i}\big)\Big].
\]
As in the case \(p=1\), if we keep rescaling along the separating direction, \(\beta^{\text{new}}\leftarrow \alpha\beta^{\text{old}}\) starting from \(\hat\beta\) with \(\alpha\to+\infty\), every fitted probability tends to its observed class label, so the log-likelihood \(\ell(\beta)\to 0\), its supremum, and again no finite maximizer exists.
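The scaling argument can be illustrated in \(\mathbb{R}^2\) (a hypothetical sketch; the four points and the separating direction `beta_hat` are invented): multiplying a separating \(\hat\beta\) by a growing factor \(\alpha\) pushes the log-likelihood up toward \(0\).

```python
import math

def log_lik(beta, X, ys):
    """Logistic log-likelihood; rows of X are augmented as (1, x1, ..., xp)."""
    ll = 0.0
    for x, y in zip(X, ys):
        z = sum(b * xi for b, xi in zip(beta, x))  # beta^T x
        ll += y * z - math.log(1.0 + math.exp(z))
    return ll

# Toy separable data in R^2: the line x1 + x2 = 0 separates the classes.
X = [(1.0, -1.0, -2.0), (1.0, -2.0, 1.0), (1.0, 2.0, -1.0), (1.0, 1.0, 2.0)]
ys = [0, 0, 1, 1]
beta_hat = (0.0, 1.0, 1.0)  # beta_hat^T x < 0 on class 0, > 0 on class 1

# Rescaling beta_hat by alpha drives the log-likelihood up toward 0.
lls = [log_lik(tuple(a * b for b in beta_hat), X, ys) for a in (1.0, 5.0, 20.0)]
```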
- (b) For the case of \(K>2\) classes in \(\mathbb{R}^p\) with \(p>1\), assuming each class is linearly separable from the rest, the same argument applies. Taking the \(K\)-th class as the reference (\(\beta_K\equiv 0\)) and writing \(g_i\) for the class of observation \(i\), the log-likelihood becomes
\[
\ell(\beta)=\sum_{i=1}^{N}\Big[\beta_{g_i}^T x_i-\log\Big(1+\sum_{l=1}^{K-1} e^{\beta_l^T x_i}\Big)\Big].
\]
By separability there exist \(\hat\beta_k\) for \(k=1,...,K-1\) such that \(\hat\beta^T_k x > 0\) for \(x\in S_k\) and \(\hat\beta^T_k x < 0\) otherwise, where \(S_k\) denotes the sample points of class \(k\). Setting \(\beta_k=\alpha\hat\beta_k\) and letting \(\alpha\to+\infty\), as in (a), drives each observation's fitted probability of its own class to \(1\), so the log-likelihood \(\ell(\beta)\to 0\), its supremum, and again no finite maximum-likelihood estimate exists.
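The multiclass version of the scaling argument can be sketched with the multilogit likelihood (the points, labels, and directions `beta_hats` below are all invented; the last class serves as the reference with its coefficients fixed at zero):

```python
import math

def multilogit_log_lik(betas, X, labels):
    """Multinomial-logit log-likelihood; betas holds K-1 coefficient vectors,
    the K-th class is the reference with coefficients fixed at 0."""
    ll = 0.0
    for x, g in zip(X, labels):
        zs = [sum(b * xi for b, xi in zip(beta, x)) for beta in betas] + [0.0]
        ll += zs[g] - math.log(sum(math.exp(z) for z in zs))
    return ll

# Three separable classes in R^2, one point each (augmented with a leading 1).
X = [(1.0, 3.0, 0.0), (1.0, -3.0, 0.0), (1.0, 0.0, 3.0)]
labels = [0, 1, 2]
# beta_hats[k]^T x > 0 exactly on class k, and < 0 on the other classes.
beta_hats = [(-1.0, 1.0, -1.0), (-1.0, -1.0, -1.0)]

# Scaling every beta_hat by alpha drives the log-likelihood toward 0.
lls = [multilogit_log_lik([tuple(a * b for b in bh) for bh in beta_hats], X, labels)
       for a in (1.0, 4.0, 10.0)]
```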