Ex. 18.3

Show that the fitted coefficients for the regularized multiclass logistic regression problem (18.10) satisfy \(\sum_{k=1}^K\hat\beta_{kj}=0, j=1,...,p.\) What about the \(\hat\beta_{k0}\)? Discuss issues with these constant parameters, and how they can be resolved.

Soln. 18.3

The objective function can be written as

\[\begin{eqnarray} L(\bm{\beta}) &=& \sum_{i=1}^N\left[\sum_{k=1}^K\bb{1}(g_i=k)(\beta_{k0}+x_i^T\beta_k) - \log\left(\sum_{l=1}^K\exp(\beta_{l0}+x_i^T\beta_l)\right) - \frac{\lambda}{2}\sum_{k=1}^K\|\beta_k\|_2^2 \right].\non \end{eqnarray}\]

Taking first-order derivative w.r.t \(\beta_k\) and setting it to zero, we get

\[\begin{eqnarray} \frac{\partial L(\bm{\beta})}{\partial \beta_k} &=&\sum_{i=1}^N\left[x_i\bb{1}(g_i=k)-\frac{\exp(\beta_{k0}+x_i^T\beta_{k})}{\sum_{l=1}^K\exp(\beta_{l0}+x_i^T\beta_l)}\cdot x_i - \lambda \beta_k\right]\non\\ &=&0.\label{eq:18-3a} \end{eqnarray}\]

By the fact that \(\sum_{k=1}^K\frac{\partial L(\bm{\beta})}{\partial \beta_k} = 0\), it's easy to see that \(\eqref{eq:18-3a}\) leads to

\[\begin{equation} N\sum_{k=1}^L\hat\beta_k = \bb{0}\non. \end{equation}\]

For constant parameters \(\hat\beta_{k0}\), they are not differentiable, in the sense that if we add a common constant \(\alpha\) to each of \(\hat\beta_{k0}\), then the derived probabilities are not changed. Therefore, we need to impose an additional regularization for \(\hat\beta_{k0}\), e.g,

\[\begin{equation} \sum_{k=1}^K\hat\beta_{k0}=0.\non \end{equation}\]