Ex. 18.2

Nearest shrunken centroids and the lasso. Consider a (naive Bayes) Gaussian model for classification in which the features \(j=1,2,...,p\) are assumed to be independent within each class \(k=1,2,...,K\). With observations \(i=1,2,...,N\) and \(C_k\) equal to the set of indices of the \(N_k\) observations in class \(k\), we observe \(x_{ij}\sim N(\mu_j + \mu_{jk}, \sigma^2_j)\) for \(i\in C_k\) with \(\sum_{k=1}^K\mu_{jk}=0\). Set \(\hat\sigma_j^2=s_j^2\), the pooled within-class variance for feature \(j\), and consider the lasso-style minimization problem

\[\begin{equation} \min_{\{\mu_j, \mu_{jk}\}}\left\{\frac{1}{2}\sum_{j=1}^p\sum_{k=1}^K\sum_{i\in C_k}\frac{(x_{ij}-\mu_j-\mu_{jk})^2}{s_j^2}+\lambda\sum_{j=1}^p\sum_{k=1}^K\sqrt{N_k}\left|\frac{\mu_{jk}}{s_j}\right|\right\}.\non \end{equation}\]

Show that the solution is equivalent to the nearest shrunken centroid estimator (18.5), with \(s_0\) set to zero, and \(M_k\) equal to \(1/N_k\) instead of \(1/N_k-1/N\) as before.
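Before deriving the solution, it may help to see the criterion concretely. Below is a minimal NumPy sketch, where the sizes, the seed, and the helper name `objective` are our own illustrative choices: it simulates \(x_{ij}\sim N(\mu_j+\mu_{jk},\sigma_j^2)\), estimates \(s_j^2\) by the pooled within-class variance, and evaluates the lasso-style criterion.

```python
# Illustrative setup only: sizes, seed, and names are assumptions, not from the text.
import numpy as np

rng = np.random.default_rng(1)
p, K = 3, 2                                  # features and classes
Nk = np.array([8, 12])                       # class sizes N_k
labels = np.repeat(np.arange(K), Nk)         # i in C_k encoded as labels[i] = k

mu = rng.normal(size=p)                      # overall means mu_j
mu_k = rng.normal(size=(K, p))
mu_k -= mu_k.mean(axis=0)                    # enforce sum_k mu_jk = 0
sigma = np.array([1.0, 0.5, 2.0])            # sigma_j
X = rng.normal(mu + mu_k[labels], sigma)     # x_ij ~ N(mu_j + mu_jk, sigma_j^2)

# pooled within-class variance s_j^2 (denominator N - K)
xbark = np.stack([X[labels == k].mean(axis=0) for k in range(K)])
s2 = sum(((X[labels == k] - xbark[k])**2).sum(axis=0)
         for k in range(K)) / (Nk.sum() - K)

def objective(mu, mu_k, lam):
    resid = X - mu - mu_k[labels]            # x_ij - mu_j - mu_jk
    fit = 0.5 * (resid**2 / s2).sum()
    penalty = lam * (np.sqrt(Nk)[:, None] * np.abs(mu_k) / np.sqrt(s2)).sum()
    return fit + penalty

print(objective(mu, mu_k, lam=1.0))          # criterion at the true parameters
```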

Soln. 18.2

Denote the objective function by \(L(\mu_j, \mu_{jk})\). We take the first-order derivatives with respect to \(\mu_j\) and \(\mu_{jk}\) and set them to zero. For \(\mu_j\), the constraint \(\sum_{k=1}^K\mu_{jk}=0\) lets us drop the offset term \(\sum_{k=1}^K\sum_{i\in C_k}\mu_{jk}=\sum_{k=1}^K N_k\mu_{jk}\) below (this is exact for balanced classes, and in general amounts to the usual normalization \(\hat\mu_j=\bar x_j\)), so we have

\[\begin{eqnarray} \frac{\partial L(\mu_j, \mu_{jk})}{\partial \mu_j} &=& - \sum_{k=1}^K\sum_{i\in C_k}\frac{x_{ij}-\mu_j-\mu_{jk}}{s_j^2}\non\\ &=&\frac{1}{s_j^2}\sum_{k=1}^K\sum_{i\in C_k}\mu_j -\frac{1}{s_j^2}\sum_{k=1}^K\sum_{i\in C_k}x_{ij}\non\\ &=&\frac{1}{s_j^2}\left(N\mu_j - \sum_{k=1}^K\sum_{i\in C_k}x_{ij}\right)\non\\ &=&0, \non \end{eqnarray}\]

thus we get

\[\begin{equation} \mu_j = \frac{1}{N}\sum_{k=1}^K\sum_{i\in C_k}x_{ij} = \bar x_j.\non \end{equation}\]

For \(\mu_{jk}\neq 0\), so that the absolute value in the penalty is differentiable, we have

\[\begin{eqnarray} \frac{\partial L(\mu_j, \mu_{jk})}{\partial \mu_{jk}} &=& -\sum_{i\in C_k}\frac{x_{ij}-\mu_j-\mu_{jk}}{s_j^2} + \frac{\lambda\sqrt{N_k}}{s_j}\text{sign}(\mu_{jk})\non\\ &=&\frac{1}{s_j^2}\left[N_k\mu_{jk}+N_k\mu_j-\sum_{i\in C_k}x_{ij}+\lambda\sqrt{N_k}s_j\text{sign}(\mu_{jk})\right]\non\\ &=&0.\non \end{eqnarray}\]

Thus we have

\[\begin{eqnarray} \mu_{jk} &=& \frac{s_j}{\sqrt{N_k}}\left[\frac{\frac{1}{N_k}\sum_{i\in C_k}x_{ij}-\mu_j}{s_j/\sqrt{N_k}}-\lambda \text{sign}(\mu_{jk})\right]\non\\ &=&\frac{s_j}{\sqrt{N_k}}\left[d_{jk}-\lambda \text{sign}(\mu_{jk})\right] \label{eq:18-2a} \end{eqnarray}\]

where

\[\begin{equation} d_{jk} = \frac{\frac{1}{N_k}\sum_{i\in C_k}x_{ij}-\mu_j}{s_j/\sqrt{N_k}}.\non \end{equation}\]

We claim that

\[\begin{equation} \label{eq:18-2b} \mu_{jk} = \frac{1}{\sqrt{N_k}}s_j\text{sign}(d_{jk})(|d_{jk}|-\lambda)_+. \end{equation}\]

To verify this, note that when \(\mu_{jk} > 0\), \(\eqref{eq:18-2a}\) becomes

\[\begin{equation} \mu_{jk} = \frac{s_j}{\sqrt{N_k}}\left[d_{jk}-\lambda\right] > 0,\non \end{equation}\]

so that \(d_{jk} > \lambda > 0\) and \(\eqref{eq:18-2b}\) agrees with the expression above. A similar argument applies when \(\mu_{jk} < 0\), and when \(|d_{jk}| \le \lambda\) the zero-subgradient condition holds at \(\mu_{jk}=0\), which is again what \(\eqref{eq:18-2b}\) gives.
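Equation \(\eqref{eq:18-2b}\) is exactly soft-thresholding of \(d_{jk}\), rescaled by \(s_j/\sqrt{N_k}\). A minimal NumPy sketch (the function names are ours, not from the text):

```python
import numpy as np

def soft_threshold(d, lam):
    # sign(d) * (|d| - lam)_+ : shrink d toward zero; exactly zero once |d| <= lam
    return np.sign(d) * np.maximum(np.abs(d) - lam, 0.0)

def mu_jk(d_jk, s_j, N_k, lam):
    # the claimed solution (18-2b): rescale the soft-thresholded statistic
    return s_j / np.sqrt(N_k) * soft_threshold(d_jk, lam)
```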

Combining the two results, the shrunken centroid for class \(k\) is \(\hat\mu_j + \hat\mu_{jk} = \bar x_j + \frac{s_j}{\sqrt{N_k}}\,\text{sign}(d_{jk})(|d_{jk}|-\lambda)_+\), which is exactly the nearest shrunken centroid estimator (18.5) with \(s_0 = 0\), \(M_k = 1/N_k\) (i.e., \(m_k = 1/\sqrt{N_k}\)), and threshold \(\Delta = \lambda\). This completes the proof.
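As a numerical sanity check, one can fix \(\mu_j = \bar x_j\), solve each one-dimensional problem in \(\mu_{jk}\) by direct minimization, and compare against the closed form. A sketch under illustrative assumptions (sizes, seed, search bounds, and tolerance are ours):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
p, K, lam = 4, 3, 0.8
Nk = np.array([10, 15, 20])                          # class sizes N_k
labels = np.repeat(np.arange(K), Nk)
X = rng.normal(size=(Nk.sum(), p)) + 0.5 * labels[:, None]  # shift means by class

xbar = X.mean(axis=0)                                # mu_j fixed at xbar_j
xbark = np.stack([X[labels == k].mean(axis=0) for k in range(K)])
s = np.sqrt(sum(((X[labels == k] - xbark[k])**2).sum(axis=0)
                for k in range(K)) / (Nk.sum() - K))  # pooled within-class sd s_j

for k in range(K):
    for j in range(p):
        d = (xbark[k, j] - xbar[j]) / (s[j] / np.sqrt(Nk[k]))
        closed = s[j] / np.sqrt(Nk[k]) * np.sign(d) * max(abs(d) - lam, 0.0)

        # the (j, k) term of the criterion, as a function of mu_jk alone
        xc = X[labels == k, j]
        obj = lambda m: (0.5 * ((xc - xbar[j] - m) ** 2).sum() / s[j] ** 2
                         + lam * np.sqrt(Nk[k]) * abs(m) / s[j])
        res = minimize_scalar(obj, bounds=(-5.0, 5.0), method="bounded")
        assert abs(res.x - closed) < 1e-4, (j, k, res.x, closed)

print("closed form matches direct minimization for every (j, k)")
```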