Ex. 3.27

Lasso and LAR: Consider the lasso problem in Lagrange multiplier form: with \(L(\beta)=\frac{1}{2}\sum_i(y_i-\sum_j x_{ij}\beta_j)^2\), we minimize

\[\begin{equation} \label{eq:3-27a} L(\beta) + \lambda\sum_j|\beta_j| \end{equation}\]

for fixed \(\lambda>0\).

(a)

Setting \(\beta_j=\beta_j^+ - \beta_j^-\) with \(\beta^+_j, \beta^-_j\ge 0\), expression \(\eqref{eq:3-27a}\) becomes \(L(\beta) + \lambda\sum_j(\beta_j^++\beta_j^-)\). Show that the Lagrange dual function is

\[\begin{equation} L(\beta) + \lambda \sum_j(\beta^+_j + \beta^-_j)-\sum_{j}\lambda^+_j\beta^+_j-\sum_j\lambda^-_j\beta^-_j\non \end{equation}\]

and the Karush-Kuhn-Tucker optimality conditions are

\[\begin{eqnarray} \nabla L(\beta)_j + \lambda - \lambda_j^+ &=& 0 \non\\ -\nabla L(\beta)_j + \lambda - \lambda_j^- &=& 0 \non\\ \lambda^+_j\beta^+_j &=& 0\non\\ \lambda^-_j\beta^-_j &=& 0,\non \end{eqnarray}\]

along with the non-negativity constraints on the parameters and all the Lagrange multipliers.

(b)

Show that \(|\nabla L(\beta)_j|\le \lambda \ \forall j\), and that the KKT conditions imply one of the following three scenarios:

\[\begin{eqnarray} \lambda = 0 &\Rightarrow& \nabla L(\beta)_j = 0 \ \forall j\non\\ \beta^+_j > 0, \lambda > 0 &\Rightarrow& \lambda^+_j = 0, \nabla L(\beta)_j = -\lambda < 0, \beta_j^-=0\non\\ \beta^-_j > 0, \lambda > 0 &\Rightarrow& \lambda^-_j = 0, \nabla L(\beta)_j = \lambda > 0, \beta_j^+=0.\non \end{eqnarray}\]

Hence show that for any "active" predictor having \(\beta_j\neq 0\), we must have \(\nabla L(\beta)_j = -\lambda\) if \(\beta_j > 0\), and \(\nabla L(\beta)_j = \lambda\) if \(\beta_j < 0\). Assuming the predictors are standardized, relate \(\lambda\) to the correlation between the \(j\)th predictor and the current residuals.

(c)

Suppose that the set of active predictors is unchanged for \(\lambda_0\ge\lambda\ge\lambda_1\). Show that there is a vector \(\gamma_0\) such that

\[\begin{equation} \hat\beta(\lambda) = \hat\beta(\lambda_0) - (\lambda-\lambda_0)\gamma_0.\non \end{equation}\]

Thus the lasso solution path is linear as \(\lambda\) ranges from \(\lambda_0\) to \(\lambda_1\) (Efron et al., 2004, Least Angle Regression; Rosset and Zhu, 2007, Piecewise Linear Regularized Solution Paths).

Soln. 3.27

(a)

With the new variables \(\beta_j^+\ge 0\) and \(\beta_j^-\ge 0\), we introduce Lagrange multipliers \(\lambda_j^+\ge 0\) and \(\lambda_j^-\ge 0\) for the corresponding non-negativity constraints, which gives the Lagrange dual function

\[\begin{equation} L(\beta) + \lambda\sum_j(\beta_j^+ + \beta_j^-) -\sum_{j}\lambda^+_j\beta^+_j-\sum_j\lambda^-_j\beta^-_j.\non \end{equation}\]

Setting the derivatives of the above with respect to \(\beta_j^+\) and \(\beta_j^-\) to zero gives the stationarity conditions

\[\begin{eqnarray} \nabla L(\beta)_j + \lambda - \lambda_j^+ &=& 0 \non\\ -\nabla L(\beta)_j + \lambda - \lambda_j^- &=& 0. \non \end{eqnarray}\]

The other two conditions are the complementary slackness conditions:

\[\begin{eqnarray} \lambda^+_j\beta^+_j &=& 0\non\\ \lambda^-_j\beta^-_j &=& 0.\non \end{eqnarray}\]
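As a quick numerical sanity check (not part of the original exercise), the sketch below recovers the multipliers \(\lambda_j^\pm\) from the stationarity equations at a lasso solution computed with scikit-learn and verifies dual feasibility and complementary slackness. It assumes scikit-learn's `Lasso`, whose objective \(\frac{1}{2n}\|y-X\beta\|^2 + \alpha\|\beta\|_1\) corresponds to \(\lambda = n\alpha\) in the parametrization above; the data-generating setup is made up for illustration.

```python
# Numerical sanity check of the KKT conditions in (a); illustrative setup only.
# scikit-learn's Lasso minimizes (1/(2n))||y - X b||^2 + alpha*||b||_1,
# so lambda in the exercise's parametrization is lam = n * alpha.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.standard_normal((n, p))
X = (X - X.mean(0)) / X.std(0)                   # standardized predictors
y = X @ np.array([2.0, -1.0, 0.0, 0.0, 0.5]) + 0.1 * rng.standard_normal(n)

alpha = 0.1
lam = n * alpha
beta = Lasso(alpha=alpha, fit_intercept=False, tol=1e-12, max_iter=10**6).fit(X, y).coef_

grad = -X.T @ (y - X @ beta)                     # gradient of L(beta) = 0.5*||y - X beta||^2
lam_plus = grad + lam                            # from  grad_j + lam - lam_j^+ = 0
lam_minus = lam - grad                           # from -grad_j + lam - lam_j^- = 0
beta_plus, beta_minus = np.maximum(beta, 0), np.maximum(-beta, 0)

print(lam_plus.min(), lam_minus.min())           # both should be (numerically) >= 0
print(np.abs(lam_plus * beta_plus).max())        # complementary slackness: ~ 0
print(np.abs(lam_minus * beta_minus).max())      # complementary slackness: ~ 0
```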
(b)

Adding the two stationarity equations from (a) gives

\[\begin{equation} \lambda_j^+ + \lambda_j^-=2\lambda\ge0,\non \end{equation}\]

while subtracting them gives

\[\begin{equation} \nabla L(\beta)_j = \frac{1}{2}\left(\lambda_j^+ - \lambda_j^-\right).\non \end{equation}\]

Therefore

\[\begin{equation} |\nabla L(\beta)_j| = \frac{1}{2}|\lambda_j^+ - \lambda_j^-|\le \frac{1}{2}\left(\lambda_j^+ + \lambda_j^-\right)=\lambda.\non \end{equation}\]

In particular, \(\lambda = 0 \Rightarrow \nabla L(\beta)_j = 0\) for all \(j\), which is the first scenario.

If \(\beta_j^+ > 0\) and \(\lambda > 0\), the complementary slackness conditions give \(\lambda_j^+ = 0\), and thus \(\nabla L(\beta)_j = \lambda_j^+ - \lambda = -\lambda < 0\). It follows that \(\lambda_j^- = 2\lambda > 0\), so by complementary slackness again \(\beta_j^-=0\). Therefore we have

\[\begin{equation} \beta^+_j > 0,\ \lambda > 0 \Rightarrow \lambda^+_j = 0,\ \nabla L(\beta)_j = -\lambda < 0,\ \beta_j^-=0.\non \end{equation}\]

A similar argument yields

\[\begin{equation} \beta^-_j > 0, \ \lambda > 0 \Rightarrow \lambda^-_j = 0,\ \nabla L(\beta)_j = \lambda > 0,\ \beta_j^+=0.\non \end{equation}\]

Hence, for any active predictor with \(\beta_j\neq 0\), we have \(\nabla L(\beta)_j = -\lambda\,\text{sign}(\beta_j)\): that is, \(\nabla L(\beta)_j = -\lambda\) if \(\beta_j > 0\) and \(\nabla L(\beta)_j = \lambda\) if \(\beta_j < 0\).

By definition of \(L(\beta)\), we have

\[\begin{equation} \nabla L(\beta)_j = -x_j^T(y-X\beta),\non \end{equation}\]

where \(x_j\) denotes the \(j\)th column of \(X\). For an active predictor this gives \(|x_j^T(y-X\beta)| = \lambda\). Under the assumption that the predictors are standardized, \(x_j^T(y-X\beta)\) is proportional to the correlation between the \(j\)th predictor and the current residual, so \(\lambda\) is proportional to the absolute value of this correlation, which is therefore the same for every active predictor.
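To illustrate this (again with a made-up setup and scikit-learn's `Lasso`, using the same \(\lambda = n\alpha\) correspondence as above), the sketch below checks that every active predictor's inner product with the residual has magnitude exactly \(\lambda\), while the inactive predictors' inner products are no larger in magnitude:

```python
# Illustration of (b): |x_j^T (y - X beta)| = lam on the active set, <= lam off it.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 200, 6
X = rng.standard_normal((n, p))
X = (X - X.mean(0)) / X.std(0)                   # standardized predictors
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.0, 0.7]) + 0.1 * rng.standard_normal(n)

alpha = 0.05
lam = n * alpha                                  # lambda = n * alpha for sklearn's Lasso
beta = Lasso(alpha=alpha, fit_intercept=False, tol=1e-12, max_iter=10**6).fit(X, y).coef_

inner = X.T @ (y - X @ beta)                     # x_j^T (current residual)
active = beta != 0
print(np.abs(inner[active]) / lam)               # ~ 1 for every active predictor
print(np.abs(inner[~active]) / lam)              # <= 1 for every inactive predictor
```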

(c)

From (b), for every active predictor \(j\in S\), where \(S\) denotes the active set, we have

\[\begin{equation} \lambda\,\text{sign}(\hat\beta_j(\lambda)) = x_j^T(y-X\hat\beta(\lambda)).\non \end{equation}\]

Note that \(\hat\beta_j(\lambda)=0\) for \(j\notin S\), so \(X\hat\beta(\lambda) = X_S\hat\beta_S(\lambda)\), where \(X_S\) collects the active columns of \(X\) and \(\hat\beta_S(\lambda)\) the corresponding coefficients. Collecting the equations above over \(j\in S\), we can write

\[\begin{equation} X_S^T(y-X_S\hat\beta_S(\lambda)) = \lambda s,\non \end{equation}\]

where \(s\) is the sign vector with \(s_j = \text{sign}(\hat\beta_j(\lambda))\) for \(j\in S\). Since the active set is unchanged for \(\lambda_0\ge\lambda\ge\lambda_1\), the signs are unchanged as well (a sign change would require a coefficient to pass through zero and leave the active set), so \(s\) does not depend on \(\lambda\) on this interval. Assuming \(X_S^TX_S\) is invertible, we can solve for \(\hat\beta_S(\lambda)\) as

\[\begin{equation} \hat\beta_S(\lambda) = (X_S^TX_S)^{-1}X_S^Ty - \lambda(X_S^TX_S)^{-1}s.\non \end{equation}\]

Thus we see

\[\begin{equation} \hat\beta_S(\lambda) - \hat\beta_S(\lambda_0) = -(\lambda-\lambda_0)(X_S^TX_S)^{-1}s.\non \end{equation}\]

Let \(\gamma_0\) be the vector with \((X_S^TX_S)^{-1}s\) in the coordinates indexed by \(S\) and zeros elsewhere. Since the inactive coefficients are identically zero on the interval, we obtain

\[\begin{equation} \hat\beta(\lambda) = \hat\beta(\lambda_0) - (\lambda-\lambda_0)\gamma_0.\non \end{equation}\]

Thus the lasso solution path is linear as \(\lambda\) ranges from \(\lambda_0\) to \(\lambda_1\).
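As a final sanity check (same hedges as before: an illustrative data set and scikit-learn's `Lasso` with \(\lambda = n\alpha\)), the sketch below verifies that on an interval where the active set does not change, the coefficients at the midpoint are the linear interpolation of the endpoint coefficients, and that the slope of the path matches \(-\gamma_0\) with \(\gamma_0 = (X_S^TX_S)^{-1}s\) on the active set:

```python
# Illustration of (c): piecewise-linear lasso path on an interval with a fixed active set.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 300, 5
X = rng.standard_normal((n, p))
X = (X - X.mean(0)) / X.std(0)                   # standardized predictors
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + 0.1 * rng.standard_normal(n)

def lasso_coef(lam):
    # sklearn's alpha corresponds to lam / n in the exercise's parametrization
    return Lasso(alpha=lam / n, fit_intercept=False, tol=1e-12, max_iter=10**6).fit(X, y).coef_

lam0, lam1 = 40.0, 20.0                          # chosen so the active set stays fixed (checked below)
b0, b1, bm = lasso_coef(lam0), lasso_coef(lam1), lasso_coef((lam0 + lam1) / 2)
assert np.array_equal(b0 != 0, b1 != 0)          # same active set at both endpoints

print(np.abs(bm - (b0 + b1) / 2).max())          # ~ 0: the path is linear in lambda

S = b0 != 0
gamma0 = np.linalg.solve(X[:, S].T @ X[:, S], np.sign(b0[S]))
print(np.abs((b1[S] - b0[S]) / (lam1 - lam0) + gamma0).max())   # ~ 0: slope equals -gamma0
```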