Ex. 3.27
Lasso and LAR: Consider the lasso problem in Lagrange multiplier form: with \(L(\beta)=\frac{1}{2}\sum_i(y_i-\sum_j x_{ij}\beta_j)^2\), we minimize
\begin{equation}\label{eq:3-27a}
L(\beta) + \lambda\sum_j|\beta_j|
\end{equation}
for fixed \(\lambda>0\).
(a)
Setting \(\beta_j=\beta_j^+ - \beta_j^-\) with \(\beta^+_j, \beta^-_j\ge 0\), expression \(\eqref{eq:3-27a}\) becomes \(L(\beta) + \lambda\sum_j(\beta_j^++\beta_j^-)\). Show that the Lagrange dual function is
\[
L(\beta) + \lambda\sum_j(\beta_j^++\beta_j^-) - \sum_j\lambda_j^+\beta_j^+ - \sum_j\lambda_j^-\beta_j^-
\]
and the Karush-Kuhn-Tucker optimality conditions are
\begin{align*}
\nabla L(\beta)_j + \lambda - \lambda_j^+ &= 0,\\
-\nabla L(\beta)_j + \lambda - \lambda_j^- &= 0,\\
\lambda_j^+\beta_j^+ &= 0,\\
\lambda_j^-\beta_j^- &= 0,
\end{align*}
along with the non-negativity constraints on the parameters and all the Lagrange multipliers.
(b)
Show that \(|\nabla L(\beta)_j|\le \lambda \ \forall j\), and that the KKT conditions imply one of the following three scenarios:
\begin{align*}
\lambda = 0 \ &\Rightarrow\ \nabla L(\beta)_j = 0\ \forall j,\\
\beta_j^+ > 0,\ \lambda > 0 \ &\Rightarrow\ \lambda_j^+ = 0,\ \nabla L(\beta)_j = -\lambda < 0,\ \beta_j^- = 0,\\
\beta_j^- > 0,\ \lambda > 0 \ &\Rightarrow\ \lambda_j^- = 0,\ \nabla L(\beta)_j = \lambda > 0,\ \beta_j^+ = 0.
\end{align*}
Hence show that for any ``active'' predictor having \(\beta_j\neq 0\), we must have \(\nabla L(\beta)_j = -\lambda\) if \(\beta_j > 0\), and \(\nabla L(\beta)_j = \lambda\) if \(\beta_j < 0\). Assuming the predictors are standardized, relate \(\lambda\) to the correlation between the \(j\)th predictor and the current residuals.
(c)
Suppose that the set of active predictors is unchanged for \(\lambda_0\ge\lambda\ge\lambda_1\). Show that there is a vector \(\gamma_0\) such that
\[
\hat\beta(\lambda) = \hat\beta(\lambda_0) - (\lambda-\lambda_0)\gamma_0.
\]
Thus the lasso solution path is linear as \(\lambda\) ranges from \(\lambda_0\) to \(\lambda_1\) (Efron et al., 2004, Least Angle Regression; Rosset and Zhu, 2007, Piecewise Linear Regularized Solution Paths).
Soln. 3.27
(a)
Now we have the new variables \(\beta_j^+\ge 0\) and \(\beta_j^-\ge 0\), so we introduce Lagrange multipliers \(\lambda_j^+\ge 0\) and \(\lambda_j^-\ge 0\) for these non-negativity constraints, giving the Lagrange function
\[
L(\beta) + \lambda\sum_j(\beta_j^++\beta_j^-) - \sum_j\lambda_j^+\beta_j^+ - \sum_j\lambda_j^-\beta_j^-.
\]
Taking derivatives of the above w.r.t. \(\beta_j^+\) and \(\beta_j^-\) separately (note that \(\partial\beta_j/\partial\beta_j^+ = 1\) and \(\partial\beta_j/\partial\beta_j^- = -1\)), we obtain
\begin{align*}
\nabla L(\beta)_j + \lambda - \lambda_j^+ &= 0,\\
-\nabla L(\beta)_j + \lambda - \lambda_j^- &= 0.
\end{align*}
The other two conditions are the complementary slackness conditions:
\begin{align*}
\lambda_j^+\beta_j^+ &= 0,\\
\lambda_j^-\beta_j^- &= 0.
\end{align*}
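As a quick numerical sanity check of these conditions (not part of the original solution), the sketch below fits a lasso numerically and recovers the multipliers \(\lambda_j^\pm\) from the two stationarity equations. It uses scikit-learn's \texttt{Lasso}, which minimizes \(\tfrac{1}{2N}\|\mathbf{y}-X\beta\|^2+\alpha\|\beta\|_1\), so \(\alpha=\lambda/N\) matches our parametrization; the data, the value of \(\lambda\), and the tolerances are purely illustrative.
\begin{verbatim}
# Sketch: check the KKT conditions of (a) at a numerically computed lasso solution.
# Assumption: sklearn's Lasso minimizes (1/(2N))||y - X b||^2 + alpha*||b||_1,
# so alpha = lambda / N corresponds to the objective L(beta) + lambda * sum_j |beta_j|.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, p = 200, 5
X = rng.standard_normal((N, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)        # standardized predictors
y = X @ np.array([2.0, -1.5, 0.0, 0.0, 0.5]) + rng.standard_normal(N)

lam = 20.0                                      # illustrative lambda
beta = Lasso(alpha=lam / N, fit_intercept=False,
             tol=1e-12, max_iter=100000).fit(X, y).coef_

grad = -X.T @ (y - X @ beta)                    # grad L(beta)_j = -<x_j, y - X beta>
lam_plus = lam + grad                           # from  grad L + lambda - lambda_j^+ = 0
lam_minus = lam - grad                          # from -grad L + lambda - lambda_j^- = 0
beta_plus, beta_minus = np.maximum(beta, 0), np.maximum(-beta, 0)

print("smallest multiplier (>= 0 up to solver tolerance):",
      min(lam_plus.min(), lam_minus.min()))
print("largest |slackness product| (~ 0 up to solver tolerance):",
      max(np.abs(lam_plus * beta_plus).max(), np.abs(lam_minus * beta_minus).max()))
\end{verbatim}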
(b)
From the first two equations in (a) we obtain
\[
\nabla L(\beta)_j = \lambda_j^+ - \lambda \ge -\lambda
\]
and
\[
\nabla L(\beta)_j = \lambda - \lambda_j^- \le \lambda,
\]
since \(\lambda_j^+\ge 0\) and \(\lambda_j^-\ge 0\). Therefore
\[
|\nabla L(\beta)_j| \le \lambda \quad \forall j.
\]
Adding the same two equations gives \(\lambda_j^+ + \lambda_j^- = 2\lambda\). So we have \(\lambda = 0 \Rightarrow \lambda_j^+ = \lambda_j^- = 0 \Rightarrow \nabla L(\beta)_j = 0\).
If \(\beta_j^+ > 0\) and \(\lambda > 0\), from the complementary slackness conditions we know \(\lambda_j^+ = 0\) and thus \(\nabla L(\beta)_j = -\lambda < 0\). Then we have \(\lambda_j^- = 2\lambda > 0\), and by the complementary slackness conditions again we have \(\beta_j^-=0\). Therefore we have
\[
\beta_j^+ > 0,\ \lambda > 0 \ \Rightarrow\ \lambda_j^+ = 0,\ \nabla L(\beta)_j = -\lambda < 0,\ \beta_j^- = 0.
\]
A similar argument yields
\[
\beta_j^- > 0,\ \lambda > 0 \ \Rightarrow\ \lambda_j^- = 0,\ \nabla L(\beta)_j = \lambda > 0,\ \beta_j^+ = 0.
\]
It is then easy to see that for any active predictor (\(\beta_j\neq 0\)) we have \(\nabla L(\beta)_j = -\lambda\,\text{sign}(\beta_j)\), i.e., \(\nabla L(\beta)_j = -\lambda\) if \(\beta_j > 0\) and \(\nabla L(\beta)_j = \lambda\) if \(\beta_j < 0\).
By the definition of \(L(\beta)\), we have
\[
\nabla L(\beta)_j = -\sum_i x_{ij}\Big(y_i-\sum_k x_{ik}\beta_k\Big) = -\langle \mathbf{x}_j, \mathbf{r}\rangle,
\]
where \(\mathbf{x}_j\) is the \(j\)th column of the predictor matrix \(X\) and \(\mathbf{r}=\mathbf{y}-X\beta\) is the vector of current residuals.
Under the assumption that the predictors are standardized, \(\langle \mathbf{x}_j, \mathbf{r}\rangle\) is proportional to the correlation between the \(j\)th predictor and the current residuals, with a proportionality constant (depending only on \(\mathbf{r}\) and \(N\)) that is the same for every \(j\). Hence, on the active set, \(\lambda = |\langle \mathbf{x}_j, \mathbf{r}\rangle|\) equals this common constant times the absolute correlation between the \(j\)th predictor and the current residuals.
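The following sketch (again using scikit-learn's \texttt{Lasso} with \(\alpha=\lambda/N\); the data and \(\lambda\) are illustrative) checks that \(|\langle\mathbf{x}_j,\mathbf{r}\rangle|\le\lambda\) for all \(j\), with equality and matching sign on the active set, and that for standardized predictors \(\langle\mathbf{x}_j,\mathbf{r}\rangle\) is a common multiple of the correlation between \(\mathbf{x}_j\) and \(\mathbf{r}\).
\begin{verbatim}
# Sketch: check the gradient conditions of (b) and the correlation interpretation.
# Assumption: sklearn's Lasso with alpha = lambda / N solves L(beta) + lambda*||beta||_1.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
N, p = 200, 6
X = rng.standard_normal((N, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)        # standardized predictors
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 0.0, 1.0]) + rng.standard_normal(N)

lam = 30.0                                      # illustrative lambda
beta = Lasso(alpha=lam / N, fit_intercept=False,
             tol=1e-12, max_iter=100000).fit(X, y).coef_
r = y - X @ beta                                # current residuals
inner = X.T @ r                                 # <x_j, r> = -grad L(beta)_j

active = beta != 0
print(np.abs(inner).max() <= lam + 1e-2)        # |grad L(beta)_j| <= lambda for all j
print(np.allclose(inner[active], lam * np.sign(beta[active]), atol=1e-2))

# With standardized x_j (mean 0, variance 1), <x_j, r> = N * std(r) * corr(x_j, r),
# so on the active set lambda equals the common factor N * std(r) times the
# absolute correlation between x_j and the residuals.
corr = np.array([np.corrcoef(X[:, j], r)[0, 1] for j in range(p)])
print(np.allclose(inner, N * r.std() * corr))
\end{verbatim}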
(c)
From (b) we have
\[
\nabla L(\hat\beta(\lambda))_j = -\lambda\,\text{sign}\big(\hat\beta_j(\lambda)\big)
\]
for all \(j\in S\), where \(S\) is the set of active variables. Note that if \(j\notin S\), we have \(\hat\beta_j(\lambda)=0\). Therefore, writing \(X_S\) for the columns of \(X\) indexed by \(S\) and \(\hat\beta_S(\lambda)\) for the corresponding coefficients, we can write
\[
-X_S^T\big(\mathbf{y}-X_S\hat\beta_S(\lambda)\big) = -\lambda\,\theta,
\]
where \(\theta\) is a vector function such that
\[
\theta_j = \text{sign}\big(\hat\beta_j(\lambda)\big), \qquad j\in S.
\]
Since the active set is unchanged for \(\lambda_0\ge\lambda\ge\lambda_1\), no active coefficient can cross zero on this interval (it would have to leave the active set to do so), so \(\theta\) does not depend on \(\lambda\) there.
Therefore we solve for \(\hat\beta(\lambda)\) as
\[
\hat\beta_S(\lambda) = \big(X_S^TX_S\big)^{-1}\big(X_S^T\mathbf{y}-\lambda\theta\big),
\qquad \hat\beta_j(\lambda)=0 \ \text{for } j\notin S,
\]
assuming \(X_S^TX_S\) is invertible.
Thus we see
\[
\hat\beta(\lambda) = \hat\beta(\lambda_0) - (\lambda-\lambda_0)\gamma_0,
\]
where
\[
\gamma_{0,S} = \big(X_S^TX_S\big)^{-1}\theta \quad\text{and}\quad \gamma_{0,j}=0 \ \text{for } j\notin S.
\]
Hence the lasso solution path is linear as \(\lambda\) ranges from \(\lambda_0\) to \(\lambda_1\).
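As a final sanity check (not part of the original solution), the sketch below computes \(\gamma_0\) from the active set at \(\lambda_0\) and compares \(\hat\beta(\lambda_0)-(\lambda-\lambda_0)\gamma_0\) with lasso solutions recomputed at intermediate values of \(\lambda\). It assumes, and verifies at the endpoints, that the active set does not change on the chosen interval, and again uses scikit-learn's \texttt{Lasso} with \(\alpha=\lambda/N\); the data and the interval \([\lambda_1,\lambda_0]\) are illustrative.
\begin{verbatim}
# Sketch: piecewise linearity of the lasso path on an interval with a fixed active set.
# Assumptions: sklearn's Lasso with alpha = lambda / N solves L(beta) + lambda*||beta||_1;
# lam0, lam1 are chosen so that the active set does not change in between (checked below).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
N, p = 200, 5
X = rng.standard_normal((N, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)        # standardized predictors
y = X @ np.array([4.0, -3.0, 2.0, 0.0, 0.0]) + 0.5 * rng.standard_normal(N)

def lasso(lam):
    return Lasso(alpha=lam / N, fit_intercept=False,
                 tol=1e-12, max_iter=100000).fit(X, y).coef_

lam0, lam1 = 40.0, 30.0
b0, b1 = lasso(lam0), lasso(lam1)
S = b0 != 0
print(np.array_equal(S, b1 != 0))               # same active set at both endpoints

XS = X[:, S]
gamma0 = np.zeros(p)
gamma0[S] = np.linalg.solve(XS.T @ XS, np.sign(b0[S]))   # gamma_0 restricted to S

for lam in np.linspace(lam1, lam0, 5):
    predicted = b0 - (lam - lam0) * gamma0      # beta(lam) = beta(lam0) - (lam - lam0)*gamma_0
    print(np.allclose(lasso(lam), predicted, atol=1e-4))
\end{verbatim}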