Ex. 3.9

Forward stepwise regression. Suppose we have the QR decomposition for the \(N\times q\) matrix \(\textbf{X}_1\) in a multiple regression problem with response \(\textbf{y}\), and we have an additional \(p-q\) predictors in the matrix \(\textbf{X}_2\). Denote the current residual by \(\textbf{r}\). We wish to establish which one of these additional variables will reduce the residual sum-of-squares the most when included with those in \(\textbf{X}_1\). Describe an efficient procedure for doing this.

Soln. 3.9

Without loss of generality we assume \(\textbf{X}_1=(\textbf{x}_1, ...,\textbf{x}_q)\in\mathbb{R}^{N\times q}\). Write the QR decomposition for \(\textbf{X}_1\) as

\[\begin{equation} \textbf{X}_1 = \textbf{Q}_1\textbf{R}_1,\nonumber \end{equation}\]

where \(\textbf{Q}_1=(\textbf{q}_1, ..., \textbf{q}_q)\) is an \(N\times q\) matrix with orthonormal columns and \(\textbf{R}_1\) is a \(q\times q\) upper triangular matrix. The columns of \(\textbf{Q}_1\) span the column space of \(\textbf{X}_1\).

Now we consider choosing an additional predictor \(\textbf{x}_k\), where \(q < k\le p\). The projection of \(\textbf{x}_k\) onto the column space of \(\textbf{X}_1\), denoted by \(\mathcal{P}_{\textbf{X}_1}(\textbf{x}_k)\), is

\[\begin{equation} \mathcal{P}_{\textbf{X}_1}(\textbf{x}_k) = \sum_{j=1}^q(\textbf{x}_k^T\textbf{q}_j) \textbf{q}_j.\nonumber \end{equation}\]

Define

\[\begin{eqnarray} \textbf{r}_k &=& \textbf{x}_k - \mathcal{P}_{\textbf{X}_1}(\textbf{x}_k),\nonumber\\ \textbf{q}_k &=& \textbf{r}_k/\|\textbf{r}_k\|.\nonumber \end{eqnarray}\]
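This projection-and-residualization step is just one round of Gram–Schmidt against the columns of \(\textbf{Q}_1\). A minimal NumPy sketch, assuming `Q1` holds the orthonormal columns \(\textbf{q}_1, ..., \textbf{q}_q\) and `xk` is the candidate column (variable names are ours, for illustration only):

```python
import numpy as np

def residualize(Q1, xk):
    """Orthogonalize the candidate xk against the columns of Q1.

    Returns (rk, qk): the residual of xk after projecting onto
    span(Q1), and its normalized version.
    Assumes xk does not already lie in span(Q1).
    """
    proj = Q1 @ (Q1.T @ xk)        # P_{X1}(x_k) = sum_j (x_k^T q_j) q_j
    rk = xk - proj
    qk = rk / np.linalg.norm(rk)
    return rk, qk
```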

If we add \(\textbf{x}_k\) to \(\textbf{X}_1\), the new fitted value, namely the projection of \(\textbf{y}\) onto the space spanned by the columns of \(\textbf{X}_1\) together with \(\textbf{x}_k\), becomes

\[\begin{equation} \hat{\textbf{y}} + (\textbf{q}_k^T\textbf{y})\textbf{q}_k = \hat{\textbf{y}} + (\textbf{q}_k^T\textbf{r})\textbf{q}_k,\nonumber \end{equation}\]

where the second equality holds because \(\textbf{q}_k\) is orthogonal to the column space of \(\textbf{X}_1\), so that \(\textbf{q}_k^T\textbf{y} = \textbf{q}_k^T(\hat{\textbf{y}} + \textbf{r}) = \textbf{q}_k^T\textbf{r}\). The residual sum-of-squares is therefore reduced by

\[\begin{equation} (\textbf{q}_k^T\textbf{r})^2.\nonumber \end{equation}\]
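This reduction is easy to check numerically. The sketch below (random data, hypothetical names) compares the RSS drop from refitting with \(\textbf{x}_k\) included against \((\textbf{q}_k^T\textbf{r})^2\):

```python
import numpy as np

rng = np.random.default_rng(0)
N, q = 50, 3
X1 = rng.standard_normal((N, q))
xk = rng.standard_normal(N)
y = rng.standard_normal(N)

Q1, _ = np.linalg.qr(X1)              # thin QR of X1
yhat = Q1 @ (Q1.T @ y)                # fit on X1 alone
r = y - yhat                          # current residual

rk = xk - Q1 @ (Q1.T @ xk)            # residualize the candidate
qk = rk / np.linalg.norm(rk)

X = np.column_stack([X1, xk])         # refit with xk included
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
rss_new = np.sum((y - X @ beta) ** 2)

# The RSS drop equals (qk^T r)^2 up to floating-point error.
assert np.isclose(np.sum(r ** 2) - rss_new, (qk @ r) ** 2)
```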

Therefore, with \(q\) predictors currently in the model, we choose the variable \(k\) that gives the largest reduction, i.e.,

\[\begin{equation} k = \underset{q< k\le p}{\operatorname{argmax}}\ (\textbf{q}_k^T\textbf{r})^2.\nonumber \end{equation}\]

Note that the calculations above only require \(\textbf{q}_1, ..., \textbf{q}_q\), which are already available as the columns of \(\textbf{Q}_1\) from the QR decomposition. Once the best candidate \(\textbf{x}_k\) is found, we rename its \(\textbf{q}_k\) to \(\textbf{q}_{q+1}\), augment \(\textbf{Q}_1\) to \(\tilde{\textbf{Q}}_1=(\textbf{q}_1, ..., \textbf{q}_q, \textbf{q}_{q+1})\), update the residual to \(\textbf{r} - (\textbf{q}_{q+1}^T\textbf{r})\textbf{q}_{q+1}\), and continue to the next iteration with \(q+1\) predictors. The remaining candidates need not be re-orthogonalized from scratch either: each \(\textbf{r}_k\) only has to be orthogonalized against the single new column \(\textbf{q}_{q+1}\), as sketched below.
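Putting it all together, one selection step might look like the following sketch (all names hypothetical): residualize every remaining candidate against \(\textbf{Q}_1\), score each by \((\textbf{q}_k^T\textbf{r})^2\), and return the winner together with the augmented basis and updated residual.

```python
import numpy as np

def forward_step(Q1, X2, r):
    """One step of forward stepwise selection.

    Q1 : (N, q) orthonormal basis for the current predictors.
    X2 : (N, p - q) matrix of remaining candidate predictors.
    r  : current residual y - yhat.

    Returns the index of the best candidate, the augmented basis,
    and the updated residual.
    Assumes no candidate lies exactly in span(Q1).
    """
    R2 = X2 - Q1 @ (Q1.T @ X2)            # columns are the r_k
    norms = np.linalg.norm(R2, axis=0)
    Qc = R2 / norms                       # columns are the q_k
    scores = (Qc.T @ r) ** 2              # RSS reduction per candidate
    k = int(np.argmax(scores))
    q_new = Qc[:, k]
    Q_aug = np.column_stack([Q1, q_new])  # \tilde{Q}_1
    r_new = r - (q_new @ r) * q_new       # updated residual
    return k, Q_aug, r_new
```

To run the full forward stepwise search, drop the chosen column from `X2` and call `forward_step` again with the augmented basis; the previous `R2` could be reused by orthogonalizing it against only the new column rather than against all of `Q_aug`.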

Remark

The value of \(\textbf{R}_1\) plays no role in the selection itself; it is only needed to recover the coefficient estimates, by back-solving \(\textbf{R}_1\hat{\beta} = \textbf{Q}_1^T\textbf{y}\), once the model is chosen.