Ex. 9.5

Degrees of freedom of a tree. Given data $y_i$ with mean $f(x_i)$ and variance $\sigma^2$, and a fitting operation $y \rightarrow \hat{y}$, let's define the degrees of freedom of a fit by $\sum_i \text{cov}(y_i, \hat{y}_i)/\sigma^2$.

Consider a fit $\hat{y}$ estimated by a regression tree, fit to a set of predictors $X_1, X_2, \ldots, X_p$.

(a) In terms of the number of terminal nodes $m$, give a rough formula for the degrees of freedom of the fit.

(b) Generate 100 observations with predictors $X_1, X_2, \ldots, X_{10}$ as independent standard Gaussian variates and fix these values.

(c) Generate response values also as standard Gaussian ($\sigma^2 = 1$), independent of the predictors. Fit regression trees to the data of fixed size 1, 5, and 10 terminal nodes, and hence estimate the degrees of freedom of each fit. [Do ten simulations of the response and average the results, to get a good estimate of degrees of freedom.]

(d) Compare your estimates of degrees of freedom in (a) and (c) and discuss.

(e) If the regression tree fit were a linear operation, we could write $\hat{y} = Sy$ for some matrix $S$. Then the degrees of freedom would be $\text{tr}(S)$. Suggest a way to compute an approximate $S$ matrix for a regression tree, compute it and compare the resulting degrees of freedom to those in (a) and (c).

Soln. 9.5

(a) Suppose we have a single terminal node, i.e., $m = 1$. Then $\hat{y}_i = \bar{y}$, so that

$$\sum_i \text{cov}(y_i, \hat{y}_i)/\sigma^2 = n \cdot \frac{\sigma^2/n}{\sigma^2} = 1.$$

Consider the case when $m = 2$, with two regions $R_1$ and $R_2$. For $i \in R_1$, $\hat{y}_i = \frac{1}{|R_1|}\sum_{j \in R_1} y_j$, and similarly for those in $R_2$. We have

$$\sum_i \text{cov}(y_i, \hat{y}_i)/\sigma^2 = \sum_{i \in R_1} \frac{\sigma^2/|R_1|}{\sigma^2} + \sum_{i \in R_2} \frac{\sigma^2/|R_2|}{\sigma^2} = 2.$$

Similar arguments apply for general $m$ (see the computation below), so we guess that the degrees of freedom are $m$. See *On Measuring and Correcting the Effects of Data Mining and Model Selection* (Ye, 1998) for more details on generalized degrees of freedom (GDF).
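Explicitly, under the idealization that the $m$ regions $R_1, \ldots, R_m$ are fixed in advance (the tree actually chooses them using $y$, so this is only a rough guide), the same computation gives

$$\sum_i \text{cov}(y_i, \hat{y}_i)/\sigma^2 = \sum_{k=1}^{m} \sum_{i \in R_k} \frac{\sigma^2/|R_k|}{\sigma^2} = \sum_{k=1}^{m} 1 = m.$$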

(b) See Code below.

The code and the numbers below need revisiting.

(c) We estimated the degrees of freedom to be 1, 5.73, and 10.66, respectively. See Code below.

(d) We see that our rough guess of the degrees of freedom is close to, but slightly below, the values obtained from simulation. This is to be expected: the tree chooses its splits adaptively using $y$, so the fitted values depend on $y$ more strongly than averaging over fixed regions would, which inflates $\sum_i \text{cov}(y_i, \hat{y}_i)$ beyond $m$.

(e) In fact, the argument in (a) implicitly assumes that the fit is a linear operation. When $m = 1$, $S$ is the $n \times n$ matrix with all elements equal to $1/n$. For general $m$, take $s_{ij} = 1/|R_k|$ if the $i$th and $j$th observations fall into the same region $R_k$, and $s_{ij} = 0$ otherwise. Therefore $\text{tr}(S) = \sum_{k=1}^{m} |R_k| \cdot \frac{1}{|R_k|} = m$, matching (a). This $S$ is only approximate, since the regions themselves are chosen using $y$; there may be better perspectives on this problem.
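Below is a minimal sketch of computing this approximate $S$ for a fitted tree, assuming scikit-learn's `DecisionTreeRegressor`; its `apply` method returns the leaf index of each observation, from which the block-averaging matrix can be assembled.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

np.random.seed(42)
n, p = 100, 10
X = np.random.normal(size=(n, p))
y = np.random.normal(size=n)

# Fit a tree with m = 10 terminal nodes and find each observation's leaf.
reg = DecisionTreeRegressor(max_leaf_nodes=10).fit(X, y)
leaves = reg.apply(X)

# Build S: s_ij = 1/|R_k| when observations i and j share leaf R_k, else 0.
S = np.zeros((n, n))
for leaf in np.unique(leaves):
    idx = np.where(leaves == leaf)[0]
    S[np.ix_(idx, idx)] = 1.0 / len(idx)

# Sanity checks: S y reproduces the tree's fitted values, and tr(S) = m.
assert np.allclose(S @ y, reg.predict(X))
print('tr(S) =', np.trace(S))  # equals the number of terminal nodes grown
```

The trace equals $m$ exactly, in line with (a), while the simulation in (c) gives slightly larger values because $S$ itself depends on $y$.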

Code
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

np.random.seed(42)

# (b) Fix n = 100 observations on p = 10 independent standard Gaussian predictors.
p = 10
n = 100
X = np.random.multivariate_normal(np.zeros(p), np.identity(p), n)

# (c) Simulate N independent standard Gaussian response vectors (sigma^2 = 1).
N = 20
y = np.random.normal(0, 1, size=(n, N))

# For m = 1 the fit is the constant y_bar, so df = 1 analytically;
# scikit-learn requires max_leaf_nodes >= 2, so we simulate m = 5 and m = 10.
for numLeaves in [5, 10]:
    y_pred = np.zeros_like(y)
    for i in range(N):
        reg = DecisionTreeRegressor(max_leaf_nodes=numLeaves)
        reg.fit(X, y[:, i])
        y_pred[:, i] = reg.predict(X)

    # df = sum_i cov(y_i, y_hat_i) / sigma^2, with each covariance estimated
    # across the N simulations; sigma^2 = 1, so no rescaling is needed.
    df = sum(np.cov(y[i, :], y_pred[i, :], ddof=0)[0][1] for i in range(n))
    print('Estimated degrees of freedom with {} terminal nodes: {:.2f}'.format(numLeaves, df))
```