Ex. 9.5

Degrees of freedom of a tree. Given data $y_i$ with mean $f(x_i)$ and variance $\sigma^2$, and a fitting operation $y \rightarrow \hat{y}$, let's define the degrees of freedom of a fit by $\sum_i \text{cov}(y_i, \hat{y}_i)/\sigma^2$.

Consider a fit $\hat{y}$ estimated by a regression tree, fit to a set of predictors $X_1, X_2, \ldots, X_p$.

(a) In terms of the number of terminal nodes $m$, give a rough formula for the degrees of freedom of the fit.

(b) Generate 100 observations with predictors $X_1, X_2, \ldots, X_{10}$ as independent standard Gaussian variates and fix these values.

(c) Generate response values also as standard Gaussian ($\sigma^2 = 1$), independent of the predictors. Fit regression trees to the data of fixed size 1, 5, and 10 terminal nodes, and hence estimate the degrees of freedom of each fit. [Do ten simulations of the response and average the results, to get a good estimate of degrees of freedom.]

(d) Compare your estimates of degrees of freedom in (a) and (c) and discuss.

(e) If the regression tree fit were a linear operation, we could write $\hat{y} = Sy$ for some matrix $S$. Then the degrees of freedom would be $\text{tr}(S)$. Suggest a way to compute an approximate $S$ matrix for a regression tree, compute it and compare the resulting degrees of freedom to those in (a) and (c).

Soln. 9.5

(a) Suppose we have a single terminal node, i.e., $m = 1$. Then $\hat{y}_i = \bar{y}$, so that

$$\sum_i \text{cov}(y_i, \hat{y}_i)/\sigma^2 = n \cdot \frac{\sigma^2/n}{\sigma^2} = 1.$$

Consider the case when $m = 2$, with two regions $R_1$ and $R_2$. For $i \in R_1$, $\hat{y}_i = \frac{1}{|R_1|}\sum_{j \in R_1} y_j$, and similarly for those in $R_2$. We have

$$\sum_i \text{cov}(y_i, \hat{y}_i)/\sigma^2 = \sum_{i \in R_1} \frac{\sigma^2/|R_1|}{\sigma^2} + \sum_{i \in R_2} \frac{\sigma^2/|R_2|}{\sigma^2} = 2.$$

Similar arguments apply for general $m$ (see the computation below), so we guess that the degrees of freedom are $m$. See *On Measuring and Correcting the Effects of Data Mining and Model Selection* (Ye, 1998) for more details on generalized degrees of freedom (GDF).
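Explicitly, under the idealization that the $m$ regions $R_1, \ldots, R_m$ are fixed in advance (the tree actually chooses them using $y$, so this is only a rough guide), the same computation gives

$$\sum_i \text{cov}(y_i, \hat{y}_i)/\sigma^2 = \sum_{k=1}^{m} \sum_{i \in R_k} \frac{\sigma^2/|R_k|}{\sigma^2} = \sum_{k=1}^{m} 1 = m.$$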

(b) See Code below.

The code and the numbers below need revisiting.

(c) We estimated the degrees of freedom to be 1, 5.73, and 10.66, respectively. See Code below.

(d) We see that our rough guess of the degrees of freedom is close to, but slightly below, the values obtained from simulation. This is to be expected: the tree chooses its splits adaptively using $y$, so the fitted values depend on $y$ more strongly than averaging over fixed regions would, which inflates $\sum_i \text{cov}(y_i, \hat{y}_i)$ beyond $m$.

(e) In fact, the argument in (a) implicitly assumes that the fit is a linear operation. When $m = 1$, $S$ is the $n \times n$ matrix with all elements equal to $1/n$. For general $m$, take $s_{ij} = 1/|R_k|$ if the $i$th and $j$th observations fall into the same region $R_k$, and $s_{ij} = 0$ otherwise. Therefore $\text{tr}(S) = \sum_{k=1}^{m} |R_k| \cdot \frac{1}{|R_k|} = m$, matching (a). This $S$ is only approximate, since the regions themselves are chosen using $y$; there may be better perspectives on this problem.
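Below is a minimal sketch of computing this approximate $S$ for a fitted tree, assuming scikit-learn's `DecisionTreeRegressor`; its `apply` method returns the leaf index of each observation, from which the block-averaging matrix can be assembled.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

np.random.seed(42)
n, p = 100, 10
X = np.random.normal(size=(n, p))
y = np.random.normal(size=n)

# Fit a tree with m = 10 terminal nodes and find each observation's leaf.
reg = DecisionTreeRegressor(max_leaf_nodes=10).fit(X, y)
leaves = reg.apply(X)

# Build S: s_ij = 1/|R_k| when observations i and j share leaf R_k, else 0.
S = np.zeros((n, n))
for leaf in np.unique(leaves):
    idx = np.where(leaves == leaf)[0]
    S[np.ix_(idx, idx)] = 1.0 / len(idx)

# Sanity checks: S y reproduces the tree's fitted values, and tr(S) = m.
assert np.allclose(S @ y, reg.predict(X))
print('tr(S) =', np.trace(S))  # equals the number of terminal nodes grown
```

The trace equals $m$ exactly, in line with (a), while the simulation in (c) gives slightly larger values because $S$ itself depends on $y$.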

Code
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

np.random.seed(42)

# (b) Fix n = 100 observations on p = 10 independent standard Gaussian predictors.
p = 10
n = 100
X = np.random.multivariate_normal(np.zeros(p), np.identity(p), n)

# (c) Simulate N independent standard Gaussian response vectors (sigma^2 = 1).
N = 20
y = np.random.normal(0, 1, size=(n, N))

# For m = 1 the fit is the constant y_bar, so df = 1 analytically;
# scikit-learn requires max_leaf_nodes >= 2, so we simulate m = 5 and m = 10.
for numLeaves in [5, 10]:
    y_pred = np.zeros_like(y)
    for i in range(N):
        reg = DecisionTreeRegressor(max_leaf_nodes=numLeaves)
        reg.fit(X, y[:, i])
        y_pred[:, i] = reg.predict(X)

    # df = sum_i cov(y_i, y_hat_i) / sigma^2, with each covariance estimated
    # across the N simulations; sigma^2 = 1, so no rescaling is needed.
    df = sum(np.cov(y[i, :], y_pred[i, :], ddof=0)[0][1] for i in range(n))
    print('Estimated degrees of freedom with {} terminal nodes: {:.2f}'.format(numLeaves, df))
```