Degrees of freedom of a tree. Given data $y_i$ with mean $f(x_i)$ and variance $\sigma^2$, and a fitting operation $y \to \hat{y}$, let's define the degrees of freedom of a fit by $\sum_i \text{cov}(y_i, \hat{y}_i)/\sigma^2$.
Consider a fit $\hat{y}$ estimated by a regression tree, fit to a set of predictors $X_1, X_2, \ldots, X_{10}$.
(a) In terms of the number of terminal nodes $m$, give a rough formula for the degrees of freedom of the fit.
(b) Generate 100 observations with predictors $X_1, X_2, \ldots, X_{10}$ as independent standard Gaussian variates and fix these values.
(c) Generate response values also as standard Gaussian ($\sigma^2 = 1$), independent of the predictors. Fit regression trees to the data of fixed size 1, 5 and 10 terminal nodes and hence estimate the degrees of freedom of each fit. [Do ten simulations of the response and average the results, to get a good estimate of degrees of freedom.]
(d) Compare your estimates of degrees of freedom in (a) and (c) and discuss.
(e) If the regression tree fit were a linear operation, we could write $\hat{y} = Sy$ for some matrix $S$. Then the degrees of freedom would be $\text{tr}(S)$. Suggest a way to compute an approximate $S$ matrix for a regression tree, compute it and compare the resulting degrees of freedom to those in (a) and (c).
Soln. 9.5
(a) Suppose we have a single terminal node, i.e., $m = 1$. Then $\hat{y}_i = \bar{y} = \frac{1}{N}\sum_{j=1}^N y_j$, so that

$$\text{df} = \frac{1}{\sigma^2}\sum_{i=1}^N \text{cov}(y_i, \bar{y}) = \frac{1}{\sigma^2}\sum_{i=1}^N \frac{\sigma^2}{N} = 1.$$

Consider the case when $m = 2$ with two regions $R_1$ and $R_2$, containing $N_1$ and $N_2$ observations respectively. For $i \in R_1$, $\hat{y}_i = \frac{1}{N_1}\sum_{j \in R_1} y_j$, so $\text{cov}(y_i, \hat{y}_i) = \sigma^2/N_1$. Similarly for those in $R_2$.

We have

$$\text{df} = \frac{1}{\sigma^2}\left(\sum_{i \in R_1} \frac{\sigma^2}{N_1} + \sum_{i \in R_2} \frac{\sigma^2}{N_2}\right) = 1 + 1 = 2.$$

The same argument applies to any number of regions, so a rough formula is $\text{df} \approx m$ for a tree with $m$ terminal nodes.
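As a sanity check on this calculation (not part of the original derivation), here is a small Monte Carlo sketch: for a *fixed* partition into 5 regions, averaging $y$ within each region gives an estimated df close to 5. The partition `regions` below is hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_sims = 100, 20000
# A hypothetical fixed partition of the 100 observations into 5 regions.
regions = np.repeat(np.arange(5), n // 5)

y = rng.standard_normal((n_sims, n))  # sigma^2 = 1
y_hat = np.empty_like(y)
for k in range(5):
    mask = regions == k
    # Fit = within-region mean, with the partition held fixed.
    y_hat[:, mask] = y[:, mask].mean(axis=1, keepdims=True)

# df = sum_i cov(y_i, y_hat_i) / sigma^2, estimated across simulations.
df = sum(np.cov(y[:, i], y_hat[:, i], ddof=0)[0, 1] for i in range(n))
print(df)  # close to 5, up to Monte Carlo error
```

The key contrast with the exercise is that here the partition does not depend on $y$, so the fit really is linear and df matches the number of regions.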
(c) We estimate the degrees of freedom to be 1, 5.73 and 10.66, respectively. See the code below.
(d) Our rough guess of the degrees of freedom in (a) is close to, but underestimates, the values obtained from simulation. This is expected: the tree chooses its split points adaptively from the data, and this search consumes additional degrees of freedom beyond the $m$ region means.
(e) In fact, the argument in (a) implicitly treats the fit as a linear operation. When $m = 1$, $S$ is an $N \times N$ matrix with all elements equal to $1/N$. For general $m$, take $S_{ij} = 1/N_k$ if the $i$th and $j$th observations fall into the same region $R_k$ (containing $N_k$ observations), and $S_{ij} = 0$ otherwise. Therefore $\text{tr}(S) = \sum_{k=1}^m N_k \cdot \frac{1}{N_k} = m$, agreeing with (a) and again underestimating the simulated values in (c): the approximation treats the partition as fixed, while in truth $S$ depends on $y$ through the adaptively chosen splits, so the fit is not truly linear. There should be other, better perspectives on this problem.
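The construction of $S$ described above can be sketched directly from a fitted tree's leaf assignments (a sketch assuming scikit-learn; the tree size and data here are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 10))
y = rng.standard_normal(100)

tree = DecisionTreeRegressor(max_leaf_nodes=5).fit(X, y)
leaf = tree.apply(X)  # leaf id of each training observation

# S[i, j] = 1/N_k if observations i and j share leaf k, else 0.
same_leaf = leaf[:, None] == leaf[None, :]
counts = same_leaf.sum(axis=1)        # N_k for each row's leaf
S = same_leaf / counts[:, None]

print(np.trace(S))                    # approximately m = 5, the number of leaves
```

With the partition fixed, `S @ y` reproduces `tree.predict(X)` (each prediction is the within-leaf mean), and the trace equals the leaf count, matching (a).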
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

np.random.seed(42)

# (b) Fix 100 observations of 10 independent standard Gaussian predictors.
p = 10
n = 100
X = np.random.multivariate_normal(np.zeros(p), np.identity(p), n)

# (c) Simulate the response N times, independent of the predictors.
N = 20
y = np.random.normal(0, 1, size=(n, N))
y_pred = np.copy(y)

# Note: sklearn requires max_leaf_nodes >= 2; a single terminal node gives
# y_hat = y_bar and hence df = 1 exactly, with no fit needed.
numLeaves = 10  # repeat with numLeaves = 5
for i in range(N):
    reg = DecisionTreeRegressor(max_leaf_nodes=numLeaves)
    reg.fit(X, y[:, i])
    y_pred[:, i] = reg.predict(X)

# df = sum_i cov(y_i, y_hat_i) / sigma^2, with sigma^2 = 1.
df = 0.0
for i in range(n):
    df += np.cov(y[i, :], y_pred[i, :], ddof=0)[0][1]
print('Estimated degrees of freedom with {} terminal nodes: {}'.format(numLeaves, df))
```