Ex. 2.7

Suppose we have a sample of $N$ pairs $(x_i, y_i)$ drawn i.i.d. from the distribution characterized as follows:

$$
\begin{aligned}
x_i &\sim h(x), && \text{the design density},\\
y_i &= f(x_i) + \epsilon_i, && f \text{ is the regression function},\\
\epsilon_i &\sim (0, \sigma^2) && \text{(mean zero, variance } \sigma^2\text{)}.
\end{aligned}
$$

We construct an estimate for $f$ linear in the $y_i$,

$$\hat{f}(x_0) = \sum_{i=1}^N \ell_i(x_0; \mathcal{X})\, y_i, \tag{1}$$

where the weights $\ell_i(x_0; \mathcal{X})$ do not depend on the $y_i$, but do depend on the entire training sequence of $x_i$, denoted here by $\mathcal{X}$.

(a)

Show that linear regression and $k$-nearest-neighbor regression are members of this class of estimators. Describe explicitly the weights $\ell_i(x_0; \mathcal{X})$ in each of these cases.

(b)

Decompose the conditional mean-squared error

$$E_{\mathcal{Y}|\mathcal{X}}\big(f(x_0) - \hat{f}(x_0)\big)^2$$

into a conditional squared bias and a conditional variance component. Like $\mathcal{X}$, $\mathcal{Y}$ represents the entire training sequence of $y_i$.

(c)

Decompose the (unconditional) mean-squared error

$$E_{\mathcal{Y},\mathcal{X}}\big(f(x_0) - \hat{f}(x_0)\big)^2$$

into a squared bias and a variance component.

(d)

Establish a relationship between the squared biases and variances in the above two cases.

Remark

A smoother $\hat{\mathbf{f}}$ is called a linear smoother (see Section 5.4 in the text for more details) if it has the form

$$\hat{\mathbf{f}} = \mathbf{S}\mathbf{y}.$$

Note that linearity implies that $\mathbf{S}$ does not depend on $\mathbf{y}$. For linear regression, $\mathbf{S} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$. As for the bias and variance, we have

$$\text{Bias}(\hat{\mathbf{f}}) = \mathbf{f} - \mathbf{S}\mathbf{f}, \quad \text{Cov}(\hat{\mathbf{f}}) = \mathbf{S}\,\text{Cov}(\mathbf{y})\,\mathbf{S}^T.$$
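
As a quick numerical sanity check of this remark (not part of the original solution; the random design, the dimensions, and the use of NumPy are assumptions), the sketch below builds the smoother matrix $\mathbf{S}$ for least squares, confirms that $\mathbf{S}\mathbf{y}$ reproduces the usual fitted values, and forms $\text{Cov}(\hat{\mathbf{f}}) = \sigma^2\mathbf{S}\mathbf{S}^T$ for the case $\text{Cov}(\mathbf{y}) = \sigma^2 I$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, sigma = 50, 3, 0.5

X = rng.normal(size=(N, p))              # toy design matrix (assumption)
S = X @ np.linalg.inv(X.T @ X) @ X.T     # smoother matrix: depends on X only, not on y

beta_true = rng.normal(size=p)
f_vec = X @ beta_true                    # true regression function at the x_i
y = f_vec + sigma * rng.normal(size=N)   # observed responses

# S y reproduces the ordinary least-squares fitted values
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(S @ y, X @ beta_hat)

# With Cov(y) = sigma^2 I, Cov(f_hat) = S Cov(y) S^T = sigma^2 S S^T
cov_fhat = sigma**2 * S @ S.T
print(cov_fhat.shape)                    # (N, N)
```
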
Soln. 2.7
(a)

For linear regression, we have

$$\hat{f}(x_0) = [x_0^T, 1]\,(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y},$$

so that

$$\ell_i(x_0; \mathcal{X}) = [x_0^T, 1]\,(\mathbf{X}^T\mathbf{X})^{-1}\begin{pmatrix} x_i \\ 1 \end{pmatrix}.$$

For k-nearest-neighbor regression, we have

$$\hat{f}(x_0) = \frac{1}{k}\sum_{x_i \in N_k(x_0)} y_i,$$

where $N_k(x_0)$ is the neighborhood of $x_0$ defined by the $k$ closest points $x_i$ in the training sample. Therefore,

$$\ell_i(x_0; \mathcal{X}) = \begin{cases} \dfrac{1}{k}, & \text{if } x_i \in N_k(x_0),\\[4pt] 0, & \text{otherwise.} \end{cases}$$
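
To make these weights concrete, here is a small sketch (the toy data, scalar inputs, and variable names are assumptions) that forms $\ell_i(x_0;\mathcal{X})$ explicitly for least squares with an intercept and for $k$-NN, and checks that $\sum_i \ell_i(x_0;\mathcal{X})\,y_i$ reproduces the corresponding predictions.

```python
import numpy as np

rng = np.random.default_rng(1)
N, k, sigma = 30, 5, 0.3
x = rng.uniform(-2, 2, size=N)                  # scalar inputs (assumption)
y = np.sin(x) + sigma * rng.normal(size=N)      # toy f(x) = sin(x) plus noise
x0 = 0.7

# Linear regression with intercept: ell_i = [x0, 1] (X^T X)^{-1} (x_i, 1)^T
Xmat = np.column_stack([x, np.ones(N)])
A = np.linalg.inv(Xmat.T @ Xmat)
x0_vec = np.array([x0, 1.0])
ell_ols = np.array([x0_vec @ A @ Xmat[i] for i in range(N)])
beta_hat, *_ = np.linalg.lstsq(Xmat, y, rcond=None)
assert np.allclose(ell_ols @ y, x0_vec @ beta_hat)

# k-NN: ell_i = 1/k if x_i is among the k nearest neighbours of x0, else 0
nn = np.argsort(np.abs(x - x0))[:k]
ell_knn = np.zeros(N)
ell_knn[nn] = 1.0 / k
assert np.allclose(ell_knn @ y, y[nn].mean())
```
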
(b)

Note that $\mathcal{X}$ is fixed and the randomness comes from $\mathcal{Y}$ only. We have

$$
\begin{aligned}
E_{\mathcal{Y}|\mathcal{X}}\big(f(x_0) - \hat{f}(x_0)\big)^2
&= f(x_0)^2 - 2f(x_0)E_{\mathcal{Y}|\mathcal{X}}\big(\hat{f}(x_0)\big) + E_{\mathcal{Y}|\mathcal{X}}\big(\hat{f}(x_0)^2\big)\\
&= \Big(f(x_0) - E_{\mathcal{Y}|\mathcal{X}}\big(\hat{f}(x_0)\big)\Big)^2 + E_{\mathcal{Y}|\mathcal{X}}\big(\hat{f}(x_0)^2\big) - \Big(E_{\mathcal{Y}|\mathcal{X}}\big(\hat{f}(x_0)\big)\Big)^2\\
&= \text{Bias}_{\mathcal{Y}|\mathcal{X}}\big(\hat{f}(x_0)\big)^2 + \text{Var}_{\mathcal{Y}|\mathcal{X}}\big(\hat{f}(x_0)\big).
\end{aligned}
$$
(c)

The calculation is the same as in (b); we have

$$
\begin{aligned}
E_{\mathcal{Y},\mathcal{X}}\big(f(x_0) - \hat{f}(x_0)\big)^2
&= f(x_0)^2 - 2f(x_0)E_{\mathcal{Y},\mathcal{X}}\big(\hat{f}(x_0)\big) + E_{\mathcal{Y},\mathcal{X}}\big(\hat{f}(x_0)^2\big)\\
&= \Big(f(x_0) - E_{\mathcal{Y},\mathcal{X}}\big(\hat{f}(x_0)\big)\Big)^2 + E_{\mathcal{Y},\mathcal{X}}\big(\hat{f}(x_0)^2\big) - \Big(E_{\mathcal{Y},\mathcal{X}}\big(\hat{f}(x_0)\big)\Big)^2\\
&= \text{Bias}\big(\hat{f}(x_0)\big)^2 + \text{Var}\big(\hat{f}(x_0)\big).
\end{aligned}
$$
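
Both decompositions can be illustrated by simulation. The sketch below (a rough illustration; the choice of $f = \sin$, a uniform design density, $k$-NN as the estimator, and the helper names `knn_fit` and `decompose` are all assumptions) estimates the mean-squared error and the sum of squared bias and variance at a fixed $x_0$, once with the $x_i$ held fixed across replications as in (b), and once redrawing them each time as in (c).

```python
import numpy as np

rng = np.random.default_rng(2)
N, k, sigma, x0, reps = 60, 7, 0.4, 0.5, 5000
f = np.sin                                     # toy regression function (assumption)

def knn_fit(x, y, x0, k):
    """k-nearest-neighbour estimate of f(x0)."""
    nn = np.argsort(np.abs(x - x0))[:k]
    return y[nn].mean()

def decompose(fix_design):
    """Return (MSE, squared bias + variance) over repeated samples."""
    x_fixed = rng.uniform(-2, 2, size=N)
    preds = np.empty(reps)
    for r in range(reps):
        x = x_fixed if fix_design else rng.uniform(-2, 2, size=N)
        y = f(x) + sigma * rng.normal(size=N)
        preds[r] = knn_fit(x, y, x0, k)
    mse = np.mean((f(x0) - preds) ** 2)
    bias2 = (f(x0) - preds.mean()) ** 2
    return mse, bias2 + preds.var()

print(decompose(fix_design=True))    # part (b): X held fixed, the two numbers match
print(decompose(fix_design=False))   # part (c): X redrawn each time, the two numbers match
```

Note that within each run the two reported numbers agree essentially exactly: the decomposition is an algebraic identity for the empirical distribution of the predictions, just as it is for the true distribution.
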
(d)

From (b) we already see that $\text{Bias}_{\mathcal{Y}|\mathcal{X}}\big(\hat{f}(x_0)\big)$ can be written as

$$
\begin{aligned}
f(x_0) - E_{\mathcal{Y}|\mathcal{X}}\hat{f}(x_0)
&= f(x_0) - \sum_{i=1}^N E_{\mathcal{Y}|\mathcal{X}}\Big[\ell_i(x_0;\mathcal{X})\big(f(x_i) + \epsilon_i\big)\Big]\\
&= f(x_0) - \sum_{i=1}^N \ell_i(x_0;\mathcal{X})\,f(x_i). 
\end{aligned} \tag{2}
$$

Also, we can write $\text{Var}_{\mathcal{Y}|\mathcal{X}}\big(\hat{f}(x_0)\big)$ as

$$
\begin{aligned}
E_{\mathcal{Y}|\mathcal{X}}\big(\hat{f}(x_0)^2\big) - \Big(E_{\mathcal{Y}|\mathcal{X}}\big(\hat{f}(x_0)\big)\Big)^2
&= E_{\mathcal{Y}|\mathcal{X}}\Bigg[\sum_{i=1}^N\sum_{j=1}^N \ell_i(x_0;\mathcal{X})\ell_j(x_0;\mathcal{X})\big(f(x_i)+\epsilon_i\big)\big(f(x_j)+\epsilon_j\big)\Bigg] - \Bigg(\sum_{i=1}^N \ell_i(x_0;\mathcal{X})f(x_i)\Bigg)^2\\
&= \sum_{i=1}^N\sum_{j=1}^N \ell_i(x_0;\mathcal{X})\ell_j(x_0;\mathcal{X})f(x_i)f(x_j) + \sigma^2\sum_{i=1}^N \ell_i^2(x_0;\mathcal{X}) - \sum_{i=1}^N\sum_{j=1}^N \ell_i(x_0;\mathcal{X})\ell_j(x_0;\mathcal{X})f(x_i)f(x_j)\\
&= \sigma^2\sum_{i=1}^N \ell_i^2(x_0;\mathcal{X}),
\end{aligned}
$$

where the cross terms vanish because the $\epsilon_i$ are independent with mean zero, so $E[\epsilon_i\epsilon_j] = \sigma^2$ if $i = j$ and $0$ otherwise.

Denote $S = \big(\ell_1(x_0;\mathcal{X}), \ldots, \ell_N(x_0;\mathcal{X})\big)^T$ and $\mathbf{f} = \big(f(x_1), \ldots, f(x_N)\big)^T$. By (2) and the equation above, we have

$$\text{Bias}_{\mathcal{Y}|\mathcal{X}}\big(\hat{f}(x_0)\big) = f(x_0) - S^T\mathbf{f}, \quad \text{Var}_{\mathcal{Y}|\mathcal{X}}\big(\hat{f}(x_0)\big) = \sigma^2 S^T S.$$
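
These two closed forms can be checked directly by simulation. The sketch below (a rough check; the $k$-NN weight vector, the toy $f = \cos$, and the parameter values are assumptions) compares the Monte Carlo conditional bias and variance at $x_0$ with $f(x_0) - S^T\mathbf{f}$ and $\sigma^2 S^TS$.

```python
import numpy as np

rng = np.random.default_rng(3)
N, k, sigma, x0, reps = 40, 5, 0.4, 0.3, 50000
f = np.cos                                   # toy regression function (assumption)

x = rng.uniform(-2, 2, size=N)               # design held fixed: everything is conditional on X
nn = np.argsort(np.abs(x - x0))[:k]
S = np.zeros(N)                              # weight vector (ell_1, ..., ell_N)^T, here for k-NN
S[nn] = 1.0 / k
f_vec = f(x)                                 # (f(x_1), ..., f(x_N))^T

bias_formula = f(x0) - S @ f_vec             # f(x_0) - S^T f
var_formula = sigma**2 * S @ S               # sigma^2 S^T S

# Monte Carlo over the noise only
preds = np.array([S @ (f_vec + sigma * rng.normal(size=N)) for _ in range(reps)])
print(bias_formula, f(x0) - preds.mean())    # conditional bias: formula vs. simulation
print(var_formula, preds.var())              # conditional variance: formula vs. simulation
```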

Assume that $SS^T$ is non-singular and note that $S^TS$ is a scalar; then

$$
\begin{aligned}
\Big[\text{Bias}_{\mathcal{Y}|\mathcal{X}}\big(\hat{f}(x_0)\big)\Big]^2
&= \big(f(x_0) - S^T\mathbf{f}\big)^T\big(f(x_0) - S^T\mathbf{f}\big)\\
&= f(x_0)^2 - 2f(x_0)S^T\mathbf{f} + \mathbf{f}^T S S^T\mathbf{f}\\
&= f(x_0)^2 - 2f(x_0)S^T\mathbf{f} + \mathbf{f}^T S S^T S S^T (SS^T)^{-1}\mathbf{f}\\
&= f(x_0)^2 - 2f(x_0)S^T\mathbf{f} + \frac{\text{Var}_{\mathcal{Y}|\mathcal{X}}\big(\hat{f}(x_0)\big)}{\sigma^2}\,\mathbf{f}^T\mathbf{f}\\
&= f(x_0)^2 - 2f(x_0)\Big(f(x_0) - \text{Bias}_{\mathcal{Y}|\mathcal{X}}\big(\hat{f}(x_0)\big)\Big) + \frac{\mathbf{f}^T\mathbf{f}}{\sigma^2}\,\text{Var}_{\mathcal{Y}|\mathcal{X}}\big(\hat{f}(x_0)\big).
\end{aligned}
$$

This is the relationship between the squared bias and the variance in the conditional case.

For (c), similar arguments follow by integrating the terms above against the joint density of $x_1, \ldots, x_N$, that is, $h(x_1)\cdots h(x_N)\,dx_1\cdots dx_N$.