Ex. 2.8

Compare the classification performance of linear regression and \(k\)-nearest neighbor classification on the zipcode data. In particular, consider only the 2's and 3's, and \(k = 1, 3, 5, 7,\) and \(15\). Show both the training and test error for each choice. The zipcode data are available from the website.

Soln. 2.8

We summarize the training and test error rates obtained from our experiments in the table below.

Model                 Train Error Rate   Test Error Rate
Linear Regression     0.58%              4.12%
1-NN                  0.00%              2.47%
3-NN                  0.50%              3.02%
5-NN                  0.58%              3.02%
7-NN                  0.65%              3.30%
15-NN                 0.94%              3.85%

We see that as the number of neighbors \(k\) increases, both the training and test error rates tend to increase. In the extreme case \(k=1\), the training error is 0 as expected, since each training point is its own nearest neighbor. Note also that every \(k\)-NN model considered here achieves a lower test error than linear regression.
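For linear regression, the fitted values are continuous, so they are converted to class labels by thresholding at 0.5 (with digit '2' coded as 0 and digit '3' coded as 1); the reported error rate is then simply the fraction of mismatched labels. A minimal sketch of this rule on made-up numbers (the arrays below are illustrative, not taken from the zipcode data):

import numpy as np

# hypothetical continuous fitted values and the true 0/1 labels
fitted = np.array([0.12, 0.81, 0.47, 0.63])
truth = np.array([0, 1, 1, 1])

# threshold the fitted values at 0.5 to obtain class labels
labels = (fitted >= 0.5).astype(int)   # [0, 1, 0, 1]

# misclassification rate: fraction of predictions disagreeing with the truth
error_rate = np.mean(labels != truth)  # 0.25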

Code
import pathlib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

# get relative data folder
PATH = pathlib.Path(__file__).resolve().parents[1]
DATA_PATH = PATH.joinpath("data").resolve()

# get original data
train = np.genfromtxt(DATA_PATH.joinpath("zip_train_2and3.csv"), dtype=float, delimiter=',', skip_header=True)
test = np.genfromtxt(DATA_PATH.joinpath("zip_test_2and3.csv"), dtype=float, delimiter=',', skip_header=True)

# prepare training and testing data
x_train, y_train = train[:, 1:], train[:, 0]
x_test, y_test = test[:, 1:], test[:, 0]

# for classification purposes,
# we code digit '3' as 1 and digit '2' as 0
y_train[y_train == 3] = 1
y_train[y_train == 2] = 0
y_test[y_test == 3] = 1
y_test[y_test == 2] = 0


# a utility function to convert continuous predictions
# into 0/1 class labels by thresholding at 0.5
def assign(arr):
    arr[arr >= 0.5] = 1
    arr[arr < 0.5] = 0


# a utility function to calculate error rate
# of predictions
def getErrorRate(a, b):
    if a.size != b.size:
        raise ValueError('Expect input arrays have equal size, a has {}, b has {}'.
                        format(a.size, b.size))

    if a.size == 0:
        raise ValueError('Expect non-empty input arrays')

    return np.sum(a != b) / a.size


# Linear Regression
reg = LinearRegression().fit(x_train, y_train)
pred_test = reg.predict(x_test)
assign(pred_test)
print('Test error rate of Linear Regression is {:.2%}'
    .format(getErrorRate(pred_test, y_test)))


pred_train = reg.predict(x_train)
assign(pred_train)
print('Train error rate of Linear Regression is {:.2%}'
    .format(getErrorRate(pred_train, y_train)))


# run a separate k-NN classifier for each k requested in the exercise
for k in (1, 3, 5, 7, 15):
    # fit the model
    neigh = KNeighborsClassifier(n_neighbors=k)
    neigh.fit(x_train, y_train)

    # test error
    pred_knn_test = neigh.predict(x_test)  # predict() already returns 0/1 class labels
    test_error_rate = getErrorRate(pred_knn_test, y_test)
    # train error
    pred_knn_train = neigh.predict(x_train)
    train_error_rate = getErrorRate(pred_knn_train, y_train)

    print('k-NN Model: k is {}, train/test error rates are {:.2%} and {:.2%}'
        .format(k, train_error_rate, test_error_rate))
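
The listing above assumes the 2's and 3's have already been extracted into zip_train_2and3.csv and zip_test_2and3.csv. The raw zipcode files from the ESL website (zip.train and zip.test) contain all ten digits, with the digit label in the first column followed by 256 grey-scale pixel values; a minimal sketch of one way to produce those CSV files is shown below (the file paths and header names are assumptions, not part of the original data distribution):

import numpy as np

# assumed paths to the raw zipcode files downloaded from the ESL website;
# each whitespace-separated row is the digit label followed by 256 pixel intensities
raw_train = np.genfromtxt("zip.train", dtype=float)
raw_test = np.genfromtxt("zip.test", dtype=float)

# keep only the rows whose label (first column) is 2 or 3
train_23 = raw_train[np.isin(raw_train[:, 0], [2, 3])]
test_23 = raw_test[np.isin(raw_test[:, 0], [2, 3])]

# write the CSV files expected by the solution code, with a single header row
header = ",".join(["label"] + ["p{}".format(i) for i in range(256)])
np.savetxt("zip_train_2and3.csv", train_23, delimiter=",", header=header, comments="")
np.savetxt("zip_test_2and3.csv", test_23, delimiter=",", header=header, comments="")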