Ex. 10.6

McNemar test (An Introduction to Categorical Data Analysis). We support the test error rates on the spam data to be 5.5% for a generalized additive model (GAM), and 4.5% for gradient boosting (GBM), with a test sample of size 1536.

(a) Show that the standard error of these estimates is about 0.6%. Since the same test data are used for both methods, the error rates are correlated, and we cannot perform a two-sample t-test. We can compare the methods directly on each test observation, leading to the summary in Table below.

GAM	GBM
	Correct	Error
Correct	1434	18
Error	33	51

The McNemar test focuses on the discordant errors, 33 vs. 18.

(b) Conduct a test to show that GAM makes significantly more errors than gradient boosting, with a two-sided \(p\)-value of 0.036.

Soln. 10.6

(a) The standard error for a binomial estimate is \(\sqrt{p(1-p)/n}\). It's straightforward to verify that

\[\begin{eqnarray} \sqrt{\frac{0.055\cdot(1-0.055)}{1536}} \approx 0.6\%\non\\ \sqrt{\frac{0.045\cdot(1-0.045)}{1536}} \approx 0.6\%\non \end{eqnarray}\]

(b) The McNemar test statistic is

\[\begin{equation} \chi^2 = \frac{(33-18)^2}{33+18}\approx 4.41,\non \end{equation}\]

which has a 1 degree of freedom, thus the \(p-\)value is 0.036.

Code

from statsmodels.stats.contingency_tables import mcnemar

# define contingency table
table = [[1434, 18],
        [33, 51]]

result = mcnemar(table, exact=False, correction=False)

print('statistic=%.3f, p-value=%.3f' % (result.statistic, result.pvalue))

# interpret the p-value
alpha = 0.05
if result.pvalue > alpha:
    print('Same proportions of errors (fail to reject H0)')
else:
    print('Different proportions of errors (reject H0)')