Ex. 17.9

Suppose that we have a Gaussian graphical model in which some or all of the data at some vertices are missing.

(a) Consider the EM algorithm for a dataset of N i.i.d. multivariate observations \(x_i\in\mathbb{R}^p\) with mean \(\mu\) and covariance matrix \(\bm{\Sigma}\). For each sample \(i\), let \(o_i\) and \(m_i\) index the predictors that are observed and missing, respectively. Show that in the E step, the observations are imputed from the current estimates of \(\mu\) and \(\bm{\Sigma}\):

\[\begin{equation} \hat x_{i,m_i} = E(x_{i,m_i}|x_{i, o_i}, \theta) = \hat{\mu}_{m_i} + \hat\Sigma_{m_i,o_i}\hat\Sigma_{o_i,o_i}^{-1}(x_{i,o_i}-\hat\mu_{o_i})\non \end{equation}\]

while in the M step, \(\mu\) and \(\bm{\Sigma}\) are re-estimated from the empirical mean and (modified) covariance of the imputed data:

\[\begin{eqnarray} \hat\mu_{j} &=& \sum_{i=1}^N\hat x_{ij}/N\non\\ \hat\Sigma_{jj'} &=& \sum_{i=1}^N[(\hat x_{ij}-\hat\mu_{ij})(\hat x_{ij'}-\hat \mu_{j'}) + c_{i,jj'}]/N\non \end{eqnarray}\]

where \(c_{i,jj'}=\hat\Sigma_{jj'}\) if \(j, j'\in m_i\) and zero otherwise. Explain the reason for the correction term \(c_{i, jj'}\) (\cite{little2019statistical}).

(b) Implement the EM algorithm for the Gaussian graphical model using the modified regression procedure from Exercise 17.7 for the M-step.

(c) For the flow cytometry data on the book website, set the data for the last protein Jnk in the first 1000 observations to missing, fit the model of Figure 17.1, and compare the predicted values to the actual values for Jnk. Compare the results to those obtained from a regression of Jnk on the other vertices with edges to Jnk in Figure 17.1, using only the non-missing data.

Soln. 17.9

(a) In the E-Step, the imputed estimate for missing variables follows directly from equation (17.16) in the text. In the M-Step, it's easy to see that \(\hat\mu\) is the average of imputed observations. Specifically, if \(x_{ij}\) is not missing, \(\hat x_{ij} = x_{ij}\), otherwise, the imputed (conditional mean) \(E[x_{ij}|x_{i, o}, \theta]\) is used.

For \(\Sigma\), recall (17.11) in the text, the maximum likelihood estimate is simply the (modified) covariance \(\bb{S}\) (see, e.g., (17.10)), where the additional correction term \(c_{i, jk}\) results from the imputation of missing values by their conditional expectations.

(b) Note that if the graph is complete, then we could just use equations derived in (a) for the EM algorithm. When the graph is not complete, we need to use Algorithm 17.1 implemented in Ex. 17.7 to estimate \(\Sigma\).

(c) The imputed 1000 values from EM algorithm has a mean of -38.65, while the true mean is -33.20, the mean square error is around 1972.32.

Code

import pandas as pd
import numpy as np
import pathlib
from sklearn.metrics import mean_squared_error
from GraphicalGaussian import GraphicalGaussian, GraphicalGaussianEM

# get relative data folder
PATH = pathlib.Path(__file__).resolve().parents[1]
DATA_PATH = PATH.joinpath("data").resolve()

# data
X = pd.read_csv(DATA_PATH.joinpath("cytometry_data.csv"), header=0)
X = np.asarray(X)
"""
Gamma: represents network in Figure 17.1
    X1 - X11 are in the following order
    Raf, Mek, Plcg, PIP2, PIP3, Erk, Akt, PKA, PKC, P38, Jnk
        -0.55 & 0.36 & 0 & 0 & 0 & 0 & -0.0048 & 0.00046 & 0 & -6.5 & 0
    if two nodes i, j are connected, Gamma[i][j] = 0, else Gamma[i][j] = 1 
"""
Gamma = pd.read_csv(DATA_PATH.joinpath("cytometry_gamma.csv"), header=0)
Gamma = np.asarray(Gamma)

# set first 1000 Jnk as NaN
X_jnk = X.copy()
X_jnk[np.arange(1000), -1] = np.nan

model_EM = GraphicalGaussianEM()
model_EM.fit(X_jnk, Gamma)
cov_EM = model_EM.covariance_
est_X = model_EM.imputed_X

est_values, true_values = est_X[np.arange(1000), -1], X[np.arange(1000), -1]
print('Estimated mean is {} vs actual mean is {}'.format(np.mean(est_values), np.mean(true_values)))
print('MSE is {}'.format(mean_squared_error(est_values, true_values)))

# drop missing value
X_drop = X.copy()
X_drop = X[1000:, :]

model = GraphicalGaussian(verbose=True)
S = np.cov(X_drop, rowvar=False)
model.fit(S, Gamma)

print('The norm of the difference between to estimated covariance matrix is {}'
    .format(np.linalg.norm(model.covariance_ - cov_EM)))