Ex. 17.9
Suppose that we have a Gaussian graphical model in which some or all of the data at some vertices are missing.
(a) Consider the EM algorithm for a dataset of N i.i.d. multivariate observations \(x_i\in\mathbb{R}^p\) with mean \(\mu\) and covariance matrix \(\bm{\Sigma}\). For each sample \(i\), let \(o_i\) and \(m_i\) index the predictors that are observed and missing, respectively. Show that in the E step, the observations are imputed from the current estimates of \(\mu\) and \(\bm{\Sigma}\):
\[
\hat x_{i,m_i} = E(x_{i,m_i}\mid x_{i,o_i},\theta) = \hat\mu_{m_i} + \hat{\bm{\Sigma}}_{m_i,o_i}\hat{\bm{\Sigma}}_{o_i,o_i}^{-1}(x_{i,o_i}-\hat\mu_{o_i}),
\]
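while in the M step, \(\mu\) and \(\bm{\Sigma}\) are re-estimated from the empirical mean and (modified) covariance of the imputed data:
\[
\hat\mu_j = \frac{1}{N}\sum_{i=1}^N \hat x_{ij},\qquad
\hat\Sigma_{jj'} = \frac{1}{N}\sum_{i=1}^N\left[(\hat x_{ij}-\hat\mu_j)(\hat x_{ij'}-\hat\mu_{j'}) + c_{i,jj'}\right],
\]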
where \(c_{i,jj'}=\hat\Sigma_{jj'}\) if \(j, j'\in m_i\) and zero otherwise. Explain the reason for the correction term \(c_{i, jj'}\) (\cite{little2019statistical}).
(b) Implement the EM algorithm for the Gaussian graphical model using the modified regression procedure from Exercise 17.7 for the M-step.
(c) For the flow cytometry data on the book website, set the data for the last protein Jnk in the first 1000 observations to missing, fit the model of Figure 17.1, and compare the predicted values to the actual values for Jnk. Compare the results to those obtained from a regression of Jnk on the other vertices with edges to Jnk in Figure 17.1, using only the non-missing data.
Soln. 17.9
(a) In the E step, the imputed values of the missing variables follow directly from equation (17.16) in the text. In the M step, \(\hat\mu\) is the average of the imputed observations: if \(x_{ij}\) is observed, \(\hat x_{ij} = x_{ij}\); otherwise the conditional mean \(E[x_{ij}\mid x_{i,o_i}, \theta]\) is used.
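For \(\bm{\Sigma}\), recall from (17.11) in the text that the maximum likelihood estimate is the (modified) covariance \(\mathbf{S}\) (see, e.g., (17.10)). The correction term \(c_{i,jj'}\) arises because the E step needs the conditional expectation of the product \(x_{ij}x_{ij'}\), not the product of the conditional expectations. When both \(j\) and \(j'\) are missing in sample \(i\), the two differ by the conditional covariance of \(x_{ij}\) and \(x_{ij'}\) given the observed entries, so plugging in the imputed values alone would understate the variability.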
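The E and M steps of part (a) can be sketched in NumPy as follows. This is a minimal illustration for the saturated (complete-graph) model, not the original solution code. The function name `em_gaussian_missing`, the convention that missing entries are stored as NaN, and the starting values are assumptions made here. The correction term is taken to be the conditional covariance of the missing block given the observed block, which is how it is usually written; treat that reading of \(c_{i,jj'}\) as an assumption too.

```python
import numpy as np

def em_gaussian_missing(X, n_iter=100, tol=1e-8):
    """EM for an unconstrained Gaussian; missing entries of X are NaN (assumed convention)."""
    X = np.asarray(X, dtype=float)
    N, p = X.shape
    miss = np.isnan(X)
    # Assumed starting point: observed-data means and a diagonal covariance.
    mu = np.nanmean(X, axis=0)
    Sigma = np.diag(np.nanvar(X, axis=0))
    for _ in range(n_iter):
        X_hat = X.copy()
        C = np.zeros((p, p))  # sum over samples of the correction terms
        for i in range(N):
            m = miss[i]
            if not m.any():
                continue  # fully observed row: nothing to impute
            o = ~m
            if not o.any():
                # Nothing observed: the conditional law is the marginal one.
                X_hat[i] = mu
                C += Sigma
                continue
            # K = Sigma_mo Sigma_oo^{-1}, computed with a linear solve.
            K = np.linalg.solve(Sigma[np.ix_(o, o)], Sigma[np.ix_(o, m)]).T
            # E step: conditional mean of the missing block.
            X_hat[i, m] = mu[m] + K @ (X[i, o] - mu[o])
            # Correction: conditional covariance of the missing block.
            C[np.ix_(m, m)] += Sigma[np.ix_(m, m)] - K @ Sigma[np.ix_(o, m)]
        # M step: empirical mean and corrected covariance of the imputed data.
        mu_new = X_hat.mean(axis=0)
        R = X_hat - mu_new
        Sigma_new = (R.T @ R + C) / N
        change = max(np.abs(mu_new - mu).max(), np.abs(Sigma_new - Sigma).max())
        mu, Sigma = mu_new, Sigma_new
        if change < tol:
            break
    return mu, Sigma, X_hat
```

Calling `mu_hat, Sigma_hat, X_hat = em_gaussian_missing(X)` on an array with NaN entries returns the parameter estimates together with the last imputed data matrix. Observed entries of `X_hat` are left untouched, and only rows with missing entries contribute to the correction matrix `C`.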
(b) If the graph is complete, the updates derived in (a) give the EM algorithm directly. When the graph is not complete, the M step must instead estimate \(\bm{\Sigma}\) with Algorithm 17.1, as implemented in Ex. 17.7.
(c) The 1000 values imputed by the EM algorithm have a mean of -38.65, while the true mean is -33.20; the mean squared error is around 1972.32.
Code
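For part (b), the constrained M step can be sketched as below: a modified regression in the style of Algorithm 17.1 that takes a covariance matrix \(\mathbf{S}\) (in the EM setting, the corrected covariance of the imputed data) and a graph, and returns a covariance estimate whose inverse is zero on the missing edges. This is a hedged sketch rather than the original listing. The function name `fit_ggm`, the boolean adjacency-matrix input, and the stopping rule are assumptions introduced here.

```python
import numpy as np

def fit_ggm(S, adj, max_sweeps=500, tol=1e-10):
    """Modified regression for a Gaussian graphical model with known structure.

    S   : (p, p) covariance matrix to be matched on the edges
    adj : (p, p) symmetric boolean matrix, True where an edge is present
    Returns (W, Theta): covariance estimate and its inverse.
    """
    S = np.asarray(S, dtype=float)
    adj = np.asarray(adj, dtype=bool)
    p = S.shape[0]
    W = S.copy()                      # step 1: initialise W = S
    betas = np.zeros((p, p - 1))
    for _ in range(max_sweeps):       # step 2: cycle over vertices until stable
        W_old = W.copy()
        for j in range(p):
            rest = np.delete(np.arange(p), j)
            nb = adj[j, rest]         # neighbours of j among the other vertices
            beta = np.zeros(p - 1)    # zeros stay in place for the missing edges
            W11 = W[np.ix_(rest, rest)]
            if nb.any():
                # Reduced system: only the rows and columns of the neighbours.
                beta[nb] = np.linalg.solve(W11[np.ix_(nb, nb)], S[rest[nb], j])
            w12 = W11 @ beta
            W[rest, j] = w12
            W[j, rest] = w12
            betas[j] = beta
        if np.abs(W - W_old).max() < tol:
            break
    # Step 3: recover the precision matrix column by column.
    Theta = np.zeros((p, p))
    for j in range(p):
        rest = np.delete(np.arange(p), j)
        theta22 = 1.0 / (S[j, j] - W[rest, j] @ betas[j])
        Theta[j, j] = theta22
        Theta[rest, j] = -betas[j] * theta22
    return W, Theta
```

As a small check under these assumptions, take the 4-vertex cycle with edges 1-2, 2-3, 3-4, 4-1 and a covariance matrix with 10 on the diagonal and off-diagonal entries 1, 5, 4, 2, 6, 3 (rows read left to right above the diagonal). The fitted `W` agrees with `S` on the diagonal and on the edges, while the two non-edge entries should come out near 1.31 and 0.87 and the corresponding entries of `Theta` are zero. Replacing the unconstrained covariance update in the M step by a call to `fit_ggm` gives one possible EM iteration for the graphical model.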