Ex. 4.4

Consider the multilogit model with \(K\) classes (4.17) in the textbook. Let \(\beta\) be the \((p+1)(K-1)\)-vector consisting of all the coefficients. Define a suitably enlarged version of the input vector \(x\) to accommodate this vectorized coefficient matrix. Derive the Newton-Raphson algorithm for maximizing the multinomial log-likelihood.

Soln. 4.4

We follow the development in Section 4.4.1 in the textbook. As usual, we enlarge each observation \(x\in \mathbb{R}^p\) by prepending a constant \(1\) so that \(x^T \leftarrow (1, x^T)\). Thus we have

\[\begin{equation} P(G=k|X=x) = \frac{\exp(\beta_k^Tx)}{1 + \sum_{l=1}^{K-1}\exp(\beta_l^Tx)},\ \ k =1,...,K-1\non \end{equation}\]

and

\[\begin{equation} P(G=K|X=x) = \frac{1}{1 + \sum_{l=1}^{K-1}\exp(\beta_l^Tx)}.\non \end{equation}\]

Each \(\beta_k\in \mathbb{R}^{p+1}\) for \(k=1,...,K-1\). Denote \(\beta = (\beta_1^T, ..., \beta_{K-1}^T)^T\) so that \(\beta\) is the \((p+1)(K-1)\)-vector consisting of all the coefficients. Let

\[\begin{equation} \label{eq:4.4pkxi} p_k(x;\beta) = P(G=k|X=x).\non \end{equation}\]
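For concreteness, here is a minimal NumPy sketch evaluating \(p_k(x;\beta)\) for all observations at once; the function name `class_probs` and the array layout (rows of \(\bb{X}\) already carry the leading \(1\), row \(k\) of `beta` holds \(\beta_k\)) are our own conventions, not from the textbook:

```python
import numpy as np

def class_probs(X, beta):
    """p_k(x_i; beta) for k = 1, ..., K-1, with class K as the reference.

    X    : (N, p+1) array whose rows already include the leading 1.
    beta : (K-1, p+1) array; row k holds beta_{k+1} (0-based indexing).
    Returns an (N, K-1) array with entry (i, k) equal to p_{k+1}(x_i; beta).
    """
    E = np.exp(X @ beta.T)                        # entries exp(beta_k^T x_i)
    denom = 1.0 + E.sum(axis=1, keepdims=True)    # 1 + sum_l exp(beta_l^T x_i)
    return E / denom                              # p_K(x_i) is recovered as 1/denom
```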

The log-likelihood for \(N\) observations is

\[\begin{equation} l(\beta) = \sum_{i=1}^N\log p_{g_i}(x_i;\beta).\non \end{equation}\]

We code the classes from \(1\) to \(K\) and write \(y_i = g_i\) for the class label of observation \(i\), so that \(l(\beta)\) can be rewritten as

\[\begin{eqnarray} l(\beta) &=& \sum_{i=1}^N\left(\sum_{k=1}^{K-1}\bb{1}(y_i=k)\log p_k(x_i;\beta) - \bb{1}(y_i=K)\log \left(1+\sum_{l=1}^{K-1}e^{\beta_l^Tx_i}\right)\right)\non\\ &=&\sum_{i=1}^N\left(\sum_{k=1}^{K-1}\bb{1}(y_i=k)\beta_k^Tx_i-\log\big(1+\sum_{l=1}^{K-1}e^{\beta_l^Tx_i}\big)\right)\label{eq:multiclasslrlog} \end{eqnarray}\]

where the second equality uses \(\sum_{k=1}^{K}\bb{1}(y_i=k) = 1\) for each \(i\).
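A direct NumPy transcription of the second line, assuming labels coded \(1,\dots,K\) as above (again a sketch with our own naming):

```python
import numpy as np

def log_likelihood(X, y, beta):
    """Multinomial log-likelihood, second line of the display above.

    X    : (N, p+1) inputs with a leading 1 in each row.
    y    : (N,) integer labels coded 1, ..., K.
    beta : (K-1, p+1) coefficient matrix.
    """
    K = beta.shape[0] + 1
    eta = X @ beta.T                               # (N, K-1), entries beta_k^T x_i
    log_denom = np.log1p(np.exp(eta).sum(axis=1))  # log(1 + sum_l exp(beta_l^T x_i))
    lin = np.zeros(X.shape[0])
    mask = y < K                                   # class K contributes no linear term
    lin[mask] = eta[mask, y[mask] - 1]             # beta_{y_i}^T x_i (labels are 1-based)
    return np.sum(lin - log_denom)
```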

To maximize the log-likelihood, we set its derivatives to zero. These score equations, for \(k=1,...,K-1\), are

\[\begin{eqnarray} \frac{\partial l(\beta)}{\partial \beta_k} &=& \sum_{i=1}^N \left(\bb{1}(y_i=k)x_i - \frac{x_ie^{\beta_k^Tx_i}}{1+\sum_{l=1}^{K-1}e^{\beta_l^Tx_i}}\right)\non\\ &=&\sum_{i=1}^N x_i\left(\bb{1}(y_i=k) - \frac{e^{\beta_k^Tx_i}}{1+\sum_{l=1}^{K-1}e^{\beta_l^Tx_i}}\right)\in \mathbb{R}^{p+1}\label{eq:multiclasslr1st} \end{eqnarray}\]

We write

\[\begin{equation} \label{eq:multiclasslr1stmat} \frac{\partial l(\beta)}{\partial \beta} = \begin{pmatrix} \frac{\partial l(\beta)}{\partial \beta_1}\\ \vdots\\ \frac{\partial l(\beta)}{\partial \beta_{K-1}} \end{pmatrix}\in \mathbb{R}^{(p+1)(K-1)}. \end{equation}\]
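The stacked score \(\eqref{eq:multiclasslr1stmat}\) reduces to a single matrix product; a sketch under the same conventions as above:

```python
import numpy as np

def score(X, y, beta):
    """Stacked score vector dl/dbeta of length (p+1)(K-1)."""
    K = beta.shape[0] + 1
    E = np.exp(X @ beta.T)
    P = E / (1.0 + E.sum(axis=1, keepdims=True))       # (N, K-1), p_k(x_i; beta)
    Y = (y[:, None] == np.arange(1, K)).astype(float)  # (N, K-1) indicators 1(y_i = k)
    G = X.T @ (Y - P)                                  # (p+1, K-1); column k is dl/dbeta_k
    return G.T.ravel()                                 # stack the K-1 blocks
```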

We next compute the second-order derivatives of \(l(\beta)\). By \(\eqref{eq:multiclasslr1st}\), for \(k\neq j \in \{1,...,K-1\}\), we get

\[\begin{eqnarray} \frac{\partial^2 l(\beta)}{\partial \beta_k\partial \beta_j^T} &=& \sum_{i=1}^N x_i\left(\frac{e^{\beta_k^Tx_i}e^{\beta_j^Tx_i}x_i^T}{(1+\sum_{l=1}^{K-1}e^{\beta_l^Tx_i})^2}\right)\non\\ &=& \sum_{i=1}^Nx_ix_i^T\frac{e^{\beta_k^Tx_i}}{1+\sum_{l=1}^{K-1}e^{\beta_l^Tx_i}}\cdot \frac{e^{\beta_j^Tx_i}}{1+\sum_{l=1}^{K-1}e^{\beta_l^Tx_i}}\non\\ &=& \sum_{i=1}^Nx_ix_i^Tp_k(x_i;\beta)p_j(x_i;\beta).\label{eq:multiclasslr2nd} \end{eqnarray}\]

Note the positive sign: the minus sign in front of \(e^{\beta_k^Tx_i}\) in the score cancels against the minus sign produced by differentiating the denominator with respect to \(\beta_j\); only the diagonal blocks of the Hessian carry a minus sign.

For \(k\in \{1,..., K-1\}\), we have

\[\begin{eqnarray} \frac{\partial^2 l(\beta)}{\partial \beta_k\partial \beta_k^T} &=&-\sum_{i=1}^Nx_ix_i^T \frac{e^{\beta_k^Tx_i}\left(1+\sum_{l=1}^{K-1}e^{\beta_l^Tx_i}\right)-e^{2\beta_k^Tx_i}}{(1+\sum_{l=1}^{K-1}e^{\beta_l^Tx_i})^2}\non\\ &=&-\sum_{i=1}^Nx_ix_i^T\frac{e^{\beta_k^Tx_i}}{1+\sum_{l=1}^{K-1}e^{\beta_l^Tx_i}}\cdot\left(1-\frac{e^{\beta_k^Tx_i}}{1+\sum_{l=1}^{K-1}e^{\beta_l^Tx_i}}\right)\non\\ &=&-\sum_{i=1}^Nx_ix_i^Tp_k(x_i;\beta)p^c_k(x_i;\beta),\label{eq:multiclasslr2nd2} \end{eqnarray}\]

where \(p^c_k(x_i;\beta) = 1 - p_k(x_i;\beta)\).

By \(\eqref{eq:multiclasslr2nd}\) and \(\eqref{eq:multiclasslr2nd2}\), we write

\[\begin{equation} \frac{\partial^2l(\beta)}{\partial\beta\partial\beta^T} =\begin{pmatrix} \frac{\partial^2l(\beta)}{\partial\beta_1\partial\beta_1^T} & \frac{\partial^2l(\beta)}{\partial\beta_1\partial\beta_2^T} & \cdots & \frac{\partial^2l(\beta)}{\partial\beta_1\partial\beta_{K-1}^T}\\ \frac{\partial^2l(\beta)}{\partial\beta_2\partial\beta_1^T} & \frac{\partial^2l(\beta)}{\partial\beta_2\partial\beta_2^T} & \cdots & \frac{\partial^2l(\beta)}{\partial\beta_2\partial\beta_{K-1}^T}\\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2l(\beta)}{\partial\beta_{K-1}\partial\beta_1^T} & \frac{\partial^2l(\beta)}{\partial\beta_{K-1}\partial\beta_2^T} & \cdots & \frac{\partial^2l(\beta)}{\partial\beta_{K-1}\partial\beta_{K-1}^T} \end{pmatrix}\in \mathbb{R}^{(K-1)(p+1)\times (K-1)(p+1)}.\non \end{equation}\]
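Assembling this matrix block by block, with the sign pattern just derived (diagonal blocks \(-p_kp_k^c\), off-diagonal blocks \(+p_kp_j\)); a sketch:

```python
import numpy as np

def hessian(X, beta):
    """Full (p+1)(K-1) x (p+1)(K-1) Hessian, assembled block by block."""
    N, d = X.shape
    Km1 = beta.shape[0]
    E = np.exp(X @ beta.T)
    P = E / (1.0 + E.sum(axis=1, keepdims=True))   # (N, K-1), p_k(x_i; beta)
    H = np.zeros((d * Km1, d * Km1))
    for k in range(Km1):
        for j in range(Km1):
            if k == j:
                w = -P[:, k] * (1.0 - P[:, k])     # -p_k p_k^c on the diagonal
            else:
                w = P[:, k] * P[:, j]              # +p_k p_j off the diagonal
            H[k*d:(k+1)*d, j*d:(j+1)*d] = X.T @ (w[:, None] * X)
    return H
```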

Starting with \(\beta^{\text{old}}\), a single Newton update is

\[\begin{equation} \beta^{\text{new}} = \beta^{\text{old}} - \left(\frac{\partial^2l(\beta)}{\partial\beta\partial\beta^T}\right)^{-1}\frac{\partial l(\beta)}{\partial \beta}\non \end{equation}\]

where the derivatives are evaluated at \(\beta^{\text{old}}\).
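Putting the pieces together, a minimal Newton-Raphson loop that reuses the `score` and `hessian` sketches above, starting from \(\beta=0\) and solving a linear system rather than forming the inverse:

```python
import numpy as np

def newton_raphson(X, y, K, n_iter=50, tol=1e-8):
    """Newton-Raphson for the multinomial log-likelihood (a sketch)."""
    d = X.shape[1]
    beta = np.zeros((K - 1, d))                    # beta_old, initialized at zero
    for _ in range(n_iter):
        g = score(X, y, beta)                      # gradient at beta_old
        H = hessian(X, beta)                       # Hessian at beta_old
        step = np.linalg.solve(H, g)               # solve H s = g
        beta = beta - step.reshape(K - 1, d)       # beta_new = beta_old - H^{-1} g
        if np.linalg.norm(step) < tol:             # stop when the update is tiny
            break
    return beta
```

In practice one would add a step-halving safeguard, since a full Newton step can overshoot when the classes are close to separable.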

Next we write the score and the Hessian in matrix notation. Define, for \(k=1,...,K-1\),

\[\begin{equation} \bb{y}_k = \begin{pmatrix} \bb{1}(y_1=k)\\ \bb{1}(y_2=k)\\ \vdots\\ \bb{1}(y_N=k) \end{pmatrix}, \ \ \bb{X} = \begin{pmatrix} x_1^T\\ x_2^T\\ \vdots\\ x_{N}^T \end{pmatrix}, \ \ \bb{p}_k= \begin{pmatrix} p_k(x_1;\beta)\\ p_k(x_2;\beta)\\ \vdots\\ p_k(x_N;\beta) \end{pmatrix}.\non \end{equation}\]

Then \(\eqref{eq:multiclasslr1st}\) is written in matrix notation as

\[\begin{equation} \frac{\partial l(\beta)}{\partial \beta_k} = \bb{X}^T(\bb{y}_k - \bb{p}_k).\non \end{equation}\]

Further define the stacked vectors

\[\begin{equation} \bb{y} = \begin{pmatrix} \bb{y}_1\\ \bb{y}_2\\ \vdots\\ \bb{y}_{K-1} \end{pmatrix}\ \ \text{and} \ \ \bb{p} = \begin{pmatrix} \bb{p}_1\\ \bb{p}_2\\ \vdots\\ \bb{p}_{K-1} \end{pmatrix}.\non \end{equation}\]

We can then rewrite the stacked score \(\eqref{eq:multiclasslr1stmat}\) as

\[\begin{equation} \label{eq:4.4mat1st} \frac{\partial l(\beta)}{\partial \beta} = \begin{pmatrix} \bb{X}^T & 0 & \cdots & 0\\ 0 & \bb{X}^T & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \bb{X}^T \end{pmatrix} \begin{pmatrix} \bb{y}_1 - \bb{p}_1\\ \bb{y}_2 - \bb{p}_2\\ \vdots\\ \bb{y}_{K-1} - \bb{p}_{K-1} \end{pmatrix} = \boldsymbol{\hat X}^T (\bb{y} - \bb{p}) \end{equation}\]

where \(\boldsymbol{\hat X}\in \mathbb{R}^{N(K-1)\times (p+1)(K-1)}\) is the block-diagonal matrix with \(\bb{X}\) repeated \(K-1\) times along the diagonal, so that \(\boldsymbol{\hat X}^T\) is exactly the block-diagonal matrix displayed above.
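Since \(\boldsymbol{\hat X} = I_{K-1}\otimes \bb{X}\) is a Kronecker product, identity \(\eqref{eq:4.4mat1st}\) is easy to check numerically. A small sketch on made-up data (all names and sizes are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, K = 50, 3, 4
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # leading column of 1s
y = rng.integers(1, K + 1, size=N)                          # labels 1, ..., K
beta = rng.normal(size=(K - 1, p + 1))

E = np.exp(X @ beta.T)
P = E / (1.0 + E.sum(axis=1, keepdims=True))                # p_k(x_i; beta)
Y = (y[:, None] == np.arange(1, K)).astype(float)           # indicators 1(y_i = k)

Xhat = np.kron(np.eye(K - 1), X)                            # block-diagonal hat X
lhs = Xhat.T @ (Y - P).T.ravel()                            # hat X^T (y - p)
rhs = (X.T @ (Y - P)).T.ravel()                             # blocks X^T(y_k - p_k), stacked
assert np.allclose(lhs, rhs)
```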

For \(k=1,...,K-1\), let

\[\begin{equation} \bb{P}_k = \begin{pmatrix} p_k(x_1;\beta)p_k^c(x_1;\beta) & 0 & \cdots & 0\\ 0 & p_k(x_2;\beta)p_k^c(x_2;\beta) & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & p_k(x_N; \beta)p_k^c(x_N;\beta) \end{pmatrix}.\non \end{equation}\]

By \(\eqref{eq:multiclasslr2nd2}\), we have for \(k=1,...,K-1\) that

\[\begin{equation} \frac{\partial^2 l(\beta)}{\partial \beta_k\partial\beta^T_k} = - \bb{X}^T\bb{P}_k\bb{X}.\non \end{equation}\]

For \(k=1,...,K-1\), let

\[\begin{equation} \bb{R}_k = \begin{pmatrix} p_k(x_1;\beta) & 0 & \cdots & 0\\ 0 & p_k(x_2;\beta) & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & p_k(x_N; \beta) \end{pmatrix}.\non \end{equation}\]

By \(\eqref{eq:multiclasslr2nd}\), we have for \(k\neq j\in \{1,...,K-1\}\) that

\[\begin{equation} \frac{\partial^2 l(\beta)}{\partial \beta_k\partial\beta^T_j} = \bb{X}^T\bb{R}_k\bb{R}_j\bb{X}.\non \end{equation}\]

Therefore we have

\[\begin{equation} \frac{\partial^2 l(\beta)}{\partial\beta\partial\beta^T} = -\begin{pmatrix} \bb{X}^T\bb{P}_1\bb{X}& -\bb{X}^T\bb{R}_1\bb{R}_2\bb{X}&\cdots&-\bb{X}^T\bb{R}_{1}\bb{R}_{K-1}\bb{X}\\ -\bb{X}^T\bb{R}_2\bb{R}_1\bb{X}& \bb{X}^T\bb{P}_2\bb{X}&\cdots&-\bb{X}^T\bb{R}_{2}\bb{R}_{K-1}\bb{X}\\ \vdots & \vdots &\ddots &\vdots\\ -\bb{X}^T\bb{R}_{K-1}\bb{R}_1\bb{X}& -\bb{X}^T\bb{R}_{K-1}\bb{R}_2\bb{X}&\cdots&\bb{X}^T\bb{P}_{K-1}\bb{X} \end{pmatrix}.\non \end{equation}\]

Let

\[\begin{equation} \bb{W} = \begin{pmatrix} \bb{P}_1& -\bb{R}_1\bb{R}_2&\cdots&-\bb{R}_{1}\bb{R}_{K-1}\\ -\bb{R}_2\bb{R}_1& \bb{P}_2&\cdots&-\bb{R}_{2}\bb{R}_{K-1}\\ \vdots & \vdots &\ddots &\vdots\\ -\bb{R}_{K-1}\bb{R}_1& -\bb{R}_{K-1}\bb{R}_2&\cdots&\bb{P}_{K-1} \end{pmatrix}.\non \end{equation}\]
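A sketch that materializes \(\bb{W}\) from the \(N\times(K-1)\) probability matrix (function name ours); note the minus signs on the off-diagonal blocks:

```python
import numpy as np

def build_W(P):
    """Assemble the N(K-1) x N(K-1) weight matrix W.

    P : (N, K-1) array with columns p_k(x_i; beta).
    Diagonal blocks are P_k = diag(p_k p_k^c); off-diagonal
    blocks are -R_k R_j = -diag(p_k p_j).
    """
    N, Km1 = P.shape
    W = np.zeros((N * Km1, N * Km1))
    for k in range(Km1):
        for j in range(Km1):
            if k == j:
                w = P[:, k] * (1.0 - P[:, k])
            else:
                w = -P[:, k] * P[:, j]
            W[k*N:(k+1)*N, j*N:(j+1)*N] = np.diag(w)
    return W
```

With `Xhat = np.kron(np.eye(K - 1), X)` one can verify numerically that `-Xhat.T @ build_W(P) @ Xhat` agrees with the block-by-block `hessian` sketched earlier.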

Recalling \(\boldsymbol{\hat X}\) from \(\eqref{eq:4.4mat1st}\), we can rewrite the Hessian above as

\[\begin{equation} \label{eq:ex4.4hessianmatrix} \frac{\partial^2 l(\beta)}{\partial\beta\partial\beta^T} = - \boldsymbol{\hat X}^T\bb{W}\boldsymbol{\hat X}.\non \end{equation}\]

The Newton step is thus

\[\begin{eqnarray} \beta^{\text{new}} &=& \beta^{\text{old}} + (\boldsymbol{\hat X}^T\bb{W}\boldsymbol{\hat X})^{-1}\boldsymbol{\hat X}^T(\bb{y}-\bb{p})\non\\ &=&(\boldsymbol{\hat X}^T\bb{W}\boldsymbol{\hat X})^{-1}\boldsymbol{\hat X}^T\bb{W}(\boldsymbol{\hat X}\beta^{\text{old}} + \bb{W}^{-1}(\bb{y}-\bb{p}))\non\\ &=&(\boldsymbol{\hat X}^T\bb{W}\boldsymbol{\hat X})^{-1}\boldsymbol{\hat X}^T\bb{W}\bb{z}.\non \end{eqnarray}\]

In the second and third lines we have re-expressed the Newton step as a weighted least squares step, with the working response

\[\begin{equation} \bb{z} = \boldsymbol{\hat X}\beta^{\text{old}} + \bb{W}^{-1}(\bb{y}-\bb{p}).\non \end{equation}\]
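Finally, one full iteratively reweighted least squares step, literally following the three displays above and reusing `build_W`. This is a didactic sketch: forming \(\bb{W}\) explicitly costs \(O(N^2(K-1)^2)\) memory, so a real implementation would exploit its diagonal-block structure instead:

```python
import numpy as np

def irls_step(X, y, beta):
    """One IRLS update: beta_new = (hatX^T W hatX)^{-1} hatX^T W z."""
    N, d = X.shape
    Km1 = beta.shape[0]
    E = np.exp(X @ beta.T)
    P = E / (1.0 + E.sum(axis=1, keepdims=True))          # (N, K-1), p_k(x_i; beta)
    p = P.T.ravel()                                       # stacked p
    Y = (y[:, None] == np.arange(1, Km1 + 1)).astype(float)
    yv = Y.T.ravel()                                      # stacked y
    Xhat = np.kron(np.eye(Km1), X)                        # block-diagonal hat X
    W = build_W(P)                                        # weight matrix from above
    z = Xhat @ beta.ravel() + np.linalg.solve(W, yv - p)  # working response z
    lhs = Xhat.T @ W @ Xhat
    rhs = Xhat.T @ W @ z
    return np.linalg.solve(lhs, rhs).reshape(Km1, d)      # beta_new, row k is beta_k
```

Iterating `irls_step` until the coefficients stop changing reproduces the Newton-Raphson iteration derived above.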