Soln. 4.4
We follow the development in Section 4.4.1 of the textbook. As usual, we augment each observation \(x\in \mathbb{R}^p\) by prepending a constant \(1\), so that \(x^T \leftarrow (1, x^T)\). Thus we have
\[\begin{equation}
P(G=k|X=x) = \frac{\exp(\beta_k^Tx)}{1 + \sum_{l=1}^{K-1}\exp(\beta_l^Tx)},\ \ k =1,...,K-1\non
\end{equation}\]
and
\[\begin{equation}
P(G=K|X=x) = \frac{1}{1 + \sum_{l=1}^{K-1}\exp(\beta_l^Tx)}.\non
\end{equation}\]
Each \(\beta_k\in \mathbb{R}^{p+1}\) for \(k=1,...,K-1\). Denote by \(\beta = (\beta_1^T, ..., \beta_{K-1}^T)^T\) the \((p+1)(K-1)\)-vector consisting of all the coefficients. Let
\[\begin{equation}
\label{eq:4.4pkxi}
p_k(x;\beta) = P(G=k|X=x).\non
\end{equation}\]
The log-likelihood for \(N\) observations is
\[\begin{equation}
l(\beta) = \sum_{i=1}^N\log p_{g_i}(x_i;\beta).\non
\end{equation}\]
We code the classes from \(1\) to \(K\) and write \(y_i = g_i\) for the class label of observation \(i\), so that \(l(\beta)\) can be rewritten as
\[\begin{eqnarray}
l(\beta) &=& \sum_{i=1}^N\left(\sum_{k=1}^{K-1}\bb{1}(y_i=k)\log p_k(x_i;\beta) - \bb{1}(y_i=K)\log \left(1+\sum_{l=1}^{K-1}e^{\beta_l^Tx_i}\right)\right)\non\\
&=&\sum_{i=1}^N\left(\sum_{k=1}^{K-1}\bb{1}(y_i=k)\beta_k^Tx_i-\log\big(1+\sum_{l=1}^{K-1}e^{\beta_l^Tx_i}\big)\right)\label{eq:multiclasslrlog}
\end{eqnarray}\]
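Before maximizing, a minimal numerical sketch (not part of the derivation) may help fix the notation. It is written in Python/NumPy with made-up data and hypothetical helper names (`probs`, `log_likelihood`); class \(K\) is the reference class and \(\beta\) is stored as the stacked vector \((\beta_1^T,\ldots,\beta_{K-1}^T)^T\).

```python
# A made-up data set and two hypothetical helpers implementing the formulas above.
import numpy as np

rng = np.random.default_rng(0)
N, p, K = 200, 3, 4                                  # observations, raw inputs, classes
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p))])  # prepend the constant 1
y = rng.integers(1, K + 1, size=N)                   # class labels coded 1, ..., K

def probs(beta, X):
    """N x (K-1) matrix of p_k(x_i; beta), k = 1, ..., K-1 (class K is the reference).

    beta is the stacked (p+1)(K-1)-vector (beta_1^T, ..., beta_{K-1}^T)^T."""
    B = beta.reshape(K - 1, -1)                      # row k is beta_k^T
    E = np.exp(X @ B.T)                              # e^{beta_k^T x_i}; no overflow guard here
    return E / (1.0 + E.sum(axis=1, keepdims=True))

def log_likelihood(beta, X, y):
    """l(beta) = sum_i [ sum_k 1(y_i=k) beta_k^T x_i - log(1 + sum_l e^{beta_l^T x_i}) ]."""
    B = beta.reshape(K - 1, -1)
    lin = X @ B.T                                    # beta_k^T x_i for all i, k
    onehot = (y[:, None] == np.arange(1, K))         # 1(y_i = k), k = 1, ..., K-1
    return np.sum(onehot * lin) - np.sum(np.log1p(np.exp(lin).sum(axis=1)))

beta0 = np.zeros((K - 1) * (p + 1))
print(log_likelihood(beta0, X, y))                   # equals -N log K at beta = 0
```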
To maximize the log-likelihood, we set its derivatives to zero. These score equations, for \(k=1,...,K-1\), are
\[\begin{eqnarray}
\frac{\partial l(\beta)}{\partial \beta_k} &=& \sum_{i=1}^N \left(\bb{1}(y_i=k)x_i - \frac{x_ie^{\beta_k^Tx_i}}{1+\sum_{l=1}^{K-1}e^{\beta_l^Tx_i}}\right)\non\\
&=&\sum_{i=1}^N x_i\left(\bb{1}(y_i=k) - \frac{e^{\beta_k^Tx_i}}{1+\sum_{l=1}^{K-1}e^{\beta_l^Tx_i}}\right)\in \mathbb{R}^{p+1}\label{eq:multiclasslr1st}
\end{eqnarray}\]
We write
\[\begin{equation}
\label{eq:multiclasslr1stmat}
\frac{\partial l(\beta)}{\partial \beta} =
\begin{pmatrix}
\frac{\partial l(\beta)}{\partial \beta_1}\\
\vdots\\
\frac{\partial l(\beta)}{\partial \beta_{K-1}}
\end{pmatrix}\in \mathbb{R}^{(p+1)(K-1)}.
\end{equation}\]
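Continuing the sketch above (same made-up data and hypothetical helpers), the stacked gradient \(\eqref{eq:multiclasslr1stmat}\) can be checked against central finite differences of \(l(\beta)\):

```python
# Continuing the sketch: the per-class scores stacked in the order
# (beta_1, ..., beta_{K-1}), checked against finite differences of l(beta).
def score(beta, X, y):
    """Stacked per-class scores sum_i x_i (1(y_i = k) - p_k(x_i; beta))."""
    P = probs(beta, X)                               # N x (K-1)
    Y = (y[:, None] == np.arange(1, K)).astype(float)
    return (X.T @ (Y - P)).T.ravel()                 # order (beta_1, ..., beta_{K-1})

beta = 0.1 * rng.normal(size=(K - 1) * (p + 1))
g = score(beta, X, y)

eps = 1e-6
g_num = np.array([(log_likelihood(beta + eps * e, X, y)
                   - log_likelihood(beta - eps * e, X, y)) / (2 * eps)
                  for e in np.eye(beta.size)])
print(np.max(np.abs(g - g_num)))                     # agreement up to O(eps^2) error
```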
We next compute the second-order derivatives of \(l(\beta)\) with respect to \(\beta\). By \(\eqref{eq:multiclasslr1st}\), for \(k\neq j \in \{1,...,K-1\}\), the minus sign in the score and the minus sign from differentiating the denominator cancel, and we get
\[\begin{eqnarray}
\frac{\partial^2 l(\beta)}{\partial \beta_k\partial \beta_j^T}
&=& \sum_{i=1}^Nx_i \left(\frac{e^{\beta_k^Tx_i}e^{\beta_j^Tx_i}x_i^T}{(1+\sum_{l=1}^{K-1}e^{\beta_l^Tx_i})^2}\right)\non\\
&=& \sum_{i=1}^Nx_ix_i^T\frac{e^{\beta_k^Tx_i}}{1+\sum_{l=1}^{K-1}e^{\beta_l^Tx_i}}\cdot \frac{e^{\beta_j^Tx_i}}{1+\sum_{l=1}^{K-1}e^{\beta_l^Tx_i}}\non\\
&=& \sum_{i=1}^Nx_ix_i^Tp_k(x_i;\beta)p_j(x_i;\beta).\label{eq:multiclasslr2nd}
\end{eqnarray}\]
For \(k\in \{1,..., K-1\}\), we have
\[\begin{eqnarray}
\frac{\partial^2 l(\beta)}{\partial \beta_k\partial \beta_k^T}
&=&\sum_{i=1}^Nx_i \left(-\frac{e^{\beta_k^Tx_i}x_i^T\big(1+\sum_{l=1}^{K-1}e^{\beta_l^Tx_i}\big)-e^{\beta_k^Tx_i}e^{\beta_k^Tx_i}x_i^T}{(1+\sum_{l=1}^{K-1}e^{\beta_l^Tx_i})^2}\right)\non\\
&=&-\sum_{i=1}^Nx_ix_i^T\frac{e^{\beta_k^Tx_i}}{1+\sum_{l=1}^{K-1}e^{\beta_l^Tx_i}}\cdot\frac{1+\sum_{l=1}^{K-1}e^{\beta_l^Tx_i}-e^{\beta_k^Tx_i}}{1+\sum_{l=1}^{K-1}e^{\beta_l^Tx_i}}\non\\
&=&-\sum_{i=1}^Nx_ix_i^Tp_k(x_i;\beta)p^c_k(x_i;\beta),\label{eq:multiclasslr2nd2}
\end{eqnarray}\]
where \(p^c_k(x_i;\beta) = 1 - p_k(x_i;\beta)\).
By \(\eqref{eq:multiclasslr2nd}\) and \(\eqref{eq:multiclasslr2nd2}\), we write
\[\begin{equation}
\frac{\partial^2l(\beta)}{\partial\beta\partial\beta^T}
=\begin{pmatrix}
\frac{\partial^2l(\beta)}{\partial\beta_1\partial\beta_1^T} & \frac{\partial^2l(\beta)}{\partial\beta_1\partial\beta_2^T} & \cdots & \frac{\partial^2l(\beta)}{\partial\beta_1\partial\beta_{K-1}^T}\\
\frac{\partial^2l(\beta)}{\partial\beta_2\partial\beta_1^T} & \frac{\partial^2l(\beta)}{\partial\beta_2\partial\beta_2^T} & \cdots & \frac{\partial^2l(\beta)}{\partial\beta_2\partial\beta_{K-1}^T}\\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2l(\beta)}{\partial\beta_{K-1}\partial\beta_1^T} & \frac{\partial^2l(\beta)}{\partial\beta_{K-1}\partial\beta_2^T} & \cdots & \frac{\partial^2l(\beta)}{\partial\beta_{K-1}\partial\beta_{K-1}^T}
\end{pmatrix}\in \mathbb{R}^{(K-1)(p+1)\times (K-1)(p+1)}
\end{equation}\]
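Under the same assumptions as the earlier sketch, the Hessian can be assembled block by block from \(\eqref{eq:multiclasslr2nd}\) and \(\eqref{eq:multiclasslr2nd2}\) and checked against a numerical differentiation of the score:

```python
# Continuing the sketch: the Hessian assembled block by block, with the diagonal
# blocks -X^T diag(p_k p_k^c) X and the off-diagonal blocks +X^T diag(p_k p_j) X,
# checked against finite differences of the score.
def hessian(beta, X):
    P = probs(beta, X)
    d = (p + 1) * (K - 1)
    H = np.zeros((d, d))
    for k in range(K - 1):
        for j in range(K - 1):
            if k == j:
                w = -P[:, k] * (1.0 - P[:, k])       # -p_k p_k^c
            else:
                w = P[:, k] * P[:, j]                # +p_k p_j
            H[k*(p+1):(k+1)*(p+1), j*(p+1):(j+1)*(p+1)] = X.T @ (w[:, None] * X)
    return H

H = hessian(beta, X)
H_num = np.array([(score(beta + eps * e, X, y)
                   - score(beta - eps * e, X, y)) / (2 * eps)
                  for e in np.eye(beta.size)])
print(np.max(np.abs(H - H_num)))                     # small; the Hessian is symmetric
```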
Starting with \(\beta^{\text{old}}\), a single Newton update is
\[\begin{equation}
\beta^{\text{new}} = \beta^{\text{old}} - \left(\frac{\partial^2l(\beta)}{\partial\beta\partial\beta^T}\right)^{-1}\frac{\partial l(\beta)}{\partial \beta}\non
\end{equation}\]
where the derivatives are evaluated at \(\beta^{\text{old}}\).
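A plain (undamped) Newton iteration built from the sketched helpers then looks as follows; in practice one would add step-size control, but this mirrors the update above:

```python
# Continuing the sketch: a plain Newton iteration from beta = 0.
beta_hat = np.zeros((K - 1) * (p + 1))
for it in range(25):
    step = np.linalg.solve(hessian(beta_hat, X), score(beta_hat, X, y))
    beta_hat = beta_hat - step                       # beta_new = beta_old - H^{-1} grad
    if np.max(np.abs(step)) < 1e-10:
        break
print(it, log_likelihood(beta_hat, X, y))            # l(beta) increases to its maximum
```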
Next we write the score and the Hessian in matrix notation. For \(k=1,...,K-1\), define
\[\begin{equation}
\bb{y}_k =
\begin{pmatrix}
\bb{1}(y_1=k)\\
\bb{1}(y_2=k)\\
\vdots\\
\bb{1}(y_N=k)
\end{pmatrix}, \ \
\bb{X} = \begin{pmatrix}
x_1^T\\
x_2^T\\
\vdots\\
x_{N}^T
\end{pmatrix}, \ \
\bb{p}_k=
\begin{pmatrix}
p_k(x_1;\beta)\\
p_k(x_2;\beta)\\
\vdots\\
p_k(x_N;\beta)
\end{pmatrix}
\end{equation}\]
Then \(\eqref{eq:multiclasslr1st}\) is written in matrix notation as
\[\begin{equation}
\frac{\partial l(\beta)}{\partial \beta_k} = \bb{X}^T(\bb{y}_k - \bb{p}_k).\non
\end{equation}\]
Further define stacked vectors
\[\begin{equation}
\bb{y} = \begin{pmatrix}
\bb{y}_1\\
\bb{y}_2\\
\vdots\\
\bb{y}_{K-1}
\end{pmatrix}\ \ \text{and} \ \
\bb{p} = \begin{pmatrix}
\bb{p}_1\\
\bb{p}_2\\
\vdots\\
\bb{p}_{K-1}
\end{pmatrix}.\non
\end{equation}\]
We can thus rewrite \(\eqref{eq:multiclasslr1stmat}\) as
\[\begin{equation}
\label{eq:4.4mat1st}
\frac{\partial l(\beta)}{\partial \beta} =
\begin{pmatrix}
\bb{X}^T & 0 & \cdots & 0\\
0 & \bb{X}^T & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
0 & 0 & \cdots & \bb{X}^T
\end{pmatrix} \begin{pmatrix}
\bb{y}_1 - \bb{p}_1\\
\bb{y}_2 - \bb{p}_2\\
\vdots\\
\bb{y}_{K-1} - \bb{p}_{K-1}
\end{pmatrix} = \boldsymbol{\hat X}^T (\bb{y} - \bb{p})
\end{equation}\]
where \(\boldsymbol{\hat X}\in\mathbb{R}^{N(K-1)\times (p+1)(K-1)}\) is the block-diagonal matrix with \(\bb{X}\) on the diagonal positions, so that the block-diagonal matrix displayed above is \(\boldsymbol{\hat X}^T\).
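Continuing the sketch, the stacked vectors \(\bb{y}\), \(\bb{p}\) and the block-diagonal \(\boldsymbol{\hat X}\) can be formed explicitly (here with `scipy.linalg.block_diag`) to confirm that \(\boldsymbol{\hat X}^T(\bb{y}-\bb{p})\) reproduces the score on the made-up data:

```python
# Continuing the sketch: stacked y, p and the block-diagonal Xhat, and a check
# that Xhat^T (y - p) reproduces the stacked score.
from scipy.linalg import block_diag

P = probs(beta, X)
Y = (y[:, None] == np.arange(1, K)).astype(float)
y_vec = Y.T.ravel()                                  # (y_1^T, ..., y_{K-1}^T)^T
p_vec = P.T.ravel()                                  # (p_1^T, ..., p_{K-1}^T)^T
Xhat = block_diag(*[X] * (K - 1))                    # N(K-1) x (p+1)(K-1)

print(np.max(np.abs(Xhat.T @ (y_vec - p_vec) - score(beta, X, y))))  # ~0
```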
For \(k=1,...,K-1\), let
\[\begin{equation}
\bb{P}_k = \begin{pmatrix}
p_k(x_1;\beta)p_k^c(x_1;\beta) & 0 & \cdots & 0\\
0 & p_k(x_2;\beta)p_k^c(x_2\beta) & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
0 & 0 & \cdots & p_k(x_N; \beta)p_k^c(x_N;\beta)
\end{pmatrix}.\non
\end{equation}\]
By \(\eqref{eq:multiclasslr2nd2}\), we have for \(k=1,...,K-1\) that
\[\begin{equation}
\frac{\partial^2 l(\beta)}{\partial \beta_k\partial\beta^T_k} = - \bb{X}^T\bb{P}_k\bb{X}.\non
\end{equation}\]
For \(k=1,...,K-1\), let
\[\begin{equation}
\bb{R}_k = \begin{pmatrix}
p_k(x_1;\beta) & 0 & \cdots & 0\\
0 & p_k(x_2;\beta) & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
0 & 0 & \cdots & p_k(x_N; \beta)
\end{pmatrix}.\non
\end{equation}\]
By \(\eqref{eq:multiclasslr2nd}\), we have for \(k\neq j\in \{1,...,K-1\}\) that
\[\begin{equation}
\frac{\partial^2 l(\beta)}{\partial \beta_k\partial\beta^T_j} = \bb{X}^T\bb{R}_k\bb{R}_j\bb{X}.\non
\end{equation}\]
Therefore we have
\[\begin{equation}
\frac{\partial^2 l(\beta)}{\partial\beta\partial\beta^T}
=
-\begin{pmatrix}
\bb{X}^T\bb{P}_1\bb{X}& -\bb{X}^T\bb{R}_1\bb{R}_2\bb{X}&\cdots&-\bb{X}^T\bb{R}_{1}\bb{R}_{K-1}\bb{X}\\
-\bb{X}^T\bb{R}_2\bb{R}_1\bb{X}& \bb{X}^T\bb{P}_2\bb{X}&\cdots&-\bb{X}^T\bb{R}_{2}\bb{R}_{K-1}\bb{X}\\
\vdots & \vdots &\ddots &\vdots\\
-\bb{X}^T\bb{R}_{K-1}\bb{R}_1\bb{X}& -\bb{X}^T\bb{R}_{K-1}\bb{R}_2\bb{X}&\cdots&\bb{X}^T\bb{P}_{K-1}\bb{X}
\end{pmatrix}.\non
\end{equation}\]
Let
\[\begin{equation}
\bb{W} = \begin{pmatrix}
\bb{P}_1& -\bb{R}_1\bb{R}_2&\cdots&-\bb{R}_{1}\bb{R}_{K-1}\\
-\bb{R}_2\bb{R}_1& \bb{P}_2&\cdots&-\bb{R}_{2}\bb{R}_{K-1}\\
\vdots & \vdots &\ddots &\vdots\\
-\bb{R}_{K-1}\bb{R}_1& -\bb{R}_{K-1}\bb{R}_2&\cdots&\bb{P}_{K-1}
\end{pmatrix}\in\mathbb{R}^{N(K-1)\times N(K-1)}.\non
\end{equation}\]
Recalling \(\boldsymbol{\hat X}\) defined in \(\eqref{eq:4.4mat1st}\), we can rewrite the Hessian above as
\[\begin{equation}
\label{eq:ex4.4hessianmatrix}
\frac{\partial^2 l(\beta)}{\partial\beta\partial\beta^T} = - \boldsymbol{\hat X}^T\bb{W}\boldsymbol{\hat X}.\non
\end{equation}\]
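The weight matrix \(\bb{W}\) can likewise be built explicitly, though in this dense form only for small \(N\) and \(K\); the check below confirms the identity above numerically on the sketched data:

```python
# Continuing the sketch: the dense N(K-1) x N(K-1) weight matrix W, with P_k on
# the diagonal blocks and -R_k R_j off the diagonal, and a numerical check of
# the identity Hessian = -Xhat^T W Xhat.
W = np.zeros((N * (K - 1), N * (K - 1)))
for k in range(K - 1):
    for j in range(K - 1):
        if k == j:
            Wkj = np.diag(P[:, k] * (1.0 - P[:, k]))   # P_k = diag(p_k p_k^c)
        else:
            Wkj = -np.diag(P[:, k] * P[:, j])          # -R_k R_j = -diag(p_k p_j)
        W[k*N:(k+1)*N, j*N:(j+1)*N] = Wkj

print(np.max(np.abs(-Xhat.T @ W @ Xhat - hessian(beta, X))))          # ~0
```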
The Newton step is thus
\[\begin{eqnarray}
\beta^{\text{new}} &=& \beta^{\text{old}} + (\boldsymbol{\hat X}^T\bb{W}\boldsymbol{\hat X})^{-1}\boldsymbol{\hat X}^T(\bb{y}-\bb{p})\non\\
&=&(\boldsymbol{\hat X}^T\bb{W}\boldsymbol{\hat X})^{-1}\boldsymbol{\hat X}^T\bb{W}(\boldsymbol{\hat X}\beta^{\text{old}} + \bb{W}^{-1}(\bb{y}-\bb{p}))\non\\
&=&(\boldsymbol{\hat X}^T\bb{W}\boldsymbol{\hat X})^{-1}\boldsymbol{\hat X}^T\bb{W}\bb{z}.\non
\end{eqnarray}\]
In the second and third lines we have re-expressed the Newton step as a weighted least squares step (assuming \(\bb{W}\) is invertible), with the working response
\[\begin{equation}
\bb{z} = \boldsymbol{\hat X}\beta^{\text{old}} + \bb{W}^{-1}(\bb{y}-\bb{p}).\non
\end{equation}\]
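On the sketched data, one weighted least squares step with this working response can be compared with the plain Newton step; forming \(\bb{W}^{-1}\) requires all fitted probabilities to be strictly between \(0\) and \(1\).

```python
# Continuing the sketch: one weighted least squares step with the working
# response z, compared with the plain Newton step.
z = Xhat @ beta + np.linalg.solve(W, y_vec - p_vec)
beta_wls = np.linalg.solve(Xhat.T @ W @ Xhat, Xhat.T @ W @ z)
beta_newton = beta - np.linalg.solve(hessian(beta, X), score(beta, X, y))
print(np.max(np.abs(beta_wls - beta_newton)))        # ~0: the two updates coincide
```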