Ex. 11.3
Derive the forward and backward propagation equations for the cross-entropy loss function.
Soln. 11.3
For cross-entropy (deviance) we have
\[\begin{equation}
\label{eq:11-3a}
R(\theta) = -\sum_{i=1}^N\sum_{k=1}^Ky_{ik}\log(f_k(x_i)),\non
\end{equation}\]
and the corresponding classifier is \(G(x) = \underset{k}{\operatorname{argmax}}f_k(x)\).
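As a quick numerical illustration (not part of the derivation), here is a minimal NumPy sketch that evaluates \(R(\theta)\) and \(G(x)\) from a small table of fitted probabilities \(f_k(x_i)\) and one-hot targets \(y_{ik}\); the values of `f` and `y` below are made up purely for illustration.

```python
import numpy as np

# Made-up fitted probabilities f_k(x_i) (rows sum to one) and one-hot targets y_{ik}.
f = np.array([[0.8, 0.2],
              [0.3, 0.7],
              [0.6, 0.4]])
y = np.array([[1, 0],
              [0, 1],
              [1, 0]])

R = -np.sum(y * np.log(f))     # cross-entropy R(theta): about 1.09 here
G = np.argmax(f, axis=1)       # classifier G(x_i) = argmax_k f_k(x_i) -> [0, 1, 0]
print(R, G)
```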
Let \(z_{mi} = \sigma(\alpha_{0m} + \alpha_m^Tx_i)\), as in (11.5) of the text, and let \(z_i = (z_{1i}, z_{2i}, \dots, z_{Mi})\), so that the outputs are \(f_k(x_i) = g_k(\beta_k^Tz_i)\). Then we have
\[\begin{eqnarray}
R(\theta) &=& \sum_{i=1}^NR_i\non\\
&=&\sum_{i=1}^N\sum_{k=1}^K\left(-y_{ik}\log(f_k(x_i))\right),\non
\end{eqnarray}\]
with derivatives
\[\begin{eqnarray}
\frac{\partial R_i}{\partial \beta_{km}} &=& -\frac{y_{ik}}{f_k(x_i)}g'_k(\beta_k^Tz_i)z_{mi},\non\\
\frac{\partial R_i}{\partial \alpha_{ml}} &=& -\sum_{k=1}^K\frac{y_{ik}}{f_k(x_i)}g'_k(\beta_k^Tz_i)\beta_{km}\sigma'(\alpha_{0m} + \alpha_m^Tx_i)x_{il}.\label{eq:11-3b}
\end{eqnarray}\]
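The NumPy sketch below evaluates these two derivative formulas on a single example and compares them with central finite differences. For concreteness it assumes a sigmoid hidden activation \(\sigma\) and, as an additional assumption, an elementwise sigmoid output \(g_k\), so that \(g'_k\) depends only on \(\beta_k^Tz_i\) exactly as written above; with a softmax output each \(f_k\) would also depend on the other \(\beta_{k'}\), adding cross-terms. The dimensions and random weights are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Toy sizes and a single example (x_i, y_i); the weights are random placeholders.
p, M, K = 4, 3, 2
x = rng.normal(size=p)                 # x_i
y = np.array([1.0, 0.0])               # one-hot y_i
alpha0 = rng.normal(size=M)            # alpha_{0m}
alpha = rng.normal(size=(M, p))        # rows are alpha_m
beta = rng.normal(size=(K, M))         # beta_{km}

def loss(alpha0, alpha, beta, x, y):
    """R_i for one example, with sigmoid hidden layer and elementwise sigmoid output."""
    z = sigmoid(alpha0 + alpha @ x)    # z_{mi}
    f = sigmoid(beta @ z)              # f_k(x_i) = g(beta_k^T z_i)
    return -np.sum(y * np.log(f))

# Analytic derivatives, following the two formulas above.
z = sigmoid(alpha0 + alpha @ x)
f = sigmoid(beta @ z)
gprime = f * (1.0 - f)                 # g'_k(beta_k^T z_i) for sigmoid g
sigprime = z * (1.0 - z)               # sigma'(alpha_{0m} + alpha_m^T x_i)
common = -(y / f) * gprime             # -(y_{ik}/f_k(x_i)) g'_k(beta_k^T z_i)
dR_dbeta = np.outer(common, z)                          # entry (k, m): dR_i / d beta_{km}
dR_dalpha = np.outer(sigprime * (beta.T @ common), x)   # entry (m, l): dR_i / d alpha_{ml}

# Central finite-difference check on one entry of beta and one entry of alpha.
eps = 1e-6
b_plus, b_minus = beta.copy(), beta.copy()
b_plus[0, 1] += eps; b_minus[0, 1] -= eps
a_plus, a_minus = alpha.copy(), alpha.copy()
a_plus[2, 3] += eps; a_minus[2, 3] -= eps
fd_beta = (loss(alpha0, alpha, b_plus, x, y) - loss(alpha0, alpha, b_minus, x, y)) / (2 * eps)
fd_alpha = (loss(alpha0, a_plus, beta, x, y) - loss(alpha0, a_minus, beta, x, y)) / (2 * eps)
print(np.allclose(dR_dbeta[0, 1], fd_beta), np.allclose(dR_dalpha[2, 3], fd_alpha))
```

The same check can be run on any entry of \(\alpha\) or \(\beta\).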
Given these derivatives, a gradient descent update at the \((r+1)\)st iteration has the form
\[\begin{eqnarray}
\beta_{km}^{(r+1)} &=& \beta_{km}^{(r)} - \gamma_r\sum_{i=1}^N\frac{\partial R_i}{\partial \beta^{(r)}_{km}},\non\\
\alpha_{ml}^{(r+1)} &=& \alpha_{ml}^{(r)} - \gamma_r\sum_{i=1}^N\frac{\partial R_i}{\partial \alpha^{(r)}_{ml}},\non
\end{eqnarray}\]
where \(\gamma_r\) is the learning rate. We write \(\eqref{eq:11-3b}\) as
\[\begin{eqnarray}
\frac{\partial R_i}{\partial \beta_{km}} &=& \delta_{ki}z_{mi},\non\\
\frac{\partial R_i}{\partial \alpha_{ml}} &=& s_{mi}x_{il}.\non
\end{eqnarray}\]
The quantities \(\delta_{ki}\) and \(s_{mi}\) are "errors" from the current model at the output and hidden layer units, respectively. From their definitions, these errors satisfy
\[\begin{equation}
s_{mi} = \sigma'(\alpha_{0m}+\alpha_m^Tx_i)\sum_{k=1}^K\beta_{km}\delta_{ki},\non
\end{equation}\]
known as the back-propagation equations. Using these, the gradient descent updates above can be computed with a two-pass algorithm: in the forward pass, the values \(z_{mi}\) and \(f_k(x_i)\) are computed from the current weights; in the backward pass, the output errors \(\delta_{ki}\) are computed and then back-propagated through this equation to give the hidden-layer errors \(s_{mi}\), from which both sets of derivatives follow.
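To illustrate the two-pass structure, here is a NumPy sketch (under the same elementwise-sigmoid output assumption as in the earlier sketch) that runs a forward pass to obtain \(z_{mi}\) and \(f_k(x_i)\), a backward pass computing \(\delta_{ki}\) and then \(s_{mi}\) via the back-propagation equation, assembles the summed gradients, and takes one gradient-descent step with learning rate \(\gamma_r\). The data, sizes, and starting weights are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Toy batch: N examples, p inputs, M hidden units, K classes (one-hot rows of Y).
N, p, M, K = 8, 4, 3, 2
X = rng.normal(size=(N, p))
Y = np.eye(K)[rng.integers(K, size=N)]
alpha0 = rng.normal(size=M)            # alpha_{0m}
alpha = rng.normal(size=(M, p))        # rows are alpha_m
beta = rng.normal(size=(K, M))         # beta_{km}
gamma = 0.1                            # learning rate gamma_r

# Forward pass: hidden values z_{mi}, then outputs f_k(x_i) = g(beta_k^T z_i).
A = alpha0 + X @ alpha.T               # N x M: alpha_{0m} + alpha_m^T x_i
Z = sigmoid(A)                         # N x M: z_{mi}
F = sigmoid(Z @ beta.T)                # N x K: f_k(x_i)
print("R(theta) before the update:", -np.sum(Y * np.log(F)))

# Backward pass: output-layer errors delta_{ki}, then hidden-layer errors s_{mi}
# from the back-propagation equation s_{mi} = sigma'(.) * sum_k beta_{km} delta_{ki}.
delta = -(Y / F) * F * (1.0 - F)       # N x K: delta_{ki} = -(y_{ik}/f_k) g'_k(beta_k^T z_i)
S = Z * (1.0 - Z) * (delta @ beta)     # N x M: s_{mi}

# Gradients summed over the batch, then one gradient-descent update
# (alpha_{0m} is held fixed here, since its derivative is not written out above).
grad_beta = delta.T @ Z                # entry (k, m): sum_i delta_{ki} z_{mi}
grad_alpha = S.T @ X                   # entry (m, l): sum_i s_{mi} x_{il}
beta_new = beta - gamma * grad_beta
alpha_new = alpha - gamma * grad_alpha

F_new = sigmoid(sigmoid(alpha0 + X @ alpha_new.T) @ beta_new.T)
print("R(theta) after the update: ", -np.sum(Y * np.log(F_new)))
```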