
As you can see, our 3-layer network has 2 inputs, 3 hidden neurons, and 1 output (ignoring the bias nodes). There is one theta (weight) matrix between each pair of adjacent layers, so two in total: $\theta_1$ is $3 \times 3$ and $\theta_2$ is $4 \times 1$ (these are matrix dimensions, counting the bias unit in the input and hidden layers). Great, now how do we calculate those partial derivatives so we can do gradient descent? Here are the steps.
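To make those dimensions concrete, here is a minimal NumPy sketch of a forward pass through this network. It assumes sigmoid activations and puts the bias unit at index 0 of each layer; the names `sigmoid`, `forward`, `theta1`, and `theta2` are my own for illustration, and the weights are just random numbers.

```python
import numpy as np

def sigmoid(z):
    # Logistic activation (an assumption; any activation could be used here).
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, theta1, theta2):
    # Input layer: prepend the bias unit (index 0) to the 2 inputs -> shape (3,)
    a1 = np.concatenate(([1.0], x))
    # Hidden layer: 3 neurons, then prepend the bias unit -> shape (4,)
    a2 = np.concatenate(([1.0], sigmoid(a1 @ theta1)))
    # Output layer: 1 neuron -> shape (1,)
    h = sigmoid(a2 @ theta2)
    return a1, a2, h

rng = np.random.default_rng(0)
theta1 = rng.normal(size=(3, 3))  # (2 inputs + bias) x 3 hidden neurons
theta2 = rng.normal(size=(4, 1))  # (3 hidden + bias) x 1 output neuron

a1, a2, h = forward(np.array([0.0, 1.0]), theta1, theta2)
print(a1.shape, a2.shape, h.shape)
```

Treating each layer's activations as a row vector that multiplies the theta matrix from the left is just one convention; the transposed layout works equally well as long as you stay consistent.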

1. Starting at the output neurons, we calculate their node delta, $\delta$. In our case, we only have one output neuron, and therefore a single $\delta_1$. Deltas represent the error for each node in the network. The output node's delta is easy to calculate: it's simply $\delta_1 = (h(x) - y)$, which is what the output neuron output minus $y$, the expected value. Note that even if we had more than one output neuron, we could calculate all their deltas, $\delta_1 ... \delta_j$, in the same way and all at once, $\delta_j^{L} = (h(x) - y)$, if $h(x)$ and $y$ are vectors. $L$ is the number of layers, so $\delta_j^{L}$ refers to the output node deltas.

2. Here's where we start the backpropagation. To calculate the previous layer's (in our case, the hidden layer) deltas, we backpropagate the output errors/deltas using this formula:
$$\delta_j^{l} = (\theta_l * \delta_j^{l+1}) \odot (a^{l} \odot (1 - a^{l}))$$
Where $*$ indicates the dot product (a matrix-vector product here), $\odot$ indicates element-wise multiplication (Hadamard product), and $l$ is the layer number (e.g. $l=2$ is the hidden layer for our example NN). So $\delta_j^3$ refers to the output layer deltas, whereas $\delta_j^{2}$ refers to the previous, hidden layer deltas, and $\delta_j^{1}$ would be 2 layers before the output layer (the input layer). $a^{l}$ refers to the activations/outputs of layer $l$ (e.g. the hidden layer if $l=2$). Note: **We only calculate deltas back through the hidden layers; we don't calculate deltas for the input layer.**

3. Calculate the gradients using this formula: $$\frac{\partial C}{\partial \theta_j^l} = \delta^{l+1} * a^{l}$$
For example, the gradients for weights between the hidden layer and output layer ($\theta_2$, $l=2$) are $\frac{\partial C}{\partial \theta_j^2} = \delta^{3} * a^{2}$
**Important:** We do NOT backpropagate deltas from bias units. Thus, when we calculate the gradients for $\theta_1$ (the weights between input layer and hidden layer), we should omit the bias unit delta from the hidden layer like this:
$\frac{\partial C}{\partial \theta_j^1} = \delta^{2} * a^{1}$, where $\delta^{2}$ includes $\delta_1^{2}$ and $\delta_2^{2}$, NOT $\delta_0^{2}$ (the bias unit, j = 0).

4. Use these gradients to perform gradient descent like normal.
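
The four steps above can be sketched in NumPy roughly as follows. This is only an illustration under some assumptions: sigmoid activations (which is where the $a \odot (1 - a)$ term comes from), a cross-entropy cost (which is what makes the output delta simply $h(x) - y$), and the bias unit at index 0 of each layer. The names `backprop`, `grad1`, `grad2`, and `learning_rate` are my own, and the learning rate is an arbitrary value.

```python
import numpy as np

def sigmoid(z):
    # Logistic activation (assumed; it matches the a * (1 - a) derivative term).
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, theta1, theta2):
    # Forward pass, keeping each layer's activations (bias unit at index 0).
    a1 = np.concatenate(([1.0], x))                      # shape (3,)
    a2 = np.concatenate(([1.0], sigmoid(a1 @ theta1)))   # shape (4,)
    h = sigmoid(a2 @ theta2)                             # shape (1,)

    # Step 1: output delta is the output minus the expected value.
    delta3 = h - y                                       # shape (1,)

    # Step 2: backpropagate to the hidden layer.
    delta2 = (theta2 @ delta3) * (a2 * (1.0 - a2))       # shape (4,)
    # Drop the bias unit's delta (j = 0); it is not backpropagated.
    delta2 = delta2[1:]                                  # shape (3,)

    # Step 3: gradients. For weight matrices, "delta * a" is an outer
    # product, giving one entry per weight.
    grad2 = np.outer(a2, delta3)                         # shape (4, 1)
    grad1 = np.outer(a1, delta2)                         # shape (3, 3)
    return grad1, grad2

rng = np.random.default_rng(0)
theta1 = rng.normal(scale=0.5, size=(3, 3))
theta2 = rng.normal(scale=0.5, size=(4, 1))
x, y = np.array([0.0, 1.0]), np.array([1.0])  # one made-up training example

grad1, grad2 = backprop(x, y, theta1, theta2)

# Step 4: one plain gradient descent update.
learning_rate = 0.1  # arbitrary choice for illustration
new_theta1 = theta1 - learning_rate * grad1
new_theta2 = theta2 - learning_rate * grad2
```

In a real training loop you would repeat this over many examples (or average the gradients over a batch) until the cost stops improving. A numerical gradient check, where you nudge one weight by a tiny amount and compare the change in cost against the matching entry of `grad1` or `grad2`, is a handy way to convince yourself the backprop code is right.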

Note: If you want to learn where the backpropagation equations came from, check out my post on computational graphs.