The dataset encodes each image as 28x28 = 784 pixel values. Each pixel is a single grayscale intensity in [0, 255],
where 0 => completely black and 255 => completely white. Each "image" therefore exists as a matrix of these values.
So for the input layer (0th) we take 784 nodes, one per pixel, with the image matrix flattened into a single input vector.
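As a minimal sketch of this input representation (assuming NumPy; the image array here is a random stand-in for a real dataset sample), flattening a 28x28 image into the 784-value input vector:

```python
import numpy as np

# Stand-in 28x28 grayscale image with intensities in [0, 255]
image = np.random.randint(0, 256, size=(28, 28))

# Flatten to a 784-element column vector and scale to [0, 1]
x = image.reshape(784, 1) / 255.0
assert x.shape == (784, 1)
```

Scaling to [0, 1] is a common convenience, not something the raw dataset requires.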
We will use only one hidden layer (1st), as the prediction we want to make isn't very complex. It will have
10 nodes, each applying a linear transformation to the input vector and returning a single value.
The value for the hidden layer is calculated as follows:
- We multiply the input vector by a weight matrix.
- We add a bias vector.
- We apply the ReLU function: ReLU(x) = max(0, x), i.e. 0 when x <= 0 and x when x > 0.
- Note that these weights and biases are what we are actually going to optimize for in the training part of the neural network.
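The hidden-layer steps above can be sketched as follows (a NumPy sketch; the names W1, b1 and the random initialization are illustrative assumptions, not part of the original notes):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((784, 1))             # flattened input image, scaled to [0, 1]

# Hypothetical hidden-layer parameters: 10 nodes, so W1 is 10x784
W1 = rng.standard_normal((10, 784)) * 0.01
b1 = np.zeros((10, 1))

z1 = W1 @ x + b1                     # multiply by weights, then add bias
a1 = np.maximum(0, z1)               # ReLU: 0 where z1 <= 0, z1 where z1 > 0
assert a1.shape == (10, 1)
```

In training, W1 and b1 are the quantities being optimized; here they are just randomly initialized.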
In the output layer (2nd) we have 10 nodes which represent the probabilities of each of the 10 possible digits (0-9).
The value for the output layer is calculated as follows:
- We multiply the hidden-layer activations by a weight matrix.
- We add a bias vector.
- We apply the softmax function, which outputs values between 0 and 1 that sum to 1, as required for probabilities.
- The softmax function is softmax(z_i) = e^(z_i) / (sum over all j of e^(z_j)).
- As with the hidden layer, these weights and biases are learned in the training part of the neural network.
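The output-layer steps can likewise be sketched (again a NumPy sketch with hypothetical names W2, b2; the max-shift before exponentiating is a standard numerical-stability trick that does not change the softmax result):

```python
import numpy as np

rng = np.random.default_rng(1)
a1 = np.maximum(0, rng.standard_normal((10, 1)))  # stand-in hidden activations

# Hypothetical output-layer parameters: 10 outputs from 10 hidden nodes
W2 = rng.standard_normal((10, 10)) * 0.01
b2 = np.zeros((10, 1))

z2 = W2 @ a1 + b2
# Softmax: e^(z_i) / sum_j e^(z_j), shifted by max(z2) for stability
exp_z = np.exp(z2 - z2.max())
a2 = exp_z / exp_z.sum()
assert np.isclose(a2.sum(), 1.0)     # probabilities sum to 1
```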
The digit corresponding to the node with the highest probability is taken as the output of the neural network.
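Picking the final answer is then a simple argmax over the 10 probabilities (the probability vector below is a made-up example, not real network output):

```python
import numpy as np

# Stand-in probability vector from the output layer (sums to 1)
a2 = np.array([0.05, 0.02, 0.60, 0.03, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05])

# Index of the highest probability is the predicted digit
prediction = int(np.argmax(a2))
# Here prediction == 2
```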