1: Find the convolution output volume size of layer 2 (W2×H2×D2)
2: Find the convolution output volume size of layer 3 (W3×H3×D3)
3: Find the output volume size of the pooling layer (W2×H2×D2)
1. Convolution output volume size of layers 2 (W2×H2×D2) and 3 (W3×H3×D3)
The output of a convolution layer is computed as follows:
the depth (number of feature maps) is equal to the number of filters applied in this layer;
the width (and likewise the height) is computed according to the following equation:
W_out = (W - F + 2P)/S + 1, where F is the receptive field (filter width), P is the padding, and S is the stride.
A common setting of the hyperparameters is F = 3, S = 1, P = 1. However, there are common conventions and rules of thumb that motivate these hyperparameters.
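As a quick sanity check of this formula, here is a minimal Python helper (the function name and the example input sizes are mine, chosen for illustration):

```python
def conv_output_width(W, F, S, P):
    """Output width of a conv layer: (W - F + 2P)/S + 1.
    The same formula applies to the height."""
    out = (W - F + 2 * P) / S + 1
    assert out == int(out), "hyperparameters do not fit the input size"
    return int(out)

# The common setting F=3, S=1, P=1 preserves the spatial size:
print(conv_output_width(32, 3, 1, 1))  # 32
# A 7x7 input with a 3x3 filter, stride 2, no padding:
print(conv_output_width(7, 3, 2, 0))   # 3
```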
It is common to periodically insert a pooling layer between successive conv layers in a ConvNet architecture. Its function is to progressively reduce the spatial size of the representation, in order to reduce the number of parameters and the amount of computation in the network, and hence to also control overfitting. The pooling layer operates independently on every depth slice of the input and resizes it spatially, using the MAX operation. The most common form is a pooling layer with filters of size 2x2 applied with a stride of 2, which downsamples every depth slice in the input by 2 along both width and height, discarding 75% of the activations. Every MAX operation would in this case be taking a max over 4 numbers (a little 2x2 region in some depth slice). The depth dimension remains unchanged. More generally, the pooling layer accepts a volume of size W1×H1×D1 and produces a volume of size W2×H2×D2, where W2 = (W1 - F)/S + 1, H2 = (H1 - F)/S + 1, and D2 = D1.
It is worth noting that there are only two commonly seen variations of the max pooling layer found in practice: a pooling layer with F = 3, S = 2 (also called overlapping pooling), and more commonly F = 2, S = 2. Pooling sizes with larger receptive fields are too destructive.
General pooling. In addition to max pooling, the pooling units can also perform other functions, such as average pooling or even L2-norm pooling. Average pooling was often used historically but has recently fallen out of favor compared to the max pooling operation, which has been shown to work better in practice.
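As a rough sketch of the F = 2, S = 2 max pooling described above (a minimal NumPy illustration; the function name is mine):

```python
import numpy as np

def max_pool_2x2(x):
    """Max-pool a single depth slice with F=2, S=2.

    x: 2D array with even height and width.
    Each output value is the max over a non-overlapping 2x2 region,
    so 75% of the activations are discarded.
    """
    h, w = x.shape
    # Group the input into 2x2 blocks, then take the max of each block.
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.arange(16).reshape(4, 4)
print(max_pool_2x2(x))  # 4x4 input -> 2x2 output: [[5, 7], [13, 15]]
```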
2.
2D convolutional layers constitute Convolutional Neural Networks (CNNs) along with pooling and fully-connected layers, and form the basis of deep learning. So if you want to go deeper into CNNs and deep learning, the first step is to get more familiar with how convolutional layers work. If you are not familiar with applying 2D filters on images, see the earlier image filtering post. In that post, we had a 2D filter kernel (a 2D matrix) and a single-channel image (grayscale image). To calculate the convolution, we swept the kernel over the image (if you remember, we should flip the kernel first and then do the convolution; for the rest of this post we assume the kernel is already flipped) and calculated the output at every single location. In fact, the stride of our convolution was 1. You might ask: what is a stride? The stride is the number of pixels by which we slide our filter, horizontally or vertically. In other words, in that case we moved our filter one pixel at each step to calculate the next convolution output. For a convolution with stride 2, however, we calculate the output for every other pixel (i.e., jump 2 pixels), and consequently the output of the convolution is roughly half the size of the input image. Figure 1 compares two 2D convolutions with strides one and two, respectively.
Note that you can have different strides horizontally and vertically. You can use the following equations to calculate the exact size of the convolution output for an input of size (width = W, height = H) and a filter of size (width = Fw, height = Fh):
output width = (W - Fw + 2P)/Sw + 1
output height = (H - Fh + 2P)/Sh + 1
where Sw and Sh are the horizontal and vertical strides of the convolution, respectively, and P is the amount of zero padding added to the border of the image (look at the previous post if you are not familiar with the zero-padding concept). However, the output width or height calculated from these equations might be a non-integer value. In that case, you might want to handle the situation in some way to get the desired output dimension. Here, we explain how TensorFlow approaches the issue. In general, you have two main options for the padding scheme, which determine the output size, namely the 'SAME' and 'VALID' padding schemes. In the 'SAME' padding scheme, in which we have zero padding, the size of the output will be
output height = ceil(H / Sh)
output width = ceil(W / Sw)
If the number of padding pixels required to reach the desired output size is even, we can simply add half of it to each side of the input (left and right, or top and bottom). However, if it is odd, we need an unequal number of zeros on the left and right sides of the input (for horizontal padding) or on the top and bottom sides of the input (for vertical padding). Here is how TensorFlow calculates the required padding on each side:
padding along height: Ph = max((output height - 1) × Sh + Fh - H, 0)
padding along width: Pw = max((output width - 1) × Sw + Fw - W, 0)
padding top: Pt = floor(Ph / 2)
padding left: Pl = floor(Pw / 2)
padding bottom: Pb = Ph - Pt
padding right: Pr = Pw - Pl
Similarly, in the 'VALID' padding scheme, in which we do not add any zero padding to the input, the size of the output would be
output height = ceil((H - Fh + 1) / Sh)
output width = ceil((W - Fw + 1) / Sw)
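These formulas can be mirrored in plain Python as follows (a sketch of the computation TensorFlow performs, not its actual implementation; the function and variable names are mine):

```python
import math

def same_output_and_padding(W, H, Fw, Fh, Sw, Sh):
    """Output size and per-side zero padding under the 'SAME' scheme."""
    out_h = math.ceil(H / Sh)
    out_w = math.ceil(W / Sw)
    ph = max((out_h - 1) * Sh + Fh - H, 0)   # total padding along height
    pw = max((out_w - 1) * Sw + Fw - W, 0)   # total padding along width
    top, left = ph // 2, pw // 2             # floor(Ph/2), floor(Pw/2)
    bottom, right = ph - top, pw - left      # the odd extra pixel goes here
    return (out_w, out_h), (top, bottom, left, right)

def valid_output(W, H, Fw, Fh, Sw, Sh):
    """Output size under the 'VALID' scheme (no zero padding)."""
    return math.ceil((W - Fw + 1) / Sw), math.ceil((H - Fh + 1) / Sh)

print(same_output_and_padding(10, 10, 3, 3, 2, 2))  # ((5, 5), (0, 1, 0, 1))
print(valid_output(10, 10, 3, 3, 2, 2))             # (4, 4)
```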
Let's get back to the convolutional layer. A convolutional layer does exactly the same thing: applying a filter to an input in a convolutional manner. Like fully-connected layers, a convolutional layer has weights, which form its kernel (filter), and a bias. But in contrast to fully-connected layers, in convolutional layers each pixel (or neuron) of the output is connected to the input pixels (neurons) locally, instead of being connected to all input pixels (neurons). Hence, we use the term receptive field for the size of the convolutional layer's filter.
The bias in a convolutional layer is a single scalar value that is added to the output of the convolutional layer's filter at every single pixel. What we have talked about so far was, in fact, a convolutional layer with 1 input and 1 output channel (also known as depth) and a zero bias. Generally, a convolutional layer can have multiple input channels (each a 2D matrix) and multiple output channels (again, each a 2D matrix). Maybe the most tangible example of a multi-channel input is a color image, which has 3 RGB channels. Let's feed it to a convolutional layer with 3 input channels and 1 output channel. How is it going to calculate the output? A short answer is that it has 3 filters (one for each input channel) instead of one. It calculates the convolution of each filter with its corresponding input channel (the first filter with the first channel, the second filter with the second channel, and so on). The stride of all channels is the same, so they output matrices of the same size. Then it sums up all the matrices and outputs a single matrix, which is the only channel at the output of the convolutional layer.
What about when the convolutional layer has more than one output channel? In that case, the layer has a different multi-channel filter (whose number of channels equals the number of input channels) for each output. For example, assume we have a layer with three input channels (RGB) and five output channels. This layer would have 5 filters, with 3 channels per filter. It uses each filter (3 channels) to compute the corresponding output channel from the input channels. In other words, it uses the first 3-channel filter to calculate the first channel of the output, and so on. Note that each output channel has its own bias. Therefore, the number of biases in each convolutional layer is equal to the number of output channels. A sketch of this multi-channel computation is shown below.
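Here is a minimal NumPy sketch of such a layer (stride 1, no padding, and the kernels are assumed to be already flipped, as earlier in the post; the function name and shape conventions are my choices):

```python
import numpy as np

def conv_layer_forward(x, filters, biases):
    """Forward pass of a conv layer (stride 1, no padding).

    x       : input volume,  shape (d_i, H, W)
    filters : d_o multi-channel filters, shape (d_o, d_i, Fh, Fw)
    biases  : one scalar per output channel, shape (d_o,)
    """
    d_o, d_i, Fh, Fw = filters.shape
    _, H, W = x.shape
    out = np.zeros((d_o, H - Fh + 1, W - Fw + 1))
    for o in range(d_o):                 # one multi-channel filter per output
        for i in range(d_i):             # convolve each input channel...
            for r in range(H - Fh + 1):
                for c in range(W - Fw + 1):
                    out[o, r, c] += np.sum(
                        x[i, r:r + Fh, c:c + Fw] * filters[o, i])
        out[o] += biases[o]              # ...sum the results, add the bias
    return out

# RGB input (3 channels, 8x8) into a layer with 5 output channels
x = np.random.rand(3, 8, 8)
w = np.random.rand(5, 3, 3, 3)
b = np.random.rand(5)
print(conv_layer_forward(x, w, b).shape)  # (5, 6, 6)
```

Counting the weights and biases in such a layer gives the following formula: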
number of parameters = (Fw × Fh × di + 1) × do
where di and do are the depths (number of channels) of the input and the output, respectively. Note that the 1 inside the parentheses counts the biases.
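As a worked example, the layer described above with 3 input channels, 5 output channels, and, say, 3×3 filters has (3 × 3 × 3 + 1) × 5 = 140 parameters, 5 of which are biases.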
3.
Convolutional Neural Networks (CNN, or ConvNets)
Convolutional neural networks allow computers to see; in other words, ConvNets are used to recognize images by transforming the original image through layers into class scores. CNNs were inspired by the visual cortex: every time we see something, a series of layers of neurons gets activated, and each layer detects a set of features such as lines and edges. Higher layers detect more complex features in order to recognize what we saw.
Input (the training data):
A part of the image is connected to the next conv layer, because if all the pixels of the input were connected to the conv layer, it would be too computationally expensive. So we apply dot products between a receptive field and a filter across all the dimensions. The outcome of this operation is a single number of the output volume (feature map). Then we slide the filter over the next receptive field of the same input image by a stride and compute the next dot product, and so on.
Parameter sharing (shared weights): We assume that if a feature is useful, it will also be useful to look for it everywhere in the image. However, sharing the same weights is sometimes counterproductive. For example, with training data that contains centered faces, we don't have to look for eyes at the bottom or the top of the picture.
Dilation is a new hyperparameter introduced to the conv layer. A dilated filter is a filter with spaces between its cells. For example, take a one-dimensional filter w of size 3 and an input x: with no dilation, the filter computes w[0]·x[0] + w[1]·x[1] + w[2]·x[2]; with a dilation of 1, it computes w[0]·x[0] + w[1]·x[2] + w[2]·x[4], skipping one input cell between filter taps.
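A minimal 1-D sketch of this (the function name is mine; here dilation = 0 means an ordinary convolution, matching the formulas above, and the kernel is assumed pre-flipped):

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """1-D convolution with `dilation` gaps between filter taps."""
    step = dilation + 1
    span = (len(w) - 1) * step + 1          # input span covered by the filter
    return np.array([sum(w[k] * x[i + k * step] for k in range(len(w)))
                     for i in range(len(x) - span + 1)])

x = np.arange(8.0)
w = np.array([1.0, 1.0, 1.0])
print(dilated_conv1d(x, w, 0))  # w[0]x[i] + w[1]x[i+1] + w[2]x[i+2]
print(dilated_conv1d(x, w, 1))  # w[0]x[i] + w[1]x[i+2] + w[2]x[i+4]
```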
POOL layer:
The pool layer performs a function to reduce the spatial dimensions of the input, and with them the computational complexity of our model; it also controls overfitting. It operates independently on every depth slice of the input. There are different pooling functions, such as max pooling, average pooling, or L2-norm pooling. However, max pooling is the most used type of pooling; it keeps only the most important part (the value of the brightest pixel) of each region of the input volume.
Fully-Connected Layer (FC):
Fully connected layers connect every neuron in one layer to every neuron in another layer. The last fully-connected layer uses a softmax activation function for classifying the generated features of the input image into various classes based on the training dataset.
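For reference, here is a minimal NumPy version of the softmax that this last layer applies (a sketch, not a specific library's implementation):

```python
import numpy as np

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    e = np.exp(scores - np.max(scores))  # subtract max for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # highest score -> highest probability
```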
The pool layer doesn't have parameters (the weights and biases of neurons) and uses no zero padding, but it has two hyperparameters: the filter size (F) and the stride (S). More generally, given an input of size W1×H1×D1, the pooling layer produces a volume of size W2×H2×D2, where W2 = (W1 - F)/S + 1, H2 = (H1 - F)/S + 1, and D2 = D1.
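As a worked example (the sizes are illustrative, since the exact numbers depend on the network in question): an input of 64×64×16 pooled with F = 2 and S = 2 gives W2 = (64 - 2)/2 + 1 = 32, H2 = 32, and D2 = 16, i.e., an output volume of 32×32×16.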