A convolutional neural network, also know as CNN or ConvNet, is a feed-forward neural network that is generally used to analyze visual images by processing data with grid-like topology. A CNN is used to detect and classify objects in an image. In CNNs, every image is represented in the form of an array of pixel values.
The convoluion operation forms the basis of any CNN. In a convolution operation, the arrays are multiplied element-wise, and the dot product is summed to create a new array, which represents Wx
.
A Convolution neural network has multiple hidden layers that help in extracting information from an image. The four important layers in CNN are:
- Convolution layer
- ReLU layer
- Pooling layer
- Fully connected layer (also called Dense layer)
Convolution layer
This is the first step in the process of extracting valuable features from an image. A convolution layer has several filters that perform the convolution operation. Every image is considered as a matrix of pixel values.
Consider a black and white image of 5x5 size whose pixel values are either 0 or 1 and also a filter matrix with a dimension of 3x3. Next, slide the filter matrix over the image and compute the dot product to get the convolved feature matrix.
ReLU layer
Once the feature maps are extracted, the next step is to move them to a ReLU layer. ReLU (Rectified Linear Unit) is an activation function which performs an element-wise operation and sets all the negative pixels to 0. It introduces non-linearity to the network, and the generated output is a rectified feature map. The relu function is: f(x) = max(0,x)
.
Pooling layer
Pooling is a down-sampling operation that reduces the dimensionality of the feature map. The rectified feature map goes through a pooling layer to generate a pooled feature map.
Imagine a rectified feature map of size 4x4 goes through a max pooling filter of 2x2 size with stride of 2. In this case, the resultant pooled feature map will have a pooled feature map of 2x2 size where each value will represent the maximum value of each stride.
The pooling layer uses various filters to identify different parts of the image like edges, shapes etc.
Fully Connected layer
The next step in the process is called flattening. Flattening is used to convert all the resultant 2D arrays from pooled feature maps into a single linear vector. This flattened vector is then fed as input to the fully connected layer to classify the image.
Convolutional Neural Networks in a nutshell
- The pixels from the image are fed to the convolutional layer that performs the convolution operation
- It results in a convolved map
- The convolved map is applied to a ReLU function to generate a rectified feature map
- The image is processed with multiple convolutions and ReLU layers for locating the features
- Different pooling layers with various filters are used to identify specific parts of the image
- The pooled feature map is flattened and fed to a fully connected layer to get the final output
Links:
- Learn CNN in the browser
Add notes from the video (PRs are welcome)
- convolutional neural networks (CNN) consist of different types of layers
- convolutional layer as filters (i.e. 5 x 5)
- dense layers
- sliding the filter across the cells of the entire image
- calculating similarity scores of the different positions of the filter
- creating the feature map, one feature map per filter
- chaining layers of simple and complex filters allows the CNN to "learn"
- resulting in vector representation of image
- activation functions sigmoid for binary classification and softmax for multiclass
The notes are written by the community. If you see an error here, please create a PR with a fix. |