Now, what machine learning algorithm is used for computer vision? That would be the Convolutional Neural Network, or CNN. CNNs are designed to process images and can be found either as independent models or as components of other neural networks. They can work as image pre-processors for other models, such as multimodal language models. Just as neural networks were designed to mimic the functionality of the brain, CNNs are designed to mimic the human visual processing system and the brain’s visual cortex.
2.1: Convolutional Layer
Convolutional neural networks are structured as a series of layers, like other neural networks. The first layers are called convolutional layers. A convolutional layer applies a mathematical calculation to a section of an image to extract information about the features in that section. Recall that a feature is an attribute of interest, such as an ear, the number of legs, the presence of a color, etc. The layer does this using an object called a kernel. A kernel works like a filter, amplifying the effect of some pixels and minimizing the effect of others to draw out a feature. For example, the kernel in the image below is attempting to extract an “ear” feature from a section of the input image.
The result of the calculation is then placed into a new, usually smaller, matrix that ultimately becomes a representation of where the features we are searching for exist within the image. We call this a feature map. For example, if there is an ear in the top left of an image, then an “ear” kernel applied to that image will produce a feature map with high values in the top left, where the ear was.
The convolutional layer’s calculation is a dot product, in which a filter (a.k.a. kernel) F is applied to an image in sections (the number of sections depends on the size of the image and the size of the filter), as seen below. We will use the image above for examples as we break down the equation.
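Written out in standard notation (consistent with the variable definitions that follow), the calculation is:

$$Z(i, j) = \sum_{c} \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} I(i + m,\ j + n,\ c) \cdot F(m, n, c)$$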
The variables i and j denote where in the image we are currently looking. If we input (0, 0), the top left of the filter is placed at (0, 0); this is what is happening in the figure above. The filter is a k × k matrix (a 2 × 2 matrix in the figure), so when we place it on the image, we get a k × k “focus zone” in which we look for the feature.
The variables m and n within the summations denote the coordinates of pixels in both the image I and the filter F. We iterate through them, multiplying the pixel value at each coordinate in the image by the filter value that “overlaps” it. You can see this in the figure above, where we multiply 3 by 2: the 3 comes from the image, and the 2 comes from the overlapping spot in the filter. We repeat this for every pixel in the “focus zone” and sum the products.
The variable c designates the channel of the image and filter that the calculation is currently focusing on. Colored images have 3 channels: one storing the R (red) pixel value, one storing the G (green) pixel value, and one storing the B (blue) pixel value. We apply a filter to each of these channels separately, which allows us to detect features through the dimension of color as well (e.g. a “red ear”). The filters applied to each channel can be the same or different, depending on the architecture of the CNN. There are also other encodings for colored images, such as HSV or CMYK. Black-and-white images can be encoded in just one channel.
Finally, Z(i, j) returns the value that we will store in the feature map described previously.
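To make the computation concrete, here is a minimal NumPy sketch of the formula above (the function name and loop structure are our own; real frameworks implement this far more efficiently):

```python
import numpy as np

def convolve(image, kernel, stride=1):
    # image: (H, W, C) array; kernel: (k, k, C) array.
    # Returns the feature map Z from the formula above.
    H, W, C = image.shape
    k = kernel.shape[0]
    out_h = (H - k) // stride + 1
    out_w = (W - k) // stride + 1
    fmap = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The k x k "focus zone" across all channels.
            zone = image[i*stride : i*stride + k,
                         j*stride : j*stride + k, :]
            # Multiply overlapping values and sum them: the dot product.
            fmap[i, j] = np.sum(zone * kernel)
    return fmap
```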
Along with kernels, two other factors affect how the feature map Z is created: stride and padding. Stride denotes how many pixels the filter is shifted by during the calculation. With a stride of 1, the kernel shifts by a single pixel, so the sections of the image it is placed on overlap. With a stride equal to the width of the kernel, the sections do not overlap at all. In the above image, the stride is 2. Observe more examples of strides in Figures 3 and 4 below.
Padding, on the other hand, refers to extra pixels added around the edges of the input image. These extra pixels, usually given a value of 0, control the size of the outputted feature map. With padding, we can manipulate the size of the feature map in several ways, including keeping it the same dimensions as the input matrix while still applying the kernel as a filter.
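For reference, for an input of size n along one dimension, a k × k kernel, padding p, and stride s, the output size along that dimension follows the standard relationship:

$$n_{\text{out}} = \left\lfloor \frac{n_{\text{in}} - k + 2p}{s} \right\rfloor + 1$$

For example, with s = 1 and p = (k − 1)/2 (for odd k), the feature map keeps the same dimensions as the input.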
2.2: Nonlinearity in CNNs
CNNs, like all neural networks, employ nonlinearity to model complex patterns in their data that cannot be captured by linear transformations alone. Nonlinearity is vital for neural networks: without it, all of a network’s calculations would collapse into the equivalent of a single linear operation. Non-linear functions break this linearity, enabling the model to approximate complex, non-linear decision boundaries, which is essential for tasks like image recognition, object detection, and natural language processing. In CNNs, nonlinearity is introduced after the application of the kernel, using activation functions. Applying non-linearity after convolutional layers helps the network learn hierarchical features of increasing complexity: early layers learn simple features like edges and textures, while deeper layers combine these to detect more abstract features such as objects or shapes.
CNNs commonly use the Rectified Linear Unit or ReLU activation function. ReLU, as seen in the formula below, is quick, efficient, and requires little computation power.
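$$\text{ReLU}(x) = \max(0,\ x)$$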
ReLU leaves all positive values unchanged and changes all negative values to 0. This method prevents certain neurons from activating, making the model more efficient and less prone to overfitting. However, stopping neurons from activating also has disadvantages, such as certain features or neural connections ‘dying out’. This means some neurons will never learn and advance, since their gradients are reset to 0 using this activation function. In order to address these disadvantages, models sometimes use LeakyReLU. LeakyReLU uses the formula:
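$$\text{LeakyReLU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{otherwise} \end{cases}$$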
where α denotes a small positive constant, usually 0.01. In this activation function, negative values are scaled down rather than reset to zero. Neurons can still activate, but their negative outputs are suppressed and have less effect on the final conclusion.
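Both activations are one-liners in NumPy (a quick sketch; deep learning libraries provide their own implementations):

```python
import numpy as np

def relu(x):
    # Negative values are reset to 0; positive values pass through unchanged.
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    # Negative values are scaled by alpha instead of zeroed out,
    # so the corresponding neurons are suppressed rather than "dying out".
    return np.where(x > 0, x, alpha * x)
```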
2.3: Pooling Layer
Now, after the convolutional layer, the features in the image have been amplified, and complex calculations have been used to extract details from it. Once those calculations are complete, another layer manipulates the image further, summarizing all the little details brought out by the convolutional layer. This is called the pooling layer. The pooling layer simplifies the feature map outputted by the convolutional layer while retaining the significant features. This reduces the number of values passed on to the next layer and the amount of computation necessary, making the model more efficient. Any further operations the model performs are done on the ‘pooled’ matrix, with features summarized and simplified through the use of a 2-dimensional filter. For a feature map with dimensions h × w × c, the dimensions of the map after pooling are as seen below.
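$$\left( \left\lfloor \frac{h - f}{s} \right\rfloor + 1 \right) \times \left( \left\lfloor \frac{w - f}{s} \right\rfloor + 1 \right) \times c$$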
Note that f is the size of the filter used and s denotes the length of the stride used.
A common technique used for the pooling layer is max pooling. This operation takes the maximum value in a given section of the feature map and selects that number to represent the section in the summarized map, as seen in the figure below.
If we designate Z as the feature map that the max pooling operation is being performed on, we get the following formula for the pooled value Z_pool(i, j), assuming a section size of 2 × 2.
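$$Z_{\text{pool}}(i, j) = \max_{m,\, n \,\in\, \{0,\, 1\}} Z(2i + m,\ 2j + n)$$

(The indexing here assumes a stride of 2, so the 2 × 2 sections do not overlap.)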
Average pooling, another pooling operation, uses a similar methodology. Rather than taking the maximum value, however, average pooling represents each section in the pooled matrix by its average value. Its Z_pool formula for a 2 × 2 section is as follows:
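$$Z_{\text{pool}}(i, j) = \frac{1}{4} \sum_{m=0}^{1} \sum_{n=0}^{1} Z(2i + m,\ 2j + n)$$

Both operations reduce to a few lines of NumPy. The sketch below assumes the feature map’s height and width are even, and the function name is our own:

```python
import numpy as np

def pool2x2(fmap, mode="max"):
    # 2x2 pooling with stride 2 on a (H, W) feature map.
    H, W = fmap.shape
    # Split the map into disjoint 2x2 blocks.
    blocks = fmap.reshape(H // 2, 2, W // 2, 2)
    if mode == "max":
        return blocks.max(axis=(1, 3))   # max pooling
    return blocks.mean(axis=(1, 3))      # average pooling
```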
While pooling layers are extremely beneficial in making models more efficient, they have a few disadvantages. As you can see in both the max and average pooling operations, pooling discards most of the values in each section, causing significant information loss. This loss can produce excess ‘smoothing’ of the image, where finer details are lost.
2.4: Fully Connected Layer
The dense or fully connected layer is the last layer of a Convolutional Neural Network. Typically, CNNs employ several convolutional and pooling layers before the dense layer in order to extract and identify all the necessary features before drawing a conclusion. Before the output of the last convolutional or pooling layer can be passed to the dense layer, it is flattened into a one-dimensional vector. These dense layers are simply neural networks, and in cases where the CNN is integrated into another network, they are the neural network processing the features extracted by the CNN. The fully connected layers (or the other model, in that case) perform regression or classification operations on the given input to draw a conclusion from the data. For a one-dimensional input vector x, a weight matrix W, and a vector b of each neuron’s bias term, the formula for the vector of outputs z is:
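$$z = Wx + b$$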
The dense layer also typically uses an activation function when performing multi-class classification. This activation function takes the logits from the previous layer and converts them into probabilities between 0 and 1. The softmax activation function, seen below, is typically used for this operation.
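$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$$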
In the formula, z_i denotes the raw logit calculated by the dense layers for class i. After applying the softmax operation to the raw logits (which can hold any numerical value), we get a “probability distribution” over all classes. We then select the class with the highest probability and say that the image given to the CNN belongs to that class.
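Putting the last two steps together in NumPy (the layer sizes and input values here are made up for illustration):

```python
import numpy as np

def dense(x, W, b):
    # z = Wx + b: one fully connected layer over the flattened features.
    return W @ x + b

def softmax(z):
    # Subtracting max(z) keeps the exponentials numerically stable
    # without changing the resulting probabilities.
    e = np.exp(z - np.max(z))
    return e / e.sum()

x = np.array([0.5, -1.2, 3.0, 0.1])      # hypothetical flattened features
W = np.random.randn(10, 4)               # 10 classes, 4 input features
b = np.zeros(10)
probs = softmax(dense(x, W, b))
predicted_class = int(np.argmax(probs))  # class with the highest probability
```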
2.5: Loss Functions, Optimizers, and Regularization
Like all neural networks, CNNs also use loss functions, optimizers, and regularization to improve their accuracy. They use many of the functions common to neural networks, such as Cross-Entropy Loss and Mean Squared Error for loss functions, gradient descent for optimization, and dropout for regularization. See the figure below for an overview of the entire architecture of a Convolutional Neural Network.
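As a rough illustration, the pieces described in this section assemble into something like the following PyTorch sketch. The layer sizes and hyperparameters are our own choices, not prescribed by the text, and the model expects 32 × 32 RGB inputs:

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolutional layer
    nn.ReLU(),                                   # nonlinearity
    nn.MaxPool2d(2),                             # pooling layer
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),                                # flatten for the dense layer
    nn.Dropout(0.5),                             # regularization
    nn.Linear(32 * 8 * 8, 10),                   # fully connected layer
)

loss_fn = nn.CrossEntropyLoss()   # combines softmax with the loss computation
optimizer = optim.SGD(model.parameters(), lr=0.01)  # gradient descent
```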