To build a single-stage model, we can apply the same approach we saw for the object localization task, except that here we need to detect multiple objects, whereas there we had only one object per image.
Now this brings a big issue with it: since you don’t know beforehand how many objects you need to detect, you don’t know how many neurons will be present in your output layer. It will be different for different images at test time! This would be fine if we knew how many boxes we need to predict, but in object detection we don’t know how many objects we will have to find in the test image.
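To make the issue concrete, here is a minimal sketch of the single-object localization head in PyTorch (the 512-dim backbone features and the 20-class dataset are assumptions for illustration, not from any specific model). The point is that the output layer’s size is baked in when the model is built:

```python
# Minimal sketch: the single-object localization head has a FIXED output size.
import torch
import torch.nn as nn

NUM_CLASSES = 20  # assumed; depends on your dataset

class SingleObjectHead(nn.Module):
    def __init__(self, in_features=512):
        super().__init__()
        # 4 box coordinates + class scores -> a fixed number of neurons
        self.fc = nn.Linear(in_features, 4 + NUM_CLASSES)

    def forward(self, feats):
        out = self.fc(feats)
        box, cls_logits = out[:, :4], out[:, 4:]
        return box, cls_logits

feats = torch.randn(1, 512)          # stand-in for backbone features
box, cls_logits = SingleObjectHead()(feats)
print(box.shape, cls_logits.shape)   # torch.Size([1, 4]) torch.Size([1, 20])
# For K objects we would need K * (4 + NUM_CLASSES) outputs,
# and K changes from image to image -- hence the problem above.
```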
To address this problem, several solutions were proposed:
- Fix the number of bounding boxes
  Researchers said: we will predict a set number of bounding boxes, e.g., 100 bounding boxes for each image, irrespective of how many objects are actually present in the image. But obviously this is not a good approach, because what if the test image has 150 objects? We would miss tagging 50 of them (a rough sketch of such a fixed-size head follows after this list). The models in this family include:
  - YOLO-V1
  - Single Shot Detection (SSD)
    We will not go in depth into this because it is outdated, but if you still want to understand it, you can refer to this tutorial.
  - YOLO-V2
  - YOLO-V3
  - YOLO-V4
  - YOLO-V5
  - ….
  - YOLO-V10
- Detection Transformers (DETR) by Facebook
Now, instead of the CNNs used in all the above architectures, people have also started using Transformers. The model architecture is shown below. I am still trying to understand and implement this architecture… You can refer to the paper here, but I prefer this video tutorial and this code implementation tutorial here. A very rough sketch of the core idea follows below.
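As promised above, here is a rough sketch of the “fix the number of bounding boxes” idea in a YOLO-V1 flavour. The grid size, boxes per cell, and class count below are illustrative assumptions, not the exact published configuration:

```python
# Sketch of a fixed-box detection head: every image gets the same number of
# predicted boxes, no matter how many objects it actually contains.
import torch
import torch.nn as nn

S, B, NUM_CLASSES = 7, 2, 20  # assumed: 7x7 grid, 2 boxes per cell, 20 classes

class FixedBoxHead(nn.Module):
    def __init__(self, in_features=4096):
        super().__init__()
        # Per box: x, y, w, h, objectness; per cell: NUM_CLASSES class scores.
        self.fc = nn.Linear(in_features, S * S * (B * 5 + NUM_CLASSES))

    def forward(self, feats):
        return self.fc(feats).view(-1, S, S, B * 5 + NUM_CLASSES)

feats = torch.randn(1, 4096)   # stand-in for backbone features
pred = FixedBoxHead()(feats)
print(pred.shape)              # torch.Size([1, 7, 7, 30])
# 7 * 7 * 2 = 98 boxes regardless of the image, which is exactly why an
# image with more objects than that cannot be fully tagged.
```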
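And for DETR, since I am still working through it myself, here is only a very rough sketch of the core idea as I currently understand it from the paper: a CNN backbone produces image features, a transformer decodes a fixed set of learned object queries, and small heads map each query to a class (including a “no object” class) and a box. The ResNet-50 backbone and 100 queries match the paper, but the remaining sizes are assumptions, and I omit positional encodings, the Hungarian matching loss, and other details:

```python
# Very rough DETR-style sketch (not the reference implementation).
import torch
import torch.nn as nn
import torchvision

NUM_QUERIES, NUM_CLASSES, D = 100, 20, 256  # 20 classes is an assumption

class MiniDETR(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        # Keep only the convolutional feature extractor (drop avgpool and fc).
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.proj = nn.Conv2d(2048, D, kernel_size=1)    # 2048 -> D channels
        self.transformer = nn.Transformer(d_model=D, batch_first=True)
        self.queries = nn.Embedding(NUM_QUERIES, D)      # learned object queries
        self.cls_head = nn.Linear(D, NUM_CLASSES + 1)    # +1 for "no object"
        self.box_head = nn.Linear(D, 4)                  # (cx, cy, w, h) in [0, 1]

    def forward(self, images):
        f = self.proj(self.backbone(images))             # (N, D, H, W)
        src = f.flatten(2).transpose(1, 2)               # (N, H*W, D) token sequence
        tgt = self.queries.weight.unsqueeze(0).expand(images.size(0), -1, -1)
        h = self.transformer(src, tgt)                   # (N, NUM_QUERIES, D)
        return self.cls_head(h), self.box_head(h).sigmoid()

logits, boxes = MiniDETR()(torch.randn(1, 3, 224, 224))
print(logits.shape, boxes.shape)  # torch.Size([1, 100, 21]) torch.Size([1, 100, 4])
```

Note that DETR also predicts a fixed set of boxes (one per query); the difference is that the “no object” class lets the model say which of the 100 slots are empty.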