GSoC Chronicles — Mightier than SSD

Pulkit Mishra
9 min read · Aug 25, 2020


This week the work mainly focused on training and deploying the object detection model for images and videos. This blog documents how exactly that was implemented and what I learned during this phase. First, we have a gentle introduction to the metrics used to evaluate object detection methods. After that, I have experimented with a different way of documenting my learnings at this stage and have listed them in the form of comparative studies. Finally, I explain why YOLOv3 was picked for Poor Man’s Rekognition and how it has been deployed.

YOLO Harold!


Before we get into comparing one-stage detectors with two-stage detectors, and YOLO with YOLOv2 and YOLOv3, let’s first define the metric of comparison that we are going to use: mAP.

To do that, we first need to talk about Intersection over Union (IoU). It is defined as the ratio of the area of intersection to the area of union of two boxes, one of which represents the ground truth and the other the prediction made by the model, as shown on the left.

It is worth highlighting that IoU always lies between 0 and 1, inclusive.
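The definition above can be sketched in a few lines. This is a minimal illustration, assuming boxes are given as `(x1, y1, x2, y2)` corner coordinates; that box format and the function name `iou` are my own choices for the example, not something taken from PMR.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; clamp at 0 when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


ground_truth = (0, 0, 10, 10)
prediction = (5, 0, 15, 10)
# Overlap area is 50 and union area is 150, so IoU is one third.
print(iou(ground_truth, prediction))
```

Identical boxes give 1 and disjoint boxes give 0, which is where the bounds come from.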

Next, we need to talk about Precision and Recall. Let’s say that we have 20 images containing 120 cats in total, and the model makes 100 cat detections. Out of these 100 detections, 70 are correct and the remaining 30 are incorrect. Whether a prediction counts as correct is determined by setting a threshold on the IoU value for that prediction.

Precision: ratio of True Positives to Total Predicted Positives
Recall: ratio of True Positives to Total Positive Ground Truths

Correct Cat Predictions = 70

Total Cat Predictions = 100

Total Cats = 120

Precision = 70/100 = 7/10 & Recall = 70/120 = 7/12
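As a quick check of the arithmetic, here is a small sketch using the numbers from the example above. The function name `precision_recall` is just a label for this illustration.

```python
from fractions import Fraction


def precision_recall(true_positives, total_predictions, total_ground_truths):
    """Return (precision, recall) as exact fractions."""
    precision = Fraction(true_positives, total_predictions)
    recall = Fraction(true_positives, total_ground_truths)
    return precision, recall


precision, recall = precision_recall(70, 100, 120)
print(precision, recall)  # 7/10 7/12
```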

AP is Average Precision, defined as the area under the Precision-Recall curve, so it combines both Precision and Recall. The mean of the APs over all the classes is called the mean Average Precision (mAP).
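To make "area under the Precision-Recall curve" concrete, here is a rough sketch. It assumes each prediction has already been marked as a true or false positive (via the IoU threshold) and uses one common interpolation scheme, where precision is made non-increasing before summing the area. Benchmarks differ in the exact interpolation they use, so treat this as one possible variant and not as the official definition of any particular challenge.

```python
def average_precision(scored_predictions, num_ground_truths):
    """scored_predictions: list of (confidence, is_true_positive) for one class."""
    ranked = sorted(scored_predictions, key=lambda p: p[0], reverse=True)
    true_positives = 0
    precisions, recalls = [], []
    for rank, (_, is_true_positive) in enumerate(ranked, start=1):
        if is_true_positive:
            true_positives += 1
        precisions.append(true_positives / rank)
        recalls.append(true_positives / num_ground_truths)
    # Make precision non-increasing from left to right (the usual envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum rectangle areas under the curve wherever recall increases.
    area, previous_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        area += (recall - previous_recall) * precision
        previous_recall = recall
    return area


def mean_average_precision(ap_per_class):
    """ap_per_class: dict mapping class name -> AP."""
    return sum(ap_per_class.values()) / len(ap_per_class)


cat_ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_ground_truths=2)
print(cat_ap)
```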

One-Stage Detector vs Two-Stage Detector

There are two common approaches to object detection: one-shot (one-stage) and two-shot (two-stage) detection. The two-shot approach, as is clear from the name, has two stages: first the model proposes a set of regions of interest, using selective search or a region proposal network, and then those regions are classified and their location predictions refined. Single-shot detection skips the region proposal stage and runs detection directly over a dense sampling of possible locations, yielding the final localization and content prediction at once. Models belonging to the R-CNN family are all region-based and follow the two-stage approach, whereas the Single Shot MultiBox Detector (SSD) and YOLO are the popular single-shot approaches.

Worth highlighting is R-FCN (Region-Based Fully Convolutional Networks), another popular two-shot meta-architecture inspired by Faster-RCNN. It is a sort of hybrid between the single-shot and two-shot approaches. Here, a Region Proposal Network (RPN) proposes candidate RoIs (regions of interest), which are then applied on score maps. All learnable layers are convolutional and computed on the entire image, so almost all of the computation is shared across the proposed regions. The per-RoI computational cost is negligible compared with Fast-RCNN.

Making the correct tradeoff between speed and accuracy for a target use-case is an important decision for any deep learning model that is supposed to be deployed to production. While two-shot detection models achieve better accuracy, single-shot detection sits in the sweet spot between accuracy and speed/resources. A one-shot detector trains faster and has swifter inference than a two-shot detector. Faster training allows the researcher to prototype and experiment efficiently without running up considerable cloud computing expenses. More importantly, fast inference is typically a requirement for real-time applications. And since it involves less computation, a one-shot detector also consumes much less energy per prediction.

YOLO across generations

As mentioned above, there are object detectors that require several passes over the image or have a two-stage pipeline. YOLO, on the other hand, only needs to look at the image once to detect all the objects, hence the name: You Only Look Once.

NOTE: YOLOv4 and YOLOv5 are not covered due to the ongoing controversy around these.


YOLO divides the input image into an S×S grid. In the original implementation, YOLO chooses S=7. YOLO solves a classification and localization problem for each of the 7x7=49 grid cells simultaneously, with each grid cell detecting one object. The maximum number of objects that can be detected in an image is thus capped at 49. Not only that, but YOLO also struggles with objects that are close together, since no more than one object can be detected in each of these cells. These problems can be eased by predicting B boxes per cell instead of 1, with IoU serving as the confidence score of each of these boxes. The original implementation had B=2, so a total of 98 bounding boxes were predicted, and of course, the bounding boxes with low confidence were discarded. Another problem is that the same object can be detected in multiple grid cells, and this, in particular, can be solved by non-maximum suppression.
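The non-maximum suppression step mentioned above can be sketched as a greedy loop: keep the most confident box, drop every remaining box that overlaps it too much, and repeat. The `(x1, y1, x2, y2)` box format and the 0.5 IoU threshold are assumptions made for this illustration, and real pipelines usually run this separately for each class.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.5):
    """detections: list of (box, confidence). Returns the detections that survive."""
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)  # most confident box still in play
        kept.append(best)
        # Discard boxes that overlap the kept box at or above the threshold.
        remaining = [d for d in remaining if iou(best[0], d[0]) < iou_threshold]
    return kept


detections = [
    ((0, 0, 10, 10), 0.9),    # best box for the first object
    ((1, 1, 11, 11), 0.8),    # near-duplicate of the first box
    ((50, 50, 60, 60), 0.7),  # a different object
]
print(non_max_suppression(detections))
```

In this toy input the second box overlaps the first with an IoU of about 0.68, so it is suppressed and two detections remain.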

Thus, YOLO returns conditional probabilities of the 20 classes it was trained on for each grid cell. Only the probability from the bounding box with a higher confidence score is returned as the other one gets discarded anyway.

Along with the conditional probabilities, YOLO also returns (x, y, h, w) for each predicted bounding box. Here x, y are the coordinates of the center of the bounding box relative to the grid cell, and h, w are the height and width relative to the whole image.

Thus, the predictions are encoded as an S × S × (B × 5 + Classes) tensor. For S=7, B=2, and Classes=20 this will give us a 7x7x30 tensor.
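The shape arithmetic is easy to verify. In this small sketch the factor of 5 is the four box coordinates plus one confidence score per box; the function names are only illustrative.

```python
def yolo_v1_output_shape(grid_size=7, boxes_per_cell=2, num_classes=20):
    """Shape of the YOLOv1 prediction tensor: S x S x (B * 5 + Classes)."""
    values_per_box = 5  # x, y, w, h and one confidence score
    depth = boxes_per_cell * values_per_box + num_classes
    return (grid_size, grid_size, depth)


def total_boxes(grid_size=7, boxes_per_cell=2):
    """Number of bounding boxes predicted before low-confidence ones are dropped."""
    return grid_size * grid_size * boxes_per_cell


print(yolo_v1_output_shape())  # (7, 7, 30)
print(total_boxes())           # 98
```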

The original YOLO network is inspired by GoogLeNet. However, the YOLO VGG-16 variant, which uses VGG-16 instead, is also popular, along with Fast YOLO, which has only 9 convolutional layers instead of 24.


YOLOv2 seeks to deal with the issue of low recall in YOLOv1 and also reduces the localization error. This is achieved firstly by adding Batch Normalization to all of the convolutional layers. Secondly, in YOLOv1 the classification network was trained at 224x224 image resolution and the detection network at 448x448. Thus, when switching from classification to detection, YOLO had to simultaneously learn object detection and adjust to the new input resolution. In YOLOv2, before switching to detection, the network is first fine-tuned on 448x448 images for 10 epochs, giving it time to adjust its filters to work better on higher resolution input. Further, YOLOv2 borrows the idea of Anchor Boxes from Faster R-CNN to predict k bounding boxes instead of just 1. These anchor boxes are defined with respect to each of the grid cells, and their dimensions are obtained using k-means clustering with IoU as the defining metric. YOLOv2 uses 5 anchor boxes for each grid cell as a healthy tradeoff between model complexity and high recall.
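Here is a rough sketch of what clustering with IoU as the metric can look like. It treats every ground-truth box as a `(width, height)` pair anchored at a common corner and uses `1 - IoU` as the distance, so the nearest centroid is the one with the highest IoU. The random initialisation, the mean-based centroid update, the toy box sizes, and the function names are assumptions for this example and are not taken from the Darknet code.

```python
import random


def wh_iou(wh_a, wh_b):
    """IoU of two boxes aligned at the same corner, given as (width, height)."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union


def kmeans_anchors(box_sizes, k, iterations=50, seed=0):
    """Cluster (width, height) pairs into k anchors using distance = 1 - IoU."""
    rng = random.Random(seed)
    centroids = rng.sample(box_sizes, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for wh in box_sizes:
            # Smallest (1 - IoU) distance is the same as largest IoU.
            nearest = max(range(k), key=lambda i: wh_iou(wh, centroids[i]))
            clusters[nearest].append(wh)
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if not cluster:  # keep an empty cluster's centroid unchanged
                new_centroids.append(centroids[i])
                continue
            mean_w = sum(w for w, _ in cluster) / len(cluster)
            mean_h = sum(h for _, h in cluster) / len(cluster)
            new_centroids.append((mean_w, mean_h))
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sorted(centroids)


# Toy data: three small squarish boxes and three large wide boxes.
sizes = [(10, 10), (11, 9), (9, 11), (100, 50), (98, 52), (102, 48)]
print(kmeans_anchors(sizes, k=2))
```

Using IoU instead of Euclidean distance keeps large boxes from dominating the clustering error simply because their raw dimensions are bigger.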

YOLOv2 also shifts the backbone of the model to Darknet-19 which is to be used as the classifier. It has 19 convolutional layers, 5 max-pooling layers and the output shape is 13 x 13 x (k x (1+4+20)) where k is the number of anchor boxes and 20 is the number of classes. For k=5 the output shape is 13x13x125.


YOLO9000 seeks to increase the number of detected classes from 20 to 9000 by jointly optimizing detection and classification. In YOLOv2 the model learned classification and detection separately, as the datasets for the two were different. In YOLO9000, however, images from both datasets are mixed and are marked as either detection or classification. When the network sees an image labeled for detection, we can backpropagate based on the full YOLOv2 loss function. When it sees a classification image, we only backpropagate loss from the classification-specific parts of the architecture. This approach has a flaw: detection datasets tend to be smaller and shallower (less variety) than classification datasets. Thus there is a need to merge these datasets using the concept of WordTree.

1000 classes are taken from ImageNet and 369 intermediate nodes are added to them, as shown on the left, making the output layer of Darknet-19 grow to 1369.

For these 1369 predictions, instead of computing one softmax, a separate softmax is computed over all synsets that are hyponyms of the same concept. This results in a tree of probabilities. The tree is traversed from the top down, taking the highest confidence path at every split until a node with probability < threshold is reached, and that node is predicted.
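The traversal can be sketched on a made-up miniature hierarchy. Everything here is hypothetical: the three-level tree, the raw scores, and the 0.5 threshold are invented for illustration, and I interpret the stopping rule as "stop before descending into a child whose path probability would fall below the threshold". In the real model the probability at the root would come from the objectness prediction, which this sketch simply sets to 1.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical miniature WordTree: parent -> children (siblings share one softmax).
TREE = {
    "physical object": ["animal", "vehicle"],
    "animal": ["dog", "cat"],
    "dog": ["terrier", "hound"],
}


def predict(scores, root="physical object", threshold=0.5):
    """Walk down the tree along the most confident child at every split."""
    node, probability = root, 1.0
    while node in TREE:
        children = TREE[node]
        conditionals = softmax([scores[child] for child in children])
        best = max(range(len(children)), key=lambda i: conditionals[i])
        # Path probability is the product of conditionals along the path.
        next_probability = probability * conditionals[best]
        if next_probability < threshold:
            break
        node, probability = children[best], next_probability
    return node, probability


raw_scores = {"animal": 3.0, "vehicle": 0.0, "dog": 2.0, "cat": 0.0,
              "terrier": 0.1, "hound": 0.0}
print(predict(raw_scores))
```

With these scores the model is confident it sees a dog but cannot tell the breed, so the prediction stops at "dog" instead of guessing between "terrier" and "hound".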


YOLOv3 makes a few incremental improvements on YOLOv2, such as a better feature extractor, Darknet-53, with shortcut connections, and a better object detector with feature map upsampling and concatenation. The bounding box predictor is mostly the same as in YOLOv2, with (tx, ty, tw, th) getting predicted. An objectness score is also predicted using logistic regression. Its target is 1 if the bounding box prior overlaps a ground truth object by more than any other bounding box prior. For class prediction, independent logistic classifiers with binary cross-entropy loss are used instead of softmax, because there may be overlapping labels in multilabel classification, such as in the Open Images Dataset. The performance of YOLO had so far been extremely poor on small objects, and YOLOv3 seeks to improve that by introducing shortcut connections.
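A minimal sketch of how (tx, ty, tw, th) turn into an actual box, following the decoding used in the YOLOv2 and YOLOv3 papers: the centre offsets go through a sigmoid and are added to the grid cell's top-left corner, while the width and height scale the anchor (prior) dimensions exponentially. The parameter names are my own, and the units (grid cells) are an assumption of this example. The second function shows why independent sigmoids suit multilabel data: the class scores do not have to sum to 1.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(tx, ty, tw, th, cell_x, cell_y, prior_w, prior_h):
    """Turn raw network outputs into a box centre and size, in grid-cell units."""
    bx = sigmoid(tx) + cell_x      # centre x stays inside the predicting cell
    by = sigmoid(ty) + cell_y      # centre y stays inside the predicting cell
    bw = prior_w * math.exp(tw)    # width as a scaling of the anchor width
    bh = prior_h * math.exp(th)    # height as a scaling of the anchor height
    return bx, by, bw, bh


def class_probabilities(logits):
    """Independent logistic classifiers: one sigmoid per class, no softmax."""
    return [sigmoid(z) for z in logits]


print(decode_box(0.0, 0.0, 0.0, 0.0, cell_x=3, cell_y=4, prior_w=2.0, prior_h=5.0))
print(class_probabilities([2.0, 2.0, -3.0]))
```

In the second call two classes both get a high score, which is exactly what overlapping labels such as "woman" and "person" need.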

YOLOv3 uses a new and much deeper network called Darknet-53 for feature extraction. The new network is a hybrid between the network used in YOLOv2 (Darknet-19) and a residual network, so it has some shortcut connections.

YOLOv3 predicts boxes at 3 different scales instead of predicting them at the final layer only, as in the earlier versions. Features are extracted from these scales as in a Feature Pyramid Network. Several convolutional layers are added to the base feature extractor Darknet-53. The last of these layers predicts the bounding box, objectness, and class predictions.
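To get a feel for the three scales, here is a small sketch of the output tensor shapes. The YOLOv3 paper predicts 3 boxes per scale with 4 box offsets, 1 objectness score, and the class scores for each box. The 416x416 input size, the 80 COCO classes, and the strides of 32, 16, and 8 are assumptions reflecting a common configuration and are not something fixed by PMR.

```python
def yolo_v3_output_shapes(input_size=416, num_classes=80,
                          boxes_per_scale=3, strides=(32, 16, 8)):
    """Return the N x N x depth shape of the prediction tensor at each scale."""
    depth = boxes_per_scale * (4 + 1 + num_classes)  # box offsets + objectness + classes
    return [(input_size // stride, input_size // stride, depth) for stride in strides]


for shape in yolo_v3_output_shapes():
    print(shape)
# (13, 13, 255)
# (26, 26, 255)
# (52, 52, 255)
```

The finest 52x52 grid is the one that helps most with small objects.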

The feature map is then taken from 2 layers previous and is upsampled by 2. A feature map is also taken from earlier in the network and merged with the upsampled features using concatenation. This is the typical encoder-decoder architecture, the same idea used to evolve SSD into DSSD, and it allows us to get more meaningful semantic information from the upsampled features and finer-grained information from the earlier feature map.
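The upsample-and-concatenate step can be illustrated without any deep learning library. In this sketch feature maps are plain nested lists of shape height x width x channels, and the upsampling is nearest-neighbour, which is an assumption on my part; the function names are illustrative only.

```python
def upsample_2x(feature_map):
    """Nearest-neighbour 2x upsampling of an H x W x C nested list."""
    upsampled = []
    for row in feature_map:
        # Repeat every pixel twice along the width...
        wide_row = [list(pixel) for pixel in row for _ in range(2)]
        # ...and repeat the widened row twice along the height.
        upsampled.append(wide_row)
        upsampled.append([list(pixel) for pixel in wide_row])
    return upsampled


def concat_channels(map_a, map_b):
    """Concatenate two feature maps of equal height and width along the channel axis."""
    return [[pixel_a + pixel_b for pixel_a, pixel_b in zip(row_a, row_b)]
            for row_a, row_b in zip(map_a, map_b)]


coarse = [[[1]]]                      # 1 x 1 x 1, deep and semantically rich
fine = [[[10, 20], [11, 21]],
        [[12, 22], [13, 23]]]         # 2 x 2 x 2, from earlier in the network
merged = concat_channels(upsample_2x(coarse), fine)
print(merged)  # 2 x 2 x 3
```

Each output pixel now carries the coarse map's channel alongside the two fine-grained channels.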

Thus, it can be seen that YOLOv3 is much better than SSD and performs similarly to DSSD. YOLOv3 not only outperforms earlier versions of YOLO in detecting small objects but also beats two-stage Faster R-CNN variants. However, its performance on medium and large objects is clearly not as good.

Implementing YOLOv3 in PMR

  1. Django REST API for object detection created (PR #101)
  2. Object Detection from videos (PR #104)




