12 Papers You Should Read to Understand Object Detection in the Deep Learning Era | by Ethan Yanjia Li | Aug, 2020

Ethan Yanjia Li
Free picture from Unsplash. Photography from Joanna Kosinska and edited by myself.

As the second article in the “Papers You Should Read” series, we are going to walk through both the history and some recent developments in a more difficult area of computer vision research: object detection. Before the deep learning era, hand-crafted features like HOG and feature pyramids were used pervasively to capture localization signals in an image. However, those methods usually didn’t extend well to generic object detection, so most applications were limited to face or pedestrian detection. With the power of deep learning, we can train a network to learn which features to capture, as well as what coordinates to predict for an object. This eventually led to a boom of applications based on visual perception, such as commercial face recognition systems and autonomous vehicles. In this article, I picked 12 must-read papers for newcomers who want to study object detection. Although the most challenging part of building an object detection system hides in the implementation details, reading these papers can still give you a good high-level understanding of where the ideas come from and how object detection might evolve in the future.

As a prerequisite for reading this article, you need to know the basic idea of the convolutional neural network and common optimization methods such as gradient descent with back-propagation. It’s also highly recommended to read my previous article “10 Papers You Should Read to Understand Image Classification in the Deep Learning Era” first, because many cool ideas in object detection originate from more fundamental image classification research.

OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks

From “OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks”

Inspired by the early success of AlexNet in the 2012 ImageNet competition, where CNN-based feature extraction defeated all hand-crafted feature extractors, OverFeat quickly introduced CNNs back into the object detection area as well. The idea is very straightforward: if we can classify one image using a CNN, what about greedily sliding windows of different sizes over the whole image, and trying to regress and classify them one by one using a CNN? This leverages the power of CNNs for feature extraction and classification, and also bypasses the hard region proposal problem by using pre-defined sliding windows. Also, since neighboring windows overlap, the convolution results for the overlapping area can be shared instead of recomputed, which reduces the cost a lot. OverFeat is a pioneer of the one-stage object detector. It tried to combine feature extraction, location regression, and region classification in the same CNN. Unfortunately, such a one-stage approach also suffers from relatively poor accuracy because it uses less prior knowledge. Thus, OverFeat failed to start a wave of one-stage detector research until a much more elegant solution came out two years later.

Region-based Convolutional Networks for Accurate Object Detection and Segmentation

Also proposed in 2013, R-CNN arrived a bit later than OverFeat. However, this region-based approach eventually led to a big wave of object detection research with its two-stage framework, i.e., a region proposal stage followed by a region classification and refinement stage.

From “Region-based Convolutional Networks for Accurate Object Detection and Segmentation”

In the above diagram, R-CNN first extracts potential regions of interest from an input image by using a technique called selective search. Selective search doesn’t really try to understand the foreground object. Instead, it groups similar pixels by relying on a heuristic: similar pixels usually belong to the same object. Therefore, the results of selective search have a very high probability of containing something meaningful. Next, R-CNN warps these region proposals into fixed-size images with some padding, and feeds these images into the second stage of the network for more fine-grained recognition. Unlike older methods built on selective search, R-CNN replaced HOG with a CNN to extract features from all region proposals in its second stage. One caveat of this approach is that many region proposals don’t really contain a full object, so R-CNN needs to learn not only to classify the right classes, but also to reject the negative ones. To solve this problem, R-CNN treated all region proposals with ≥ 0.5 IoU overlap with a ground-truth box as positives, and the rest as negatives.
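To make the IoU-based labeling rule concrete, here is a minimal Python sketch. It assumes boxes are `(x1, y1, x2, y2)` tuples, and the helper names `iou` and `label_proposals` are my own for illustration, not taken from the R-CNN code.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_proposals(proposals, gt_boxes, pos_thresh=0.5):
    """Label a proposal 1 (positive) if its best IoU with any
    ground-truth box reaches pos_thresh, otherwise 0 (negative)."""
    labels = []
    for p in proposals:
        best = max((iou(p, g) for g in gt_boxes), default=0.0)
        labels.append(1 if best >= pos_thresh else 0)
    return labels


gt = [(10, 10, 50, 50)]
proposals = [(12, 12, 52, 52), (40, 40, 90, 90), (100, 100, 120, 120)]
print(label_proposals(proposals, gt))  # [1, 0, 0]
```

Only the first proposal overlaps the ground truth enough to count as a positive; the other two become negatives that the classifier has to learn to reject.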

Region proposals from selective search depend heavily on the similarity assumption, so they can only provide a rough estimate of location. To further improve localization accuracy, R-CNN borrowed an idea from “Deep Neural Networks for Object Detection” (aka DetectorNet), and introduced an additional bounding box regression to predict the center coordinates, width, and height of a box. This regressor is widely used in later object detectors.
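The bounding box regressor is usually trained on normalized offsets rather than raw coordinates. Below is a small sketch of the parameterization described in the R-CNN paper (center offsets scaled by the proposal size, log-scaled width and height). The function names `encode` and `decode` are my own, and boxes are assumed to be `(x1, y1, x2, y2)` tuples.

```python
import math


def to_cxcywh(box):
    """Convert (x1, y1, x2, y2) to (center_x, center_y, width, height)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0, x2 - x1, y2 - y1)


def encode(proposal, gt):
    """Regression targets (tx, ty, tw, th) that move `proposal` onto `gt`."""
    px, py, pw, ph = to_cxcywh(proposal)
    gx, gy, gw, gh = to_cxcywh(gt)
    return ((gx - px) / pw, (gy - py) / ph,
            math.log(gw / pw), math.log(gh / ph))


def decode(proposal, t):
    """Apply predicted offsets t = (tx, ty, tw, th) to a proposal box."""
    px, py, pw, ph = to_cxcywh(proposal)
    cx, cy = px + t[0] * pw, py + t[1] * ph
    w, h = pw * math.exp(t[2]), ph * math.exp(t[3])
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


proposal, gt = (10, 10, 50, 60), (14, 8, 58, 70)
targets = encode(proposal, gt)
print(targets)
print(decode(proposal, targets))  # recovers gt up to floating point error
```

Normalizing by the proposal size keeps the targets in a similar numeric range for small and large boxes, which is part of why this regressor transfers so easily to later detectors.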

However, a two-stage detector like R-CNN suffers from two big issues: 1) The pipeline is not fully convolutional, and it is not end-to-end (E2E) trainable because selective search is a separate, hand-crafted step. 2) The region proposal stage is usually very slow compared with one-stage detectors like OverFeat, and running the CNN on each region proposal separately makes it even slower. Later, we will see how R-CNN evolved over time to address these two issues.

Fast R-CNN

From “Fast R-CNN”

A quick follow-up to R-CNN was to reduce the duplicate convolutions over multiple region proposals. Since these region proposals all come from one image, it’s natural to improve R-CNN by running the CNN over the entire image once and sharing the computation among many region proposals. However, different region proposals have different sizes, which also results in different output feature map sizes if we are using the same CNN feature extractor. These feature maps of various sizes prevent us from using fully connected (FC) layers for further classification and regression, because an FC layer only works with a fixed-size input.

Fortunately, a paper called “Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition” has already solved the dynamic scale issue for FC layers. In SPPNet, a spatial pyramid pooling layer is introduced between the convolution layers and the FC layers to create a bag-of-words style feature vector. This vector has a fixed size and encodes features from different scales, so our convolution layers can now take images of any size as input without worrying about the incompatibility of the FC layer. Inspired by this, Fast R-CNN proposed a similar layer called the ROI Pooling layer. This pooling layer downsamples feature maps of different sizes into a fixed-size vector. By doing so, we can now use the same FC layers for classification and box regression, no matter how large or small the ROI is.
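Here is a toy sketch of the ROI pooling idea in plain Python: whatever the ROI size, the output is always an `out_h × out_w` grid of max values. It is a simplification under my own assumptions (a single-channel feature map stored as a list of lists, integer ROI coordinates with exclusive `x2`/`y2`), not the exact Fast R-CNN implementation.

```python
import math


def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool the region roi = (x1, y1, x2, y2) of a 2D feature map
    into a fixed out_h x out_w grid. x2 and y2 are exclusive."""
    x1, y1, x2, y2 = roi
    roi_h, roi_w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_h):
        # Row range of the i-th bin (bins may overlap by one row).
        ys = y1 + math.floor(i * roi_h / out_h)
        ye = y1 + math.ceil((i + 1) * roi_h / out_h)
        row = []
        for j in range(out_w):
            xs = x1 + math.floor(j * roi_w / out_w)
            xe = x1 + math.ceil((j + 1) * roi_w / out_w)
            row.append(max(feature_map[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled


# A 4x6 feature map whose value at (y, x) is y * 6 + x.
fm = [[y * 6 + x for x in range(6)] for y in range(4)]
print(roi_max_pool(fm, (0, 0, 6, 4)))  # [[8, 11], [20, 23]]
print(roi_max_pool(fm, (1, 1, 4, 4)))  # [[14, 15], [20, 21]]
```

Both ROIs have different sizes, yet both produce a 2×2 output, which is exactly what the FC layers need.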

With a shared feature extractor and the scale-invariant ROI pooling layer, Fast R-CNN can reach a similar localization accuracy with 10~20x faster training and 100~200x faster inference. The near real-time inference and an easier E2E training protocol for the detection part make Fast R-CNN a popular choice in the industry as well.

From “You Only Look Once: Unified, Real-Time Object Detection”

This dense prediction over the entire image can be computationally expensive, so YOLO took the bottleneck structure from GoogLeNet to avoid this issue. Another problem of YOLO is that two objects might fall into the same coarse grid cell, so it doesn’t work well with groups of small objects such as a flock of birds. Despite its lower accuracy, YOLO’s straightforward design and real-time inference ability made one-stage object detection popular again in research, and also a go-to solution for the industry.

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

As we introduced above, in early 2015, Ross Girshick proposed an improved version of R-CNN called Fast R-CNN by using a shared feature extractor for proposed regions. Just a few months later, Ross and his team came back with another improvement. This new network, Faster R-CNN, is not only faster than previous versions but also marks a milestone for object detection with deep learning.

From “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”

With Fast R-CNN, the only non-convolutional piece of the network is the selective search region proposal. As of 2015, researchers started to realize that deep neural networks are so magical that they can learn almost anything given enough data. So, is it possible to also train a neural network to propose regions, instead of relying on a heuristic, hand-crafted approach like selective search? Faster R-CNN followed this line of thinking and successfully created the Region Proposal Network (RPN). Simply put, an RPN is a CNN that takes an image as input and outputs a set of rectangular object proposals, each with an objectness score. The paper originally used VGG, but other backbone networks such as ResNet became more widespread later. To generate region proposals, a 3×3 sliding window is applied over the CNN feature map to generate 2 scores (foreground and background) and 4 coordinates for every candidate box at each location. In practice, this sliding window is implemented with a 3×3 convolution followed by 1×1 convolutions.

Although the sliding window has a fixed size, our objects may appear at different scales. Therefore, Faster R-CNN introduced a technique called anchor boxes. Anchor boxes are pre-defined prior boxes with different aspect ratios and sizes that share the same central location. In Faster R-CNN, there are k=9 anchors for each sliding window location, covering 3 aspect ratios at each of 3 scales. These repeated anchor boxes over different scales bring nice translation-invariance and scale-invariance to the network while sharing the outputs of the same feature map. Note that the bounding box regression is computed relative to these anchor boxes instead of the whole image.
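As an illustration, the following sketch generates the k=9 anchors around one sliding window location. The default scales (128, 256, 512 pixels) and aspect ratios (1:2, 1:1, 2:1) are the ones reported in the Faster R-CNN paper; the function name `make_anchors` and the `(x1, y1, x2, y2)` box format are my own choices.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) anchors centered at (cx, cy).
    Each anchor has area scale**2 and aspect ratio r = height / width."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(300, 200)
print(len(anchors))  # 9
for a in anchors[:3]:
    print([round(v, 1) for v in a])
```

The same 9 shapes are repeated at every feature map location, so the RPN only has to predict 2 scores and 4 offsets per anchor.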

From “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”

So far, we have discussed the new Region Proposal Network that replaces the old selective search region proposal. To make the final detection, Faster R-CNN uses the same detection head from Fast R-CNN to do classification and fine-grained localization. Do you remember that Fast R-CNN also uses a shared CNN feature extractor? Now that the RPN itself is also a feature extraction CNN, we can just share it with the detection head, as in the diagram above. This sharing design does bring some trouble though. If we train the RPN and the Fast R-CNN detector together, we will treat RPN proposals as a constant input of ROI pooling, and inevitably ignore the gradients of the RPN’s bounding box proposals. One workaround is called alternating training, where you train the RPN and Fast R-CNN in turns. Later, in the paper “Instance-aware semantic segmentation via multi-task network cascades”, we can see that the ROI pooling layer can also be made differentiable w.r.t. the proposed box coordinates.

You Only Look Once: Unified, Real-Time Object Detection

While the R-CNN series started a big hype over two-stage object detection in the research community, its complicated implementation brought many headaches for the engineers who maintain it. Does object detection need to be so cumbersome? If we are willing to sacrifice a bit of accuracy, can we trade it for much faster speed? With these questions in mind, Joseph Redmon submitted a network called YOLO only four days after Faster R-CNN’s submission, and finally brought popularity back to one-stage object detection two years after OverFeat’s debut.

From “You Only Look Once: Unified, Real-Time Object Detection”

Unlike R-CNN, YOLO decided to tackle region proposal and region classification together in the same CNN. In other words, it treats object detection as a regression problem, instead of a classification problem relying on region proposals. The general idea is to split the input into an S×S grid and have each cell directly regress the bounding box location and the confidence score if the object center falls into that cell. Because objects may have different sizes, there is more than one bounding box regressor per cell. During training, the regressor whose prediction has the highest IoU with the ground truth is made responsible for that object, so regressors at the same location learn to handle different scales over time. In the meantime, each cell also predicts C class probabilities, conditioned on the grid cell containing an object (high confidence score). This approach was later described as dense prediction because YOLO tries to predict classes and bounding boxes for all possible locations in an image. In contrast, R-CNN relies on region proposals to filter out background regions, hence its final predictions are much more sparse.
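The grid assignment can be sketched in a few lines. The defaults below (S=7, B=2 boxes per cell, C=20 classes) are the PASCAL VOC settings from the YOLO paper; the helper names `responsible_cell` and `output_shape` are hypothetical, and boxes are assumed to be `(x1, y1, x2, y2)` in pixels.

```python
def responsible_cell(box, img_w, img_h, S=7):
    """Return (row, col, x_offset, y_offset) of the grid cell that
    contains the center of `box`. Offsets are in cell units."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    gx, gy = cx / img_w * S, cy / img_h * S
    col = min(int(gx), S - 1)  # clamp centers on the right/bottom edge
    row = min(int(gy), S - 1)
    return row, col, gx - col, gy - row


def output_shape(S=7, B=2, C=20):
    """Each cell predicts B boxes (x, y, w, h, confidence) plus C class
    probabilities, so the output tensor is S x S x (B * 5 + C)."""
    return (S, S, B * 5 + C)


print(responsible_cell((200, 100, 248, 148), 448, 448))
print(output_shape())  # (7, 7, 30)
```

Every one of the S×S cells produces an output whether or not it contains an object, which is what makes the prediction dense.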

SSD: Single Shot MultiBox Detector

YOLO v1 demonstrated the potential of one-stage detection, but the performance gap with two-stage detection was still noticeable. In YOLO v1, multiple objects could be assigned to the same grid cell. This was a big challenge when detecting small objects, and became a critical problem to solve in order to bring a one-stage detector’s performance on par with two-stage detectors. SSD is such a challenger, and it attacks this problem from three angles.

From “SSD: Single Shot MultiBox Detector”

First, the anchor box technique from Faster R-CNN can alleviate this problem. Objects in the same area usually have different aspect ratios if they are all visible. Introducing anchor boxes not only increased the number of objects each cell can detect, but also helped the network better differentiate overlapping small objects based on this aspect ratio assumption.

From “SSD: Single Shot MultiBox Detector”

Second, SSD went further down this road by aggregating multi-scale features before detection. This is a very common approach to pick up fine-grained local features while preserving coarse global features in a CNN. For example, FCN, the pioneer of CNN semantic segmentation, also merged features from multiple levels to refine the segmentation boundary. Besides, multi-scale feature aggregation can be easily performed on all common classification networks, so it’s very convenient to swap out the backbone for another network.

Finally, SSD leveraged a large amount of data augmentation, especially targeted to small objects. For example, images are randomly expanded to a much larger size before random cropping, which brings a zoom-out effect to the training data to simulate small objects. Also, large bounding boxes are usually easy to learn. To avoid these easy examples dominating the loss, SSD adopted a hard negative mining technique to pick examples with the highest loss for each anchor box.

Feature Pyramid Networks for Object Detection

With the launch of Faster R-CNN, YOLO, and SSD in 2015, it seemed like the general structure of an object detector had been settled. Researchers started to look at improving each individual part of these networks. Feature Pyramid Networks (FPN) is an attempt to improve the detection head by using features from different layers to form a feature pyramid. This feature pyramid idea isn’t very novel in computer vision research. Back when features were still manually designed, the feature pyramid was already a very effective way to recognize patterns at different scales. Using a feature pyramid in deep learning is also not a new idea: SPPNet, FCN, and SSD all demonstrated the benefit of aggregating multiple-layer features before classification. However, how to share the feature pyramid between the RPN and the region-based detector was still to be determined.

From “Feature Pyramid Networks for Object Detection”

First, to rebuild the RPN with an FPN structure like the diagram above, we need to run region proposal on multiple scales of feature output. Also, we only need 3 anchors with different aspect ratios per location now, because objects of different sizes will be handled by different levels of the feature pyramid. Next, to use an FPN structure in the Fast R-CNN detector, we also need to adapt it to detect on multiple scales of feature maps. Since region proposals might have different scales too, we should assign each of them to the corresponding level of the FPN. In short, if Faster R-CNN is a pair of RPN and region-based detector running on one scale, FPN converts it into multiple parallel branches running on different scales and collects the final results from all branches in the end.
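The rule for sending a region proposal to the "corresponding level" can be written down directly. The sketch below follows the assignment formula given in the FPN paper, k = ⌊k0 + log2(√(wh) / 224)⌋ with k0 = 4, clamped to the available levels P2 to P5. The function name `fpn_level` and the clamping arguments are my own.

```python
import math


def fpn_level(w, h, k0=4, canonical=224, k_min=2, k_max=5):
    """Pyramid level for an ROI of size w x h (in input-image pixels).
    A 224 x 224 ROI maps to level k0; smaller ROIs go to finer levels."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical))
    return max(k_min, min(k_max, k))


for size in (32, 112, 224, 448, 1000):
    print(size, "-> P%d" % fpn_level(size, size))
```

Small proposals end up on high-resolution levels and large proposals on coarse ones, which is why each level only needs anchors of a single scale.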

YOLO9000: Better, Faster, Stronger

While Kaiming He, Ross Girshick, and their team kept improving their two-stage R-CNN detectors, Joseph Redmon was also busy improving his one-stage YOLO detector. The initial version of YOLO suffers from many shortcomings: predictions based on a coarse grid brought lower localization accuracy, and having only two scale-agnostic regressors per grid cell made it difficult to recognize small, densely packed objects. Fortunately, we saw so many great innovations in 2015 across many computer vision areas. YOLO v2 just needed to find a way to integrate them all to become better, faster, and stronger. Here are some highlights of the modifications:

  • YOLO v2 added Batch Normalization layers from a paper called “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”.
Anchor boxes. From “YOLO9000: Better, Faster, Stronger”
  • Just like SSD, YOLO v2 also introduced Faster R-CNN’s idea of anchor boxes for bounding box regression. But YOLO v2 did some customization for its anchor boxes. Instead of predicting unconstrained offsets to anchor boxes, YOLO v2 constrains the object center regression tx and ty to the responsible grid cell to stabilize early training. Also, anchor sizes are determined by K-means clustering on the target dataset to better align with object shapes.
  • A new backbone network called Darknet-19 is used for feature extraction. It is inspired by “Network in Network” and GoogLeNet’s bottleneck structure.
  • To improve the detection of small objects, YOLO v2 added a passthrough layer to merge features from an earlier layer. This part can be seen as a simplified version of SSD.
  • Last but not least, Joseph realized that input resolution is a silver bullet for small object detection. He not only doubled the input for the backbone to 448×448 from 224×224, but also invented a multi-scale training scheme, which uses different input resolutions at different periods of training.
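To show how the constrained center regression from the anchor box bullet works, here is a short sketch of the YOLO v2 box decoding as described in the paper: the center is the cell’s top-left corner plus a sigmoid of tx and ty, and the size is the anchor size scaled exponentially. All values are in grid-cell units, and the function name `decode_yolov2` is my own.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_yolov2(t, cell, anchor):
    """Decode raw outputs t = (tx, ty, tw, th) for one anchor.
    cell   = (cx, cy): top-left corner of the responsible grid cell
    anchor = (pw, ph): prior width and height of the anchor box"""
    tx, ty, tw, th = t
    cx, cy = cell
    pw, ph = anchor
    bx = cx + sigmoid(tx)   # the sigmoid keeps the center inside the cell
    by = cy + sigmoid(ty)
    bw = pw * math.exp(tw)
    bh = ph * math.exp(th)
    return bx, by, bw, bh


print(decode_yolov2((0.0, 0.0, 0.0, 0.0), cell=(3, 4), anchor=(1.5, 2.0)))
# (3.5, 4.5, 1.5, 2.0)
print(decode_yolov2((10.0, -10.0, 0.2, -0.2), cell=(3, 4), anchor=(1.5, 2.0)))
```

However extreme tx and ty get, the predicted center cannot leave its cell, which is what keeps early training from diverging.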

Note that YOLO v2 also experimented with a version trained on a hierarchical dataset of 9000 classes, which also represents an early trial of multi-label classification in an object detector.

Focal Loss for Dense Object Detection

To understand why one-stage detectors are usually not as good as two-stage detectors, RetinaNet investigated the foreground-background class imbalance caused by a one-stage detector’s dense predictions. Take YOLO as an example: it tries to predict classes and bounding boxes for all possible locations at the same time, so most of the outputs are matched to the negative class during training. SSD addressed this issue with online hard example mining. YOLO used an objectness score to implicitly train a foreground classifier in the early stage of training. The RetinaNet authors thought that neither approach got to the heart of the problem, so they invented a new loss function called Focal Loss to help the network learn what’s important.

From “Focal Loss for Dense Object Detection”

Focal Loss adds a power γ (they call it the focusing parameter) to the Cross-Entropy loss: the loss is scaled by (1 − p)^γ, where p is the predicted probability of the correct class. Naturally, as the confidence score becomes higher, the loss value becomes much lower than a normal Cross-Entropy loss. The α parameter is used to balance such a focusing effect.
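A quick numeric sketch makes the focusing effect visible. It implements the binary form of the loss, FL(p_t) = −α_t (1 − p_t)^γ log(p_t), with the defaults γ = 2 and α = 0.25 reported in the paper. The function names are my own and this is a scalar illustration, not a training-ready loss.

```python
import math


def cross_entropy(p, y):
    """Binary cross-entropy for predicted foreground probability p
    and label y (1 = foreground, 0 = background)."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Cross-entropy scaled by the modulating factor (1 - p_t)**gamma
    and the class-balancing weight alpha_t."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


for p in (0.1, 0.5, 0.9, 0.99):
    print(p, round(cross_entropy(p, 1), 4), round(focal_loss(p, 1), 6))
```

For a well-classified example (p = 0.9) the loss is scaled down by 0.25 × 0.1² = 0.0025, while a hard example (p = 0.1) keeps a factor of 0.25 × 0.9² ≈ 0.2, so the easy background anchors no longer dominate the gradient.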

From “Focal Loss for Dense Object Detection”

This idea is so simple that even a primary school student can understand it. So to further justify their work, the authors adapted the FPN model they had previously proposed and created a new one-stage detector called RetinaNet. It is composed of a ResNet backbone, an FPN detection neck to channel features at different scales, and two subnets for classification and box regression as the detection head. Similar to SSD and YOLO v3, RetinaNet uses anchor boxes to cover targets of various scales and aspect ratios.

As a bit of a digression, RetinaNet used the COCO accuracy of a ResNeXt-101, 800-input-resolution variant to contrast with YOLO v2, which only has a lightweight Darknet-19 backbone and a 448 input resolution. This insincerity shows the team’s emphasis on getting better benchmark results, rather than solving a practical issue like the speed-accuracy trade-off. And it might be part of the reason that RetinaNet didn’t take off after its release.

YOLOv3: An Incremental Improvement

YOLO v3 is the last version of the official YOLO series. Following YOLO v2’s tradition, YOLO v3 borrowed more ideas from previous research and became an incredibly powerful one-stage detector, a real monster. YOLO v3 balanced speed, accuracy, and implementation complexity pretty well. And it got really popular in the industry because of its fast speed and simple components. If you are interested, I wrote a very detailed explanation of how YOLO v3 works in my previous article “Dive Really Deep into YOLO v3: A Beginner’s Guide”.

From “Dive Really Deep into YOLO v3: A Beginner’s Guide”
From “Dive Really Deep into YOLO v3: A Beginner’s Guide”

Simply put, YOLO v3’s success comes from its more powerful backbone feature extractor and a RetinaNet-like detection head with an FPN neck. The new backbone network Darknet-53 leveraged ResNet’s skip connections to achieve an accuracy that’s on par with ResNet-50 while being much faster. Also, YOLO v3 ditched v2’s passthrough layer and fully embraced FPN’s multi-scale prediction design. With that, YOLO v3 finally reversed people’s impression of its poor performance on small objects.

Besides, there are a few fun facts about YOLO v3. It dissed the COCO mAP 0.5:0.95 metric, and also demonstrated the uselessness of Focal Loss when using a conditioned dense prediction. The author Joseph even decided to quit computer vision research altogether a year later, because of his concerns about military usage.

Objects as Points

Although the image classification area has become less active recently, object detection research is still far from mature. In 2018, a paper called “CornerNet: Detecting Objects as Paired Keypoints” provided a new perspective for detector training. Since preparing anchor box targets is quite a cumbersome job, is it really necessary to use them as a prior? This new trend of ditching anchor boxes is called “anchor-free” object detection.

From “Stacked Hourglass Networks for Human Pose Estimation”

Inspired by the use of heat-map in the Hourglass network for human pose estimation, CornerNet uses a heat-map generated by box corners to supervise the bounding box regression. To learn more about how heat-map is used in Hourglass Network, you can read my previous article “Human Pose Estimation with Stacked Hourglass Network and TensorFlow”.

From “Objects as Points”

Objects As Points, aka CenterNet, took a step further. It uses heat-map peaks to represent object centers, and the network will regress the box width and height directly from these box centers. Essentially, CenterNet is using every pixel as grid cells. With a Gaussian distributed heat-map, the training is also easier to converge compared with previous attempts which tried to regress bounding box size directly.
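Finding object centers then boils down to picking local maxima in the heat-map. The sketch below keeps every location whose value is at least as large as all of its 8 neighbors, which mirrors the 3×3 max-pooling trick used for peak extraction in CenterNet. The heat-map is a plain list of lists, and the `threshold` value and function name `find_peaks` are assumptions of mine for illustration.

```python
def find_peaks(heatmap, threshold=0.3):
    """Return (x, y, score) for every heat-map location that passes the
    threshold and is a local maximum within its 3x3 neighborhood."""
    h, w = len(heatmap), len(heatmap[0])
    peaks = []
    for y in range(h):
        for x in range(w):
            v = heatmap[y][x]
            if v < threshold:
                continue
            # Neighborhood includes the location itself, clipped at borders.
            neighborhood = [heatmap[ny][nx]
                            for ny in range(max(0, y - 1), min(h, y + 2))
                            for nx in range(max(0, x - 1), min(w, x + 2))]
            if v >= max(neighborhood):
                peaks.append((x, y, v))
    return peaks


heatmap = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.2, 0.0],
    [0.1, 0.2, 0.1, 0.4],
    [0.0, 0.0, 0.4, 0.8],
]
print(find_peaks(heatmap))  # [(1, 1, 0.9), (3, 3, 0.8)]
```

Each surviving peak is one detected object center; the width and height would be read from separate regression maps at the same location.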

The elimination of anchor boxes also has another useful side effect. Previously, we relied on the IoU (such as > 0.7) between an anchor box and the ground-truth box to assign training targets. By doing so, a few neighboring anchors may all get assigned a positive target for the same object, and the network will learn to predict multiple positive boxes for the same object too. The common way to fix this issue is to use a technique called Non-maximum Suppression (NMS). It’s a greedy algorithm that filters out boxes that are too close together. Now that anchors are gone and we only have one peak per object in the heat-map, there’s no need to use NMS any more. Since NMS is sometimes hard to implement and slow to run, getting rid of NMS is a big benefit for applications that run in various environments with limited resources.
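For reference, the greedy NMS that anchor-based detectors depend on looks roughly like this. It is a plain-Python sketch with boxes as `(x1, y1, x2, y2)` tuples; the 0.5 IoU threshold is only a commonly used default, and real implementations are vectorized and usually run per class.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring box and drop
    every remaining box that overlaps it by more than iou_thresh."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The pairwise IoU loop is the part that gets slow with thousands of candidates, which is why dropping NMS is attractive on constrained devices.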

EfficientDet: Scalable and Efficient Object Detection

From “EfficientDet: Scalable and Efficient Object Detection”

At the recent CVPR’20, EfficientDet showed us some more exciting developments in the object detection area. The FPN structure has proven to be a powerful technique for improving a detection network’s performance on objects at different scales. Famous detection networks such as RetinaNet and YOLO v3 all adopted an FPN neck before box regression and classification. Later, NAS-FPN and PANet (please refer to the Read More section) both demonstrated that a plain multi-layer FPN structure may benefit from more design optimization. EfficientDet continued exploring in this direction and eventually created a new neck called BiFPN. Basically, BiFPN features additional cross-layer connections to encourage feature aggregation back and forth. To justify the efficiency part of the network, BiFPN also removed some less useful connections from the original PANet design. Another innovative improvement over the FPN structure is weighted feature fusion: BiFPN adds learnable weights to feature aggregation so that the network can learn the importance of different branches.
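The weighted fusion is easy to sketch. The version below follows what the EfficientDet paper calls fast normalized fusion: each weight is passed through a ReLU and divided by the sum of all weights plus a small ε. Features are flattened into plain Python lists here, and the function name is my own, so treat it as an illustration of the arithmetic rather than a real BiFPN layer.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-shaped feature vectors with learnable scalar weights:
    out = sum_i(relu(w_i) * f_i) / (eps + sum_j relu(w_j))."""
    w = [max(0.0, x) for x in weights]  # ReLU keeps the weights non-negative
    total = eps + sum(w)
    length = len(features[0])
    return [sum(w[i] * features[i][k] for i in range(len(features))) / total
            for k in range(length)]


top_down = [1.0, 2.0, 3.0]
lateral = [3.0, 4.0, 5.0]
print(fast_normalized_fusion([top_down, lateral], [1.0, 1.0]))   # close to the mean
print(fast_normalized_fusion([top_down, lateral], [1.0, -5.0]))  # lateral branch ignored
```

Compared with a softmax over the weights, this normalization avoids the exponentials, which fits the efficiency theme of the paper.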

From “EfficientDet: Scalable and Efficient Object Detection”

Moreover, just like what we saw in the image classification network EfficientNet, EfficientDet also introduced a principled way to scale an object detection network. The φ parameter in the above formula controls both width (channels) and depth (layers) of both BiFPN neck and detection head.

From “EfficientDet: Scalable and Efficient Object Detection”

This new parameter results in 8 different variants of EfficientDet, from D0 to D7. The lightweight D0 variant can achieve accuracy similar to YOLO v3 with far fewer FLOPs. The heavyweight D7 variant, with a monstrous 1536×1536 input, can even reach 53.7 AP on COCO, dwarfing all other contenders.

Read More

From R-CNN and YOLO to the recent CenterNet and EfficientDet, we have witnessed most of the major innovations in object detection research in the deep learning era. Aside from the above papers, I’ve also provided a list of additional papers for you to keep reading to get a deeper understanding. They either provide a different perspective on object detection or extend this area with more powerful features.

2009: DPM

Object Detection with Discriminatively Trained Part Based Models

By matching many HOG features for each deformable part, DPM was one of the most efficient object detection models before the deep learning era. Take pedestrian detection as an example: it uses a star structure to recognize the general person pattern first, then recognizes parts with different sub-filters and calculates an overall score. Even today, the idea of recognizing objects with deformable parts is still popular after we switched from HOG features to CNN features.

2012: Selective Search

Selective Search for Object Recognition

Like DPM, Selective Search is also not a product of the deep learning era. However, this method combined many classical computer vision approaches, and was also used in the early R-CNN detector. The core idea of selective search is inspired by semantic segmentation, where pixels are grouped by similarity. Selective Search uses different similarity criteria, such as color space and SIFT-based texture, to iteratively merge similar areas together. These merged areas serve as foreground predictions and are followed by an SVM classifier for object recognition.

2016: R-FCN

R-FCN: Object Detection via Region-based Fully Convolutional Networks

Faster R-CNN finally combined RPN and ROI feature extraction and improved the speed a lot. However, for each region proposal, we still need fully connected layers to compute the class and bounding box separately. If we have 300 ROIs, we need to repeat this 300 times, and this is also the origin of the major speed difference between one-stage and two-stage detectors. R-FCN borrowed the idea from FCN for semantic segmentation, but instead of computing a class mask, R-FCN computes position-sensitive score maps. These maps predict the probability of the object appearing at each location, and all locations vote (by averaging) to decide the final class and bounding box. Besides, R-FCN also used atrous convolution in its ResNet backbone, which was originally proposed in the DeepLab semantic segmentation network. To understand what atrous convolution is, please see my previous article “Witnessing the Progression in Semantic Segmentation: DeepLab Series from V1 to V3+”.

2017: Soft-NMS

Improving Object Detection With One Line of Code

Non-maximum suppression (NMS) is widely used in anchor-based object detection networks to reduce duplicate positive proposals that are close to each other. More specifically, NMS iteratively eliminates candidate boxes if they have a high IoU with a more confident candidate box. This could lead to some unexpected behavior when two objects of the same class are indeed very close to each other. Soft-NMS makes a small change: instead of removing the overlapped candidate boxes, it only scales down their confidence scores with a parameter. This scaling parameter gives us more control when tuning the localization performance, and also leads to better precision when a high recall is also needed.
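Here is a minimal sketch of the Gaussian variant of Soft-NMS, where an overlapped box’s score is multiplied by exp(−IoU² / σ) instead of being discarded. The value σ = 0.5 and the final score cut-off of 0.001 are typical settings I am assuming here, and the function names are my own.

```python
import math


def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS. Returns (index, rescored confidence) pairs in
    the order the boxes were selected."""
    scores = list(scores)  # work on a copy
    remaining = list(range(len(boxes)))
    keep = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        if scores[best] < score_thresh:
            break
        keep.append((best, scores[best]))
        for i in remaining:
            # Decay the score smoothly with the overlap instead of dropping.
            scores[i] *= math.exp(-iou(boxes[best], boxes[i]) ** 2 / sigma)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
for idx, s in soft_nms(boxes, scores):
    print(idx, round(s, 3))
```

The heavily overlapped second box survives with a reduced score instead of vanishing, so a genuinely separate neighbor of the same class still has a chance to be reported.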

2017: Cascade R-CNN

Cascade R-CNN: Delving into High Quality Object Detection

While FPN explored how to design a better R-CNN neck to use backbone features, Cascade R-CNN investigated a redesign of the R-CNN classification and regression head. The underlying assumption is simple yet insightful: the higher the IoU criterion we use when preparing positive targets, the fewer false positive predictions the network will learn to make. However, we can’t simply increase this IoU threshold from the commonly used 0.5 to a more aggressive 0.7, because it could also lead to even more overwhelming negative examples during training. Cascade R-CNN’s solution is to chain multiple detection heads together, each relying on the bounding box proposals from the previous detection head. Only the first detection head uses the original RPN proposals. This effectively simulates an increasing IoU threshold for later heads.

2017: Mask R-CNN

Mask R-CNN

Mask R-CNN is not a typical object detection network. It was designed to solve a challenging instance segmentation task, i.e., creating a mask for each object in the scene. However, Mask R-CNN showed a great extension to the Faster R-CNN framework, and in turn inspired object detection research. The main idea is to add a binary mask prediction branch after ROI pooling, alongside the existing bounding box and classification branches. Besides, to address the quantization error of the original ROI Pooling layer, Mask R-CNN also proposed a new ROI Align layer that uses bilinear image resampling under the hood. Unsurprisingly, both multi-task training (segmentation + detection) and the new ROI Align layer contribute to some improvement on the bounding box benchmark.
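The bilinear resampling at the heart of ROI Align can be sketched as follows. Instead of snapping an ROI bin to integer feature map cells, the layer reads values at fractional coordinates by interpolating the four surrounding cells. This is only the sampling step, on a single-channel list-of-lists feature map, with a function name of my own; a full ROI Align would average several such samples per output bin.

```python
import math


def bilinear_sample(feature_map, x, y):
    """Read the feature map at a fractional location (x, y) by blending
    the four nearest cells according to their distance."""
    h, w = len(feature_map), len(feature_map[0])
    x = min(max(x, 0.0), w - 1.0)  # clamp to the valid range
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    top = feature_map[y0][x0] * (1 - dx) + feature_map[y0][x1] * dx
    bottom = feature_map[y1][x0] * (1 - dx) + feature_map[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


fm = [[0.0, 10.0],
      [20.0, 30.0]]
print(bilinear_sample(fm, 0.5, 0.5))    # 15.0
print(bilinear_sample(fm, 0.25, 0.0))   # 2.5
```

Because no coordinate is rounded, small shifts of the ROI produce small changes in the pooled features, which matters a lot for pixel-level masks.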

2018: PANet

Path Aggregation Network for Instance Segmentation

Instance segmentation has a close relationship with object detection, so a new instance segmentation network can often benefit object detection research indirectly. PANet aims at boosting information flow in the FPN neck of Mask R-CNN by adding an additional bottom-up path after the original top-down path. To visualize this change, we have a ↑↓ structure in the original FPN neck, and PANet makes it more like a ↑↓↑ structure before pooling features from multiple layers. Also, instead of having separate pooling for each feature layer, PANet added an “adaptive feature pooling” layer after Mask R-CNN’s ROIAlign to merge (element-wise max or sum) multi-scale features.

2019: NAS-FPN

NAS-FPN: Learning Scalable Feature Pyramid Architecture for Object Detection

PANet’s success in adapting the FPN structure drew attention from a group of NAS researchers. They used a reinforcement learning method similar to the one behind the image classification network NASNet, and focused on searching for the best combination of merging cells. Here, a merging cell is the basic building block of an FPN that merges any two input feature layers into one output feature layer. The final results proved the idea that FPN could use further optimization, but the complex computer-searched structure made it too difficult for humans to understand.

Object detection is still an active research area. Although the general landscape of this field is well shaped by two-stage detectors like R-CNN and one-stage detectors such as YOLO, our best detectors are still far from saturating the benchmark metrics, and they also miss many targets against complicated backgrounds. At the same time, anchor-free detectors like CenterNet showed us a promising future where object detection networks can become as simple as image classification networks. Other directions of object detection, such as few-shot recognition and NAS, are still at an early stage, and we will see how they go in the next few years. Nevertheless, as object detection technology becomes more mature, we need to be very cautious about its adoption by the military and police. A dystopia where Terminators hunt and shoot humans with a YOLO detector is the last thing we want to see in our lifetime.

Originally published at on Aug 9, 2020