Analysis and Applications of Multi-Scale CNN Feature Maps

Mohammad Sanatkar

In this blog post, we present a formal treatment of the receptive fields of convolutional layers and characterize multi-scale convolutional feature maps using a derived mathematical framework. Using this framework, we compute the receptive fields and spatial scales of feature maps under different convolutional and pooling operations. We show the significance of pooling operations in ensuring the exponential growth of feature map spatial scales as a function of layer depth, and we observe that without pooling operations embedded into CNNs, feature map spatial scales grow only linearly as layer depth increases. We introduce the spatial scale profile as the layer-wise spatial scale characterization of a CNN, which can be used to assess the compatibility of feature maps with the histograms of object dimensions in training datasets. This use case is illustrated by computing the spatial scale profile of ResNet-50. We also explain how the feature pyramid module generates multi-scale feature maps enriched with augmented semantic representations. Finally, we show that while dilated convolutional filters preserve the spatial dimensions of feature maps, they maintain a greater exponential growth rate of spatial scales compared to their regular convolutional counterparts.

Reading this blog post, you will gain deeper insight into the intuitions behind the use cases of multi-scale convolutional feature maps in recently proposed CNN architectures for a variety of vision tasks. Therefore, this blog post can be treated as a tutorial on how different types of layers impact the spatial scales and receptive fields of feature maps. It is also aimed at engineers and researchers who design CNN architectures and are tired of blind trial and error when choosing which feature maps from a CNN backbone will improve the performance of their models, and who instead prefer, from the early steps of the design process, to match the spatial scale profiles of feature maps with the object dimensions in their training datasets. To facilitate such use cases, we have made our code base publicly available at https://github.com/rezasanatkar/cnn_spatial_scale.

It is a general assumption and understanding that feature maps generated by the early convolutional layers of CNNs encode basic semantic representations such as edges and corners, whereas deeper convolutional layers encode more complex semantic representations such as complicated geometric shapes in their output feature maps. This ability of CNNs to generate feature maps with multiple semantic levels results from their hierarchical representation learning, which is based on multi-layer deep structures. Feature maps with different semantic levels are critical for CNNs for the two following reasons: (1) complex semantic feature maps are built on top of basic semantic feature maps as their building blocks, and (2) a number of vision tasks like instance and semantic segmentation benefit from both basic and complex semantic feature maps. A CNN-based vision architecture takes an image as input and passes it through several convolutional layers with the goal of generating semantic representations corresponding to the input image. In particular, each convolutional layer outputs a feature map, where the extent of the encoded semantics in that feature map depends on the representational learning ability of both that convolutional layer and its preceding convolutional layers.

CNN Feature Maps Are Spatially Variant

One important characteristic of CNN feature maps is that they are spatially variant: CNN feature maps have spatial dimensions, and a feature encoded by a given feature map might only become active for a subset of spatial regions of that feature map. In order to better understand this spatial variance property, we first need to understand why the feature maps generated by fully connected layers are not spatially variant. The feature maps generated by fully connected layers (you can think of the activations of the neurons of a given fully connected layer as its output feature map) do not have spatial dimensions, since every neuron of a fully connected layer is connected to all the input units of that layer. Therefore, it is not possible to define a spatial aspect for a neuron's activation output.

On the other hand, every activation of a CNN feature map is only connected to a few input units, which lie in each other's spatial neighborhood. This property of CNN feature maps gives rise to their spatial variance characteristic and results from the spatially local structure of convolution filters and their spatially limited receptive fields. The difference between fully connected layers and convolutional layers, which results in spatial invariance for one and spatial variance for the other, is illustrated in the below image, where the input image is denoted by the green rectangle and the brown rectangle denotes a convolutional feature map. Also, a fully connected layer with two output neurons is denoted by the blue and grey circles. As you can see, each neuron of the fully connected layer is impacted by all the image pixels, whereas each entry of the feature map is only impacted by a local neighborhood of input pixels.

This figure illustrates why the features generated by fully connected layers are not spatially variant while convolutional layers generate spatially variant feature maps. The green rectangle denotes the input image and the brown rectangle denotes a feature map with dimension 5 x 7 x 1 generated by a convolutional layer of a CNN. On the other hand, the blue and grey circles denote the activation outputs of a fully connected layer with two output neurons. Let us assume the blue neuron (feature) of the fully connected layer becomes active if there is a bicycle in the input image while its grey neuron (feature) becomes active if there is a car in the input image. In other words, the blue neuron is the bicycle feature while the grey neuron is the car feature. Because each neuron's output in a fully connected layer is impacted by all the input image pixels, the features generated by fully connected layers cannot encode any localization information out-of-the-box that would tell us where in the input image the bicycle is located if one is present. On the other hand, the feature maps generated by convolutional layers are spatially variant and therefore encode localization information in addition to the existence information of objects. In particular, a feature map of dimension W x H x C generated by a convolutional layer contains information about the existence of C different features (each channel, the third dimension of the feature map, encodes the existence information of a unique feature), where the spatial dimension W x H tells us for which locations of the input image the feature is activated. In this example, the brown convolutional feature map only encodes one feature since it has only one channel (its third dimension is equal to one). Assuming this brown feature map is the bicycle feature map, an entry of this feature map becomes active only if there is a bicycle in the receptive field of that entry in the input image. In other words, this entry does not become active if there is a bicycle in the input image but outside its specific receptive field. This property enables convolutional feature maps not only to encode information about the existence of objects in input images, but also to encode the localization information of objects.

Spatial Scales and Spatial Overlaps of CNN Feature Maps

In this section, we formally define the spatial scales of CNN feature maps. Since CNN feature maps have spatial dimensions, their spatial scales are used to compute the spatial mappings between their entries and input image regions. In particular, the spatial scale of an entry of a given CNN feature map is defined as the pixel-wise size of the rectangular subregion of the input image that impacts the value of that feature map entry. The simplest case is to compute the spatial scale of the first layer. For example, if the first layer of a CNN is a 3 x 3 convolutional layer, then the spatial scales of the entries of the first-layer feature map are 3 x 3 pixel subregions of input images. Computing the spatial scales of output feature maps of deeper CNN layers requires knowing the spatial scales and spatial overlaps of their input feature maps. In the following sections, we derive the spatial scale and overlap formulas for output feature maps of different convolutional and pooling layers in terms of their input feature maps' spatial scales and spatial overlaps.

Here, spatial overlap is defined as the percentage of overlap between the spatial scales of two neighboring feature map entries. In the below image, we plot the spatial scale and overlap of two neighboring feature map entries, where the green rectangle denotes the input image and the brown rectangle denotes a feature map generated by one of the convolutional layers of the CNN. The spatial scale of the orange feature map entry is the shaded orange region over the input image and the spatial scale of the blue feature map entry is the blue shaded region over the input image. The spatial overlap between the neighboring blue and orange entries is the overlapping area over the input image between the two spatial scale regions.

The green rectangle denotes the input image whereas the brown rectangle denotes a CNN feature map with dimension 5 x 7 x 1 generated by one of the convolutional layers of the CNN. In this figure, two neighboring entries of this feature map are marked with orange and blue colors. The spatial scale of the orange feature map entry is drawn as the orange shaded region on the input image whereas the spatial scale of the blue feature map entry is drawn as the blue shaded area over the input image. As you can see, the blue and orange spatial scale rectangles have the same size and overlap with each other. The ratio of the size of the overlapping area between these two neighboring spatial scale rectangles to the size of a spatial scale region is defined as the spatial overlap.

As an example of computing spatial overlap, let us assume the first layer of a CNN is a 3 x 3 convolutional layer with stride 1. Then, the first-layer feature maps will have spatial scales of 3 x 3 and spatial overlaps of 6 pixels out of 9 pixels, which is about 67% spatial overlap. This percentage of overlap is a consequence of choosing the stride to be equal to 1. If the stride is chosen to be 2, then the spatial overlap will be equal to 33%, and a stride of 3 results in 0% spatial overlap. In general, we should avoid very high spatial overlap percentages simply because the feature map entries will end up encoding redundant information at the cost of computation and memory resources. On the other hand, a very low spatial overlap percentage results in an aliasing effect in the entries of generated feature maps.
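To make this first-layer arithmetic concrete, the following minimal Python sketch (the function name is ours, not part of the original code base) computes the overlap fraction for a k x k convolution applied directly to the image:

```python
def first_layer_overlap(kernel_size: int, stride: int) -> float:
    """Fractional spatial overlap between two horizontally neighboring
    entries of the feature map produced by a CNN's first layer, whose
    k x k filters are applied directly to the input image.

    The two k x k receptive fields share a k x (k - stride) pixel region,
    so the overlap fraction is (k - stride) / k (clipped at 0)."""
    k = kernel_size
    return max(k - stride, 0) / k


# Reproduces the numbers above for a 3 x 3 first layer:
print(first_layer_overlap(3, 1))  # 0.666...  (~67%)
print(first_layer_overlap(3, 2))  # 0.333...  (~33%)
print(first_layer_overlap(3, 3))  # 0.0
```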

In this section, we discuss how CNNs rely on pooling operations to ensure an exponential growth rate of feature map spatial scales as a function of layer depth. First, we show that not embedding pooling operations in CNNs and only relying on convolutional layers with stride 1 results in a linear growth rate of the spatial scales of feature maps. Then, pooling layers are shown to realize an exponential growth rate of feature map spatial scales. Finally, a case study of two variants of CNNs, with and without pooling operations, is presented to illustrate the impact of pooling operations on the spatial scale growth rate of feature maps.

Here, we show that a CNN consisting of only 3 x 3 convolutional layers with stride 1 can only demonstrate a linear growth rate of feature map spatial scales. The spatial scale of the first-layer feature map will be 3 x 3, simply due to the 3 x 3 receptive field over the input image. The spatial scale of the feature map output by the second 3 x 3 convolutional layer with stride 1 will be equal to 5 x 5. This is because the receptive field of the second-layer feature map is 3 x 3 with respect to the first-layer feature map, while the receptive field of the first-layer feature map is 3 x 3 with respect to the input image. Therefore, if you combine these two 3 x 3 receptive fields sequentially and consider the spatial overlaps of the receptive fields of neighboring entries of the first-layer feature map, you can compute that the spatial scale of the second-layer feature map with respect to the input image is 5 x 5. This is shown in the below image, where for simplicity the input image is assumed to be one-dimensional and the convolutional layers are assumed to be 1 x 3 with stride 1. It can be observed that the spatial scale of the blue entry of the second-layer feature map is 1 x 5 with respect to the input image.

In this figure, we illustrate the spatial scales of feature maps generated by 1 x 3 convolutional layers with stride 1 applied to a one-dimensional input image. The green rectangle denotes the one-dimensional input image, the brown rectangle shows three neighboring feature map entries output by the first convolutional layer, and the blue rectangle shows a single feature map entry generated by the second convolutional layer. As you can see, the spatial scale of the first-layer feature map entries is 1 x 3 while the spatial scale of the second-layer feature map entries is 1 x 5.

Based on the above example and induction, it can be shown that adding each 3 x 3 convolutional layer with stride 1 increases the spatial scale of feature maps by 2 pixels along each dimension. Let S(n) denote the spatial scale of the nth-layer feature map. Then, we can write S(n) = (3 + 2(n-1)) x (3 + 2(n-1)), which shows that the spatial scale growth rate with respect to layer depth is linear. This linear growth rate of spatial scales is mainly due to the significant spatial overlaps of neighboring feature map entries. The linear growth rate of spatial scales becomes an issue if the vision task for which the network is used requires relatively large spatial scale feature maps. For example, in order to increase the spatial scale of the final feature map (the feature map generated by the last CNN layer of the network) to 40 x 40 pixels of the input image, the network requires about twenty 3 x 3 convolutional layers. Because of the computational cost of convolutional layers during training and inference, in most cases we prefer to avoid adding more and more convolutional layers just for the sake of meeting the spatial scale requirements of feature maps.
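As a quick sanity check of the linear formula S(n), the short sketch below (illustrative helper name, assuming the same 3 x 3, stride-1 setting) counts how many layers are needed to reach a 40 x 40 spatial scale:

```python
def spatial_scale_width_no_pooling(n: int, kernel: int = 3) -> int:
    """Spatial scale width of the n-th feature map of a CNN built only
    from kernel x kernel convolutions with stride 1 (no pooling).
    Each additional layer adds (kernel - 1) pixels per dimension:
    width(n) = kernel + (kernel - 1) * (n - 1)."""
    return kernel + (kernel - 1) * (n - 1)


# How many 3 x 3, stride-1 layers are needed to reach a 40 x 40 scale?
n = 1
while spatial_scale_width_no_pooling(n) < 40:
    n += 1
print(n)  # 20, consistent with the "about twenty layers" estimate above
```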

Pooling Operations

In most CNN architectures, convolutional layers are interleaved with pooling operations to increase the growth rate of the spatial scales of feature maps by reducing the spatial overlap between feature map entries. Increasing the growth rate of spatial scales beyond a linear rate allows us to avoid adding more convolutional layers just for the sake of larger spatial scales. We can realize pooling operations either via explicit pooling layers or by choosing the stride of convolutional filters to be greater than one. The most common pooling layer is the 2 x 2 max pooling layer with stride 2. This pooling layer halves the spatial dimensions of feature maps, which means that it transforms an input feature map of size W x H x C (W x H denotes the spatial dimension of the feature map and C denotes the number of channels) into an output feature map of size W/2 x H/2 x C. The main motivation for halving the spatial dimensions of feature maps is to reduce the spatial overlap between neighboring feature map entries.

Spatial Scale and Overlap of 2 x 2 Pooling Layers with Stride 2

In this section, we derive spatial scale and overlap formulas for the output feature maps of 2 x 2 pooling layers with stride 2 in terms of the spatial scale and overlap of their input feature maps. Let S and P denote the spatial scale and spatial overlap of the input feature map, respectively. Also, let S' and P' denote the spatial scale and spatial overlap of the output feature map of the max pooling layer. We can compute S'(S, P) = S + 2(1-P)S + (1-P)²S = (2-P)²S. The S' equation is derived from the fact that the spatial scale of a feature map entry generated by this max pooling layer is equal to the aggregation of the spatial scales of its 4 corresponding neighboring input feature map entries, where the (1-P) and (1-P)² factors ensure that every subregion of the union of the spatial scales of the 4 input entries is counted only once.

In order to further clarify the derivation of the spatial scale formula for 2 x 2 pooling layers, we demonstrate the reasoning behind it in the below image. The green rectangle denotes the input image, the brown rectangle denotes the input feature map (with dimension 5 x 7 x 1) to the pooling layer, and the blue rectangle denotes the output feature map (with dimension 3 x 4 x 1) of the pooling layer. To illustrate the derivation of S', we consider a single entry (marked with grey color) of the output feature map. The spatial scale of the grey entry depends on the spatial scales and overlaps of its 4 corresponding input feature map entries, as shown in the below image. Without loss of generality, we can assume that the first term S in S' = S + 2(1-P)S + (1-P)²S corresponds to the spatial scale of the orange entry of the input feature map. The second term 2(1-P)S takes into account the spatial scales of the blue and purple entries of the input feature map, where the factor (1-P) makes sure that they do not overlap with the spatial scale of the orange entry. Finally, the term (1-P)²S corresponds to the black entry of the input feature map, where the factor (1-P)² ensures that the effective spatial scale of the black entry does not overlap with the already counted spatial scales of the orange, blue and purple entries.

Demonstration of the derivation of the spatial scale formula for 2 x 2 pooling layers. The green rectangle denotes the input image; the brown rectangle denotes the input feature map to the pooling layer and the output feature map of the pooling layer is denoted by the blue rectangle. The spatial scale of the orange output entry is the union of the 4 regions marked on the input image, which correspond to the spatial scales of the 4 neighboring input feature map entries.

On the other hand, the absolute value of the spatial overlap between two neighboring entries after applying 2 x 2 max pooling is equal to 2PS - P²S = P(2-P)S, where 2PS comes from the fact that the 2 x 2 receptive fields of the two neighboring output entries are adjacent via two pairs of input entries. Finally, we need to subtract P²S to avoid counting the overlapping subregions twice.

The above reasoning for computing the absolute spatial overlap of 2 x 2 max pooling layers with stride 2 is illustrated in the below image, where the green rectangle denotes the input image, the brown rectangle denotes the input feature map to the pooling layer, and the blue rectangle denotes the output feature map of the pooling layer. In this illustration, without loss of generality, we focus on the spatial overlap of two neighboring output entries marked with grey and orange colors. Note that the receptive fields of these two entries do not have explicit overlap because of the stride of 2. However, they have effective overlap of their spatial scales through their adjacent input pairs. In particular, the blue input entry of the grey output entry is adjacent to the green input entry of the orange output entry, while the black input entry of the grey output entry is adjacent to the red input entry of the orange output entry. The absolute overlap of 2PS - P²S is marked as the purple dashed area over the input image. Each of the adjacent pairs (blue, green) and (black, red) contributes a PS term to the final value of the absolute spatial overlap, while the subtraction of P²S ensures that we do not count the overlapping spatial scales twice.

Demonstration of derivation of spatial overlap for 2 x 2 pooling layers with stride 2. The green rectangle denotes the input image while the brown rectangle denotes the input feature map and the blue rectangle denotes the output feature map. The absolute spatial overlap between the neighboring grey and orange output entries is shown as a dashed purple area over the input image.

The spatial overlap P' is equal to the absolute spatial overlap P(2-P)S divided by S'. Therefore, we have P'(S, P) = P(2-P)S / ((2-P)²S) = P/(2-P). This result is significant since it shows that P' is always smaller than P, which implies that the 2 x 2 pooling layer with stride 2 always reduces the spatial overlap between neighboring feature map entries. For example, if the spatial overlap before applying the pooling operation is 2/3, it reduces to 2/4 after applying the pooling.
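The derived update rules for 2 x 2 max pooling with stride 2 can be packaged into a small helper, sketched below in Python (the function name and the example values are ours, not from the original code base):

```python
def pool2x2_stride2(S: float, P: float) -> tuple[float, float]:
    """Spatial scale S' and spatial overlap P' of the output of a
    2 x 2 max pooling layer with stride 2, given the spatial scale S
    and overlap P of its input feature map (formulas derived above)."""
    S_out = (2 - P) ** 2 * S      # S' = S + 2(1-P)S + (1-P)^2 S
    P_out = P / (2 - P)           # P' = P(2-P)S / S'
    return S_out, P_out


# Example from the text: an input overlap of 2/3 drops to 1/2.
S_out, P_out = pool2x2_stride2(S=9.0, P=2 / 3)
print(round(S_out, 6), round(P_out, 6))  # 16.0 0.5
```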

Spatial Scale and Overlap of 3 x 3 Convolutional Layers with Stride 1

Here, we rely on the same line of reasoning as in the previous section to derive spatial scale and overlap formulas for 3 x 3 convolutional layers with stride 1. Let S and P denote the spatial scale and overlap of the input feature map entries. Also, let S' and P' denote the spatial scale and spatial overlap of the feature map entries at the output of the 3 x 3 convolutional filters with stride 1. Then, we can compute the spatial scale for 3 x 3 convolutional layers as S'(S, P) = S + 4(1-P)S + 4(1-P)²S. Next, we focus on computing the spatial overlap. Computing the absolute spatial overlap of the output feature map of 3 x 3 convolutional filters with stride 1 is more straightforward than computing the spatial overlap for 2 x 2 pooling layers with stride 2, and it is demonstrated in the below image.

In the below image, the green rectangle denotes the input image while the brown and blue rectangles denote the input and output feature maps of a 3 x 3 convolutional layer with stride 1. Note that stride 1 keeps the spatial dimensions of feature maps unchanged, and therefore the spatial dimensions of both the input and output feature maps are 4 x 6 x 1. Here, without loss of generality, we focus on computing the spatial overlap between the neighboring grey and orange output feature map entries. Because of the stride of 1, the overlapping receptive field of these two output entries over the input feature map is 3 x 2 and is denoted by the shaded purple region over the brown input feature map.

This 3 x 2 explicit overlap region simplifies computing the absolute value of the spatial overlap between the grey and orange output entries. In particular, their absolute spatial overlap is equal to the aggregated spatial scales of the input feature map entries belonging to the 3 x 2 purple overlapping area. It can be shown that the absolute spatial overlap is equal to S + 3(1-P)S + 2(1-P)²S. In this example, the term S, which is marked as a shaded blue region over the input image, corresponds to the entry in the second row and first column of the purple shaded region. The three (1-P)S terms are marked with black, red and green regions over the input image, and correspond to the top, right and bottom entries with respect to the entry in the second row and first column. Finally, the two (1-P)²S terms are marked as brown regions over the input image and correspond to the top-right and bottom-right entries of the purple overlapping area. Therefore, the spatial overlap is P'(S, P) = (S + 3(1-P)S + 2(1-P)²S) / S' = (1 + (1-P)(5-2P)) / (1 + 4(1-P)(2-P)).

Demonstration of the computation of the spatial overlap of the output feature maps of 3 x 3 convolutional layers with stride 1. The green rectangle denotes the input image while the brown rectangle denotes the input feature map to the 3 x 3 convolutional layer and the blue rectangle denotes its output feature map. The shaded purple area over the brown input feature map denotes the 3 x 2 overlapping area of the receptive fields of the two neighboring output feature map entries, grey and orange. The absolute spatial overlap between these two neighboring output entries is marked as the union of 6 regions (blue, black, red, green and two brown) over the input image.
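Similarly, here is a minimal sketch of the update rule for a 3 x 3 convolution with stride 1, using the S' and P' formulas derived above (the function name and example values are illustrative):

```python
def conv3x3_stride1(S: float, P: float) -> tuple[float, float]:
    """Spatial scale S' and overlap P' after a 3 x 3 convolution with
    stride 1, given the input spatial scale S and overlap P (formulas
    derived above)."""
    S_out = S + 4 * (1 - P) * S + 4 * (1 - P) ** 2 * S
    # Absolute overlap of two neighboring outputs: S + 3(1-P)S + 2(1-P)^2 S
    abs_overlap = S + 3 * (1 - P) * S + 2 * (1 - P) ** 2 * S
    return S_out, abs_overlap / S_out


# First layer on the raw image: S = 9 (3 x 3 pixels), P = 2/3. A second
# 3 x 3 layer should reproduce the 5 x 5 scale computed earlier.
S_out, P_out = conv3x3_stride1(S=9.0, P=2 / 3)
print(round(S_out ** 0.5, 6), round(P_out, 6))  # 5.0 0.8
```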

Case Study – CNNs with 20 3 x 3 Convolutional Layers

Using the derived spatial scale and overlap formulas for 3 x 3 convolutional layers with stride 1 and 2 x 2 max pooling with stride 2, we show the significance of applying pooling layers in order to increase the effective spatial scale of feature map entries. For our example, we consider the following two variations of a 20-layer CNN: (1) a 20-layer CNN with twenty 3 x 3 convolutional layers with stride 1 and without pooling layers, the blue curve in the below image, and (2) a 20-layer CNN with twenty 3 x 3 convolutional layers with stride 1 interleaved with 2 x 2 max pooling layers with stride 2 every 4 convolutional layers, the red curve in the below image.

In the below plotted curves, the x-axis is the layer depth and the y-axis is the spatial scale width (the width of the spatial scale is equal to the square root of the spatial scale) of the feature map entries generated by a CNN layer at the depth specified on the x-axis. As you can see, in both cases the spatial scale of the feature map entries increases with layer depth. However, the spatial scale growth rate of the CNN with pooling layers (the red curve) is exponential whereas the spatial scale growth rate of the CNN without pooling layers (the blue curve) is linear. The exponential growth rate of spatial scales for the CNN with pooling layers results in its final feature map entries having spatial scales of 243 x 243, while the spatial scale of the final feature map entries of the CNN without pooling layers only grows to 50 x 50. This means that the feature maps generated by the CNN with pooling layers can encode objects as large as 243 x 243 pixels captured in input images, while the CNN without pooling layers is only able to encode objects as large as 50 x 50 pixels.

Comparison between the spatial scales of two CNNs, both with twenty 3×3 CNN layers; red curve: 2×2 max pooling every four 3×3 CNN layers; blue curve: only 3×3 CNN layers and no pooling. x-axis: layer depth; y-axis: layer-specific spatial scale width. The spatial scale growth rate of the CNN with pooling is exponential while the spatial scale growth rate of the CNN without pooling is linear.
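For readers who want to reproduce the qualitative comparison, the sketch below re-implements the two update rules in compact closed form (algebraically equivalent to the formulas derived earlier) and chains them for the two 20-layer variants. The exact final numbers depend on where the pooling layers are placed and on the bookkeeping details, so this is only an illustration of the linear-versus-superlinear trend; the plotted curves come from the author's code base.

```python
def conv3x3_stride1(S, P):
    # 3 x 3 convolution, stride 1: simplified closed forms of the rules above
    return (3 - 2 * P) ** 2 * S, (2 - P) / (3 - 2 * P)


def pool2x2_stride2(S, P):
    # 2 x 2 max pooling, stride 2
    return (2 - P) ** 2 * S, P / (2 - P)


def final_scale_width(num_layers=20, pool_every=None):
    """Spatial scale width after num_layers 3 x 3 / stride-1 conv layers,
    optionally with a 2 x 2 / stride-2 max pooling after every pool_every
    conv layers (one plausible arrangement; not necessarily the one plotted)."""
    S, P = 9.0, 2 / 3                        # first 3 x 3 layer on raw pixels
    for layer in range(2, num_layers + 1):
        S, P = conv3x3_stride1(S, P)
        if pool_every and layer % pool_every == 0:
            S, P = pool2x2_stride2(S, P)
    return S ** 0.5


print(final_scale_width(20))                 # grows linearly with depth
print(final_scale_width(20, pool_every=4))   # grows much faster with pooling
```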

While designing CNN architectures, it is necessary to aggregate the statistics of object dimensions in the training datasets and to verify that the feature maps of the designed CNN architectures provide spatial scales that span all the major modes of the object dimension histograms. We call the layer-wise spatial scale of a CNN its spatial scale profile. As a best practice, it is advisable to first design the spatial scale profile based on the histograms of the training dataset's object dimensions and then design the CNN architecture according to the spatial scale profile requirements.

As an example, here we compute the spatial scale profile of ResNet-50 (50-layer). As mentioned before, the pooling operations in CNN architectures can be realized either using explicit pooling layers or by choosing the stride of convolutional filters to be greater than one. ResNets (Deep Residual Learning for Image Recognition) use both convolutional filters with strides greater than one and explicit pooling layers to reduce the spatial overlap between neighboring feature map entries. Among the 5 ResNet architectures presented in Deep Residual Learning for Image Recognition (18-layer, 34-layer, 50-layer, 101-layer and 152-layer, shown in the below table), we derive the spatial scale profile for ResNet-50. However, the presented results can be readily extended to the other 4 configurations. In ResNet-50, three types of pooling operations are used: (1) 7 x 7 convolutional filters with stride 2, (2) a 3 x 3 max pooling layer with stride 2, and (3) 1 x 1 convolutional filters with stride 2.

ResNet-18, ResNet-34, ResNet-50, ResNet-101 and ResNet-152 configurations. For ResNet-50, downsampling is performed in the following order: the first layer (7×7 convolutional layer with stride 2), the second layer (3×3 max pooling layer with stride 2) and 1×1 convolutional layers with stride 2 as the first layers of conv3_1, conv4_1 and conv5_1 blocks.

The 7 x 7 convolutional layer with stride 2 is only used as the first layer of ResNet-50. Therefore, the computation of its output spatial scale and overlap can be simplified. In particular, the spatial scale of each feature map entry output by the 7 x 7 convolutional filters is 7 x 7, since they are applied directly to input images. Also, the absolute spatial overlap between neighboring feature map entries is equal to 7 x 5. Therefore, the corresponding spatial overlap is equal to (7 x 5)/(7 x 7) = 5/7.

The 3 x 3 max pooling layer with stride 2 is only used as the second layer of ResNet-50. Let S and P denote the spatial scale and overlap of the input feature map entries, respectively, and S' and P' denote the spatial scale and spatial overlap of the output feature map entries. The spatial scale formula of 3 x 3 pooling layers is the same as that of 3 x 3 convolutional layers. Therefore, we have S'(S, P) = S + 4(1-P)S + 4(1-P)²S. On the other hand, we need to derive the spatial overlap of 3 x 3 pooling layers with stride 2 using the same techniques that we already discussed for the other layers. We can compute the spatial overlap as P'(S, P) = (1 + 2(1-P)) / (1 + 4(1-P)(2-P)).

The 1 x 1 convolutional layers with stride 2 are used at different depths of ResNet-50 as the main pooling operation to reduce the spatial overlap between neighboring feature map entries and to ensure an exponential growth rate of the spatial scale of feature map entries. Since 1 x 1 convolutional filters have a receptive field of 1 x 1, they do not change the spatial scale of feature map entries. However, having a stride of 2, they modify the spatial overlap of the feature map entries. Let P and P' denote the spatial overlaps of feature map entries before and after applying 1 x 1 convolutional layers with stride 2, respectively. Then, we can compute P'(S, P) = 2 max(P - 0.5, 0). This implies that if the spatial overlap of the input feature map entries is less than 0.5, the spatial overlap of the output feature map entries will be zero, since the receptive field of 1 x 1 convolutional filters is only 1 x 1 and the stride of 2 ensures that neighboring output entries do not have overlapping receptive fields. Therefore, in order to have non-zero spatial overlap, it must be carried over from the previous layers of the network.
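Putting the three ResNet-50 pooling operations together, a minimal Python sketch of the corresponding spatial scale and overlap update rules might look as follows (function names are ours; only the formulas come from the text):

```python
def conv7x7_stride2_first_layer() -> tuple[float, float]:
    """Spatial scale and overlap after ResNet-50's first layer
    (7 x 7 convolution, stride 2, applied directly to the image)."""
    return 49.0, 5 / 7                    # 7 x 7 pixels, overlap (7*5)/(7*7)


def maxpool3x3_stride2(S: float, P: float) -> tuple[float, float]:
    """3 x 3 max pooling with stride 2 (second layer of ResNet-50)."""
    S_out = S + 4 * (1 - P) * S + 4 * (1 - P) ** 2 * S
    P_out = (1 + 2 * (1 - P)) / (1 + 4 * (1 - P) * (2 - P))
    return S_out, P_out


def conv1x1_stride2(S: float, P: float) -> tuple[float, float]:
    """1 x 1 convolution with stride 2 (first layer of the conv3_1,
    conv4_1 and conv5_1 blocks): the scale is unchanged, the overlap shrinks."""
    return S, 2 * max(P - 0.5, 0.0)


# Chaining the first two layers of ResNet-50:
S, P = conv7x7_stride2_first_layer()
S, P = maxpool3x3_stride2(S, P)
print(round(S ** 0.5, 3), round(P, 3))  # 11.0 0.636
```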

The below image plots the spatial scale profile of ResNet-50, where the x-axis spans the layers of ResNet-50 and the y-axis denotes the spatial scale width (the spatial scale width is simply equal to the square root of the spatial scale) at each layer of ResNet-50. As can be observed, the pooling operations used in ResNet-50 result in an exponential growth rate of the spatial scale of ResNet-50 as a function of layer depth. In particular, the final feature map generated by ResNet-50 has a spatial scale of 483 x 483 pixels, which implies that its feature map entries encode representations of objects as large as 483 x 483 pixels embedded in input images. Also, we observe that the spatial scale profile of ResNet-50 is piece-wise constant, which is caused by the 1 x 1 convolutional layers with stride 1 that are mainly used in ResNet-50 for channel-wise dimensionality reduction of feature maps.

Spatial scale profile of ResNet-50 where x-axis spans over the layers of ResNet-50 and y-axis denotes the spatial scale of each layer of ResNet-50. Spatial scale profile of ResNet-50 manifests exponential growth of layer-wise spatial scale as a function of layer depth.

In this section, we discuss why the majority of CNN architectures for different vision tasks rely on multi-scale feature maps to improve their detection, classification and regression performance. The answer is rather simple and straightforward. The objects captured in natural images manifest themselves in a variety of pixel-wise dimensions. For example, if a car is 5 meters away from the camera, it appears much larger in the image than if it were 50 meters away. Therefore, for this example, we need two feature maps with different spatial scales, where one is suitable when the car is 5 meters away from the camera and the other when the car is 50 meters away. Note that even though the feature map with the larger spatial scale (matching the car image when it is 5 meters away from the camera) is also able to encapsulate the image of the car when it is 50 meters away, it does not provide an accurate representation for the car 50 meters away, since its spatial scale is not tight enough around the car image and contains significant leakage of information from the other objects in the scene.

The issue of objects appearing in images with different pixel-wise dimensions is known as the scale ambiguity of images, and along with occlusion and camera point-of-view variance, it complicates vision-based detection systems. The most well-known non-ML method designed to address the scale ambiguity of images is the Scale-Invariant Feature Transform (SIFT). Arguably, SIFT has been the main inspiration for deep learning multi-scale feature map image encoders. SIFT relies on an image pyramid (resizing input images to different scales and aggregating detected key points after processing each scaled version of the input image) and Gaussian smoothing filters with different σ² to detect blobs at different scales as the key points of input images. Detecting blobs at different scales allows SIFT to generate reliable image encodings in the presence of scale ambiguity in natural images.

The semantic representation generated by a given CNN for an input image is the union of all feature maps generated by each convolutional layer of the CNN, not only the final feature map. Relying on several feature maps provides the network with different spatial scales.

Multi-Scale Feature Maps for Object Detection Models

It is safe to say that in the area of CNN-based vision models, the strength of multi-scale feature maps initially manifested itself in 2D object detection models. In particular, SSD: Single Shot MultiBox Detector relies on VGG-16 as the backbone network (feature generator) to encode input images via 6 feature maps with different spatial scales. The different spatial scales of the feature maps allow SSD to detect objects with different pixel-wise dimensions embedded in input images.

As another example, Feature Pyramid Networks for Object Detection (FPN) uses ResNet-50 and ResNet-101 as the backbone networks to generate multi-scale feature maps. In the case of ResNet-50, the feature map outputs of conv2_3 (11th layer), conv3_4 (23rd layer), conv4_6 (41st layer) and conv5_3 (50th layer) are chosen to provide feature maps with different spatial scales. Using the ResNet-50 spatial scale profile that we presented in the previous section, we can compute the spatial scales of the output feature maps of conv2_3, conv3_4, conv4_6 and conv5_3 as 35 x 35, 99 x 99, 291 x 291 and 483 x 483, respectively. Therefore, the selected feature maps can represent objects with different sizes appearing in input images.
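As a simple illustration of how such a spatial scale profile can guide feature map selection, the sketch below (a rough heuristic of our own, not FPN's actual level-assignment rule) picks the smallest of the four ResNet-50 feature maps whose spatial scale still covers an object of a given pixel size:

```python
# Spatial scale widths of the FPN-selected ResNet-50 feature maps,
# as computed from the spatial scale profile above.
RESNET50_FPN_SCALES = {
    "conv2_3": 35,
    "conv3_4": 99,
    "conv4_6": 291,
    "conv5_3": 483,
}


def best_feature_map(object_size_px: float) -> str:
    """Pick the feature map whose spatial scale width is the smallest one
    that still covers the object. This is only a simple heuristic for
    matching scale profiles to object dimensions; real detectors use their
    own anchor/level assignment rules."""
    for name, scale in sorted(RESNET50_FPN_SCALES.items(), key=lambda kv: kv[1]):
        if scale >= object_size_px:
            return name
    return "conv5_3"  # fall back to the coarsest map for very large objects


print(best_feature_map(35))   # conv2_3
print(best_feature_map(480))  # conv5_3
```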

However, the main contribution of FPN is to enhance the semantic representation capability of the shallower-layer feature maps using the semantic information encoded in the deeper-layer feature maps. The main weakness of feature maps generated by shallow layers is that they are not semantically as rich as the feature maps generated by deeper layers. This is because the process of semantically encoding input images into feature maps is hierarchical, meaning that basic semantics appear in the early-layer feature maps, while more complex semantics appear in the feature maps of deeper layers.

To better understand why this is an issue for object detection models, consider the following example with ResNet-50 as the backbone network. Assume a car is 50 meters away from the camera and is captured in the input image via a box of 35 by 35 pixels; therefore, the feature map generated by conv2_3 (11th layer) is the best candidate to encode it. Also, assume a second car that is 5 meters away from the camera and whose corresponding box in the input image has dimensions of 480 by 480 pixels; therefore, conv5_3 (50th layer) suits it best. It is reasonable to assume that both cars have similar semantic complexity, independently of their sizes in input images. However, we know that the 11th layer of ResNet-50 is not semantically as rich as the 50th layer of ResNet-50.

FPN solves this issue using a top-down path that transfers the semantics encoded by deeper layers to shallower layers via a nearest-neighbor upsampling operation, which ensures that the spatial dimensions of the transferred deeper feature maps match the spatial dimensions of the shallower feature maps (shallower-layer feature maps have larger spatial dimensions than the feature maps of deeper layers). This process is demonstrated in the below image. Merging the semantics of deeper and shallower layers is realized by element-wise addition of the upsampled feature maps of the deeper layers and the transformed versions of the shallower-layer feature maps obtained with 1 x 1 lateral convolutional layers. Finally, to ensure that the feature maps generated by these element-wise addition operations do not suffer from an aliasing effect, they are filtered using 3 x 3 convolutional layers.

FPN module to transfer the rich semantics of deeper layers to shallower layers via top-down branch on the right-side.
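The following PyTorch sketch shows one possible implementation of the merge step described above (nearest-neighbor upsampling, 1 x 1 lateral projection, element-wise addition, and a 3 x 3 anti-aliasing convolution); the class name, channel counts and tensor shapes are illustrative, not those of the original FPN code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FPNTopDownMerge(nn.Module):
    """Minimal sketch of one FPN merge step: upsample the deeper (coarser)
    map with nearest-neighbor interpolation, add the laterally projected
    shallower map, then smooth with a 3 x 3 convolution to reduce aliasing."""

    def __init__(self, shallow_channels: int, fpn_channels: int = 256):
        super().__init__()
        # 1 x 1 lateral projection of the shallower feature map
        self.lateral = nn.Conv2d(shallow_channels, fpn_channels, kernel_size=1)
        # 3 x 3 smoothing applied after the element-wise addition
        self.smooth = nn.Conv2d(fpn_channels, fpn_channels, kernel_size=3, padding=1)

    def forward(self, deep: torch.Tensor, shallow: torch.Tensor) -> torch.Tensor:
        # Nearest-neighbor upsampling to match the shallower spatial size
        deep_up = F.interpolate(deep, size=shallow.shape[-2:], mode="nearest")
        merged = deep_up + self.lateral(shallow)  # element-wise addition
        return self.smooth(merged)


# Example: merge a 256-channel deep map into a 512-channel shallow map.
merge = FPNTopDownMerge(shallow_channels=512)
deep = torch.randn(1, 256, 7, 7)
shallow = torch.randn(1, 512, 14, 14)
print(merge(deep, shallow).shape)  # torch.Size([1, 256, 14, 14])
```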

In this section, we focus on dilated convolution layers (Multi-Scale Context Aggregation by Dilated Convolutions), which were first introduced in 2016. Dilated convolutional filters, unlike regular convolutional filters, perform sparse sampling of their input feature maps to increase the spatial scale of their output feature maps. Therefore, you can think of them as an alternative to pooling layers and convolutional layers with stride greater than one for increasing the spatial scales of CNN feature maps. Dilated convolutional filters perform sparse sampling of input feature maps based on holes embedded in their convolutional filters, in contrast to regular convolutional filters, which do not have any holes and rely on dense local sampling of the input feature maps.

To better understand how dilated convolutional filters are applied to input feature maps, in the below image we show the convolution operation of a regular 3 x 3 convolution filter on the left (red marks) versus the convolution operation of a dilated 3 x 3 convolution filter with dilation rate 2 on the right (blue marks). As you can see, the dilated convolution filter performs sparse sampling of its input feature map by skipping (the skips are referred to as holes) every other input entry in each dimension. In other words, a dilation rate of 2 means that the convolution filter samples only one entry per each 2 x 2 region of the input feature map. Therefore, in general, a dilation rate of n is equivalent to sampling one entry per each n x n region of the input feature map. Note that regular convolution filters are a special case of dilated convolution filters with dilation rate 1.

left: regular 3 x 3 convolution filter, right: dilated 3 x 3 convolution filter with dilation rate of 2
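A small helper makes the sampling pattern explicit: a k x k filter with dilation rate d spans (k - 1) * d + 1 input entries per dimension but samples only k of them (helper names are ours):

```python
def dilated_kernel_extent(kernel_size: int, dilation: int) -> int:
    """Per-dimension extent (in input entries) spanned by a dilated filter:
    (kernel_size - 1) * dilation + 1. A 3 x 3 filter with dilation 2 spans 5."""
    return (kernel_size - 1) * dilation + 1


def sampled_offsets(kernel_size: int, dilation: int) -> list[int]:
    """Offsets of the input entries sampled along one dimension,
    relative to the first tap of the filter."""
    return [i * dilation for i in range(kernel_size)]


print(dilated_kernel_extent(3, 1), sampled_offsets(3, 1))  # 3 [0, 1, 2]
print(dilated_kernel_extent(3, 2), sampled_offsets(3, 2))  # 5 [0, 2, 4]
```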

Dilated convolutions rely on the holes in their receptive fields to increase the spatial scale of their output feature map entries. In particular, the holes (skips) in the receptive fields of dilated convolution filters reduce the spatial overlap between the sampled input feature map entries, which causes an increase in the spatial scale of the output feature map entries. In the above example, the dilated convolution filter with dilation rate 2 skips one feature map entry between every two consecutive sampled entries, which results in a decrease in the spatial overlap between the sampled feature map entries. The effective spatial scales of dilated convolution filters will be larger than the spatial scales of regular convolution filters if there is non-zero spatial overlap between neighboring input feature map entries.

Here, we first derive formulas to compute the spatial scale and overlap of dilated 3 x 3 convolution filters with stride 1 and dilation rate 2, and then we compare the growth rate of their spatial scales with the growth rate of regular 3 x 3 convolutional filters with stride 1 and stride 2. Let S and P denote the spatial scale and overlap of the input feature map, respectively. Also, let S' and P' denote the spatial scale and overlap of the output feature maps of dilated 3 x 3 convolution filters with stride 1 and dilation rate 2, respectively. Then, we can compute S'(S, P) = 9S - 24 max(P - 0.5, 0) S. For the special case P < 0.5, S' is equal to 9 times the input spatial scale S, which is the maximum increase in spatial scale that we can expect from dilated 3 x 3 convolution filters with dilation rate 2. In order to compute P', two formulas are derived based on whether P < 0.5 or P > 0.5. If P < 0.5, then P'(S, P) = 15P / (9 - 24 max(P - 0.5, 0)), and if P > 0.5, then P'(S, P) = (6 + 3(1-P) - (14 + 4(1-P)) max(P - 0.5, 0)) / (9 - 24 max(P - 0.5, 0)).
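The sketch below implements these dilated-convolution update rules in Python (the function name is ours). As an illustration, if we treat the raw pixels as an input feature map with S = 1 and P = 0 (an assumption made here for illustration), five such layers reach the 243 x 243 spatial scale discussed next:

```python
def dilated_conv3x3_rate2_stride1(S: float, P: float) -> tuple[float, float]:
    """Spatial scale S' and overlap P' after a dilated 3 x 3 convolution
    with stride 1 and dilation rate 2, using the formulas in the text."""
    q = max(P - 0.5, 0.0)
    S_out = 9 * S - 24 * q * S
    if P < 0.5:
        P_out = 15 * P / (9 - 24 * q)
    else:
        P_out = (6 + 3 * (1 - P) - (14 + 4 * (1 - P)) * q) / (9 - 24 * q)
    return S_out, P_out


# Raw pixels as a feature map with S = 1 (one pixel) and P = 0,
# followed by five dilated 3 x 3 layers with dilation rate 2:
S, P = 1.0, 0.0
for _ in range(5):
    S, P = dilated_conv3x3_rate2_stride1(S, P)
print(S ** 0.5)  # 243.0
```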

Next, we compare the spatial scale growth rates of dilated 3 x 3 convolutional filters with stride 1 and dilation rate 2 versus regular 3 x 3 convolutional filters with strides of 1 and 2. In the below image, we plot this comparison in terms of layer depth. The blue curve refers to a 5-layer convolutional network where all 5 layers are dilated convolution layers with stride 1 and dilation rate 2. The red curve is a 5-layer convolutional network composed of 5 regular 3 x 3 convolutional layers with stride 2, and the green curve corresponds to a 5-layer CNN formed by 5 regular 3 x 3 convolutional layers with stride 1. The y-axis denotes the layer-wise spatial scale width, where the spatial scale width is equal to the square root of the spatial scale.

As can be observed, both the dilated 3 x 3 convolutional layers with stride 1 and dilation rate 2 and the regular 3 x 3 convolutional layers with stride 2 show an exponential growth rate of spatial scale as a function of layer depth, whereas the growth rate of the regular 3 x 3 convolutional layers with stride 1 is linear. Moreover, the spatial scale growth rate of the dilated 3 x 3 convolution layers is greater than the spatial scale growth rate of the regular 3 x 3 convolution layers with stride 2. In particular, the spatial scale of the final feature map of the dilated convolution network is 243 x 243 pixels, whereas the spatial scales of the final feature maps of the 3 x 3 convolution networks with stride 2 and stride 1 are equal to 63 x 63 and 11 x 11, respectively.

Comparison of layer-wise spatial scale width of dilated 3 x 3 convolutional layer with stride 1 and dilation rate of 2, regular 3×3 convolutional layer with stride 2 and regular 3 x 3 convolutional layer with stride 1. X-axis denotes layer depth while y-axis denotes the spatial scale width corresponding to each layer. While both dilated 3 x 3 convolutional layer with stride 1 and dilation rate of 2 and regular 3 x 3 convolutional layer with stride 2 show exponential growth rate of spatial scales as a function of layer depths, the regular 3 x 3 convolutional layer with stride 1 only manifests a linear growth rate for spatial scale.

Even though regular 3 x 3 convolutional filters with stride 2 show an exponential growth rate of spatial scales similar to dilated convolutional filters, the regular convolutional filters with stride 2 halve the spatial dimensions (width and height) of feature maps each time they are applied. As we mentioned earlier, pooling operations such as explicit pooling layers with stride greater than 1 and convolutional layers with stride greater than 1 rely on reducing the spatial dimensions of feature maps in order to increase the spatial scale of the feature map entries. On the other hand, dilated convolution filters preserve the spatial dimensions of feature maps thanks to their stride of 1. Preserving the spatial dimensions of feature maps while simultaneously increasing their spatial scales at an exponential growth rate makes dilated convolution filters well suited for vision tasks with dense predictions.

Examples of vision tasks with dense predictions are semantic segmentation, instance segmentation, depth estimation and optical flow. For example, in the case of semantic segmentation, the goal is to make a prediction for the class of each input image pixel. In other words, if the input image is W x H pixels, then a CNN designed for semantic segmentation is expected to generate W x H predictions for the classes of pixels. CNNs aiming at such dense predictions require feature maps with higher resolutions (larger spatial dimensions) than CNNs that perform sparse prediction tasks like image classification. For a sparse prediction vision task like image classification, CNNs only need to output a single prediction for the whole input image, and therefore high-resolution feature maps are not as essential as for dense prediction tasks.

For dense prediction vision tasks, the prediction corresponding to each pixel should be based on both the information embedded in the local neighborhood of the pixel and the global information referred to as the image context. For example, consider the monocular (single camera) depth estimation task: global information, like a pixel being part of a building, provides the network with a rough estimate of the depth of the pixel, while local features, like texture, help the network further refine its estimate of the depth of the pixel. Preserving the local neighborhood information of a pixel in the feature maps requires the convolutional layers to preserve the spatial dimensions of feature maps, and that is why dilated convolutional filters with stride 1 are essential for dense prediction vision tasks.

Preserving the spatial dimensions of feature maps with stride-1 dilated convolution filters is not by itself sufficient to ensure that the feature maps generated by these convolutional filters accurately encode the local neighborhood information of pixels. In particular, we need the dilated convolution filters with stride 1 to limit their spatial scales to local neighborhoods. That being said, dilated convolutional filters, with their significant exponential spatial scale growth rates, are suitable for encoding global image context information as well. In fact, a well-designed CNN benefits from dilated convolution filters with a range of different dilation rates in parallel, to encode a group of feature maps with a range of different spatial scales that represent both local and global information concurrently. This idea is used in the following paper.

In Deep Ordinal Regression Network for Monocular Depth Estimation, the authors generate feature maps with different spatial scales using three parallel dilated convolution layers (referred to as the ASPP module in the below image) with a kernel size of 3 x 3 and dilation rates of 6, 12 and 18. These three convolutional layers with their different dilation rates are applied with stride 1 and zero-padding to ensure that their output feature maps have the same spatial dimensions, so they can be concatenated along their channel dimension. The unified feature map resulting from the channel-wise concatenation of the feature maps generated by these 3 dilated convolution layers is a multi-scale feature map with 3 different spatial scales that encodes both local and global information per feature map entry.

Three parallel branches denoted by ASPP are 3 x 3 dilated convolution layers with dilation rates of 6, 12 and 18.
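A rough PyTorch sketch of such a parallel dilated-convolution module is given below; the class name and channel counts are illustrative and not taken from the paper, but the kernel size, stride, dilation rates and channel-wise concatenation follow the description above:

```python
import torch
import torch.nn as nn


class SimpleASPP(nn.Module):
    """Rough sketch of the parallel dilated branches described above:
    three 3 x 3 convolutions with dilation rates 6, 12 and 18, stride 1
    and matching zero-padding, concatenated along the channel dimension.
    Channel counts are illustrative, not those of the original paper."""

    def __init__(self, in_channels: int, branch_channels: int = 256,
                 rates: tuple[int, ...] = (6, 12, 18)):
        super().__init__()
        # padding = dilation keeps the spatial dimensions unchanged for 3 x 3 kernels
        self.branches = nn.ModuleList(
            nn.Conv2d(in_channels, branch_channels, kernel_size=3,
                      stride=1, padding=r, dilation=r)
            for r in rates
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Channel-wise concatenation of the multi-scale feature maps
        return torch.cat([branch(x) for branch in self.branches], dim=1)


aspp = SimpleASPP(in_channels=2048)
x = torch.randn(1, 2048, 33, 33)
print(aspp(x).shape)  # torch.Size([1, 768, 33, 33])
```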

The next step for this project is to derive spatial scale and overlap formulas for more variations of convolutional and pooling layers, and to add them to the code base.