The third way to characterize these methods is by the strategy they employ to extract representations. This is arguably where the “magic” happens in all of these methods and where they differ the most.
To understand why this is important, let’s first define what we mean by representations. A representation is the set of distinguishing characteristics that allows a system (or a human) to understand what makes an object that object, and not a different one.
This Quora post uses the example of classifying shapes: to classify shapes successfully, a good representation might be the number of corners detected in each shape.
In this collection of methods for contrastive learning, these representations are extracted in various ways.
CPC introduces the idea of learning representations by predicting the “future” in latent space. In practice this means two things:
1) Treat an image as a timeline with the past at the top left and the future at the bottom right.
2) The predictions don’t happen at the pixel level; instead, they use the outputs of the encoder (i.e., the latent space).
Finally, representation extraction happens by formulating a prediction task: the outputs of the encoder (H) serve as targets for predictions made from the context vectors generated by a projection head (which the authors call a context encoder).
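To make that prediction task concrete, here is a minimal sketch, assuming a latent feature map H produced by an encoder. The row dimension plays the role of “time”: a context module summarizes the “past” rows and a prediction head scores its guess against the “future” latents, InfoNCE-style. The names `context_encoder` and `prediction_head` and the pooling choices are illustrative, not the authors’ exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

batch, channels, rows, cols = 8, 64, 7, 7
H = torch.randn(batch, channels, rows, cols)        # encoder output (latent feature map)

# Illustrative modules (not the authors' exact architecture).
context_encoder = nn.Conv2d(channels, channels, kernel_size=1)   # builds context vectors
prediction_head = nn.Conv2d(channels, channels, kernel_size=1)   # predicts "future" latents

# Context from the "past": here, simply the top half of the feature map.
past = H[:, :, : rows // 2, :]
context = context_encoder(past).mean(dim=(2, 3))     # (batch, channels)

# Targets are the "future" latents (bottom half), pooled for simplicity.
future = H[:, :, rows // 2 :, :].mean(dim=(2, 3))    # (batch, channels)

# Predict the future from the context, then score the prediction against all
# targets in the batch (InfoNCE-style): the matching target is the positive.
pred = prediction_head(context[:, :, None, None]).squeeze(-1).squeeze(-1)
logits = pred @ future.t()                           # (batch, batch) similarity matrix
labels = torch.arange(batch)
loss = F.cross_entropy(logits, labels)
```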
In our paper, we find that this prediction task is unnecessary as long as the data augmentation pipeline is strong enough. And while there are a lot of hypotheses about what makes a good pipeline, we suggest that a strong pipeline creates positive pairs that share a similar global structure but have a different local structure.
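As a rough illustration of that intuition, a pipeline built from aggressive random cropping plus appearance jitter produces two views that keep the global object but differ locally. This is a generic torchvision sketch with illustrative parameters, not the exact pipeline from any one paper.

```python
from torchvision import transforms

# Two views that share global structure (same object) but differ locally
# (different crops, colors, grayscale). Parameter values here are illustrative.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.RandomGrayscale(p=0.2),
    transforms.ToTensor(),
])

def two_views(pil_image):
    # Each call re-samples the random transforms, giving a positive pair.
    return augment(pil_image), augment(pil_image)
```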
AMDIM, on the other hand, compares representations across views using feature maps extracted from intermediate layers of a convolutional neural network (CNN). Let’s unpack this into two parts: 1) multiple views of an image, and 2) intermediate layers of a CNN.
1) Recall that the data augmentation pipeline of AMDIM generates two versions of the same image.
2) Each version is passed through the same encoder to extract feature maps. AMDIM does not discard the intermediate feature maps generated by the encoder; instead, it uses them to make comparisons across spatial scales. Recall that as an input makes its way through the layers of a CNN, the receptive fields encode information at different scales of the input.
AMDIM leverages these ideas by making the comparisons across the intermediate outputs of a CNN. The following animation illustrates how these comparisons are made across the three feature maps generated by the encoder.
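Complementing the animation, here is a minimal sketch of the idea, assuming an encoder that returns feature maps from three intermediate layers for each view. It uses a simple dot-product score between a pooled global vector from one view and every local position of the other view’s maps; the real AMDIM scoring functions and losses are more involved.

```python
import torch

def multiscale_scores(maps_view1, maps_view2):
    """Score a global summary of view 1 against local positions of view 2
    at every intermediate scale. Inputs: lists of (batch, c, h, w) tensors."""
    # Global vector from the last (smallest) feature map of view 1.
    global_vec = maps_view1[-1].mean(dim=(2, 3))            # (batch, c)
    scores = []
    for fmap in maps_view2:                                  # one map per scale
        local = fmap.flatten(2)                              # (batch, c, h*w)
        # Dot product between the global vector and every local position.
        scores.append(torch.einsum("bc,bcl->bl", global_vec, local))
    return scores

# Example with three fake feature maps per view (same channel width here).
view1 = [torch.randn(4, 64, s, s) for s in (28, 14, 7)]
view2 = [torch.randn(4, 64, s, s) for s in (28, 14, 7)]
print([s.shape for s in multiscale_scores(view1, view2)])
```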
The rest of these methods make slight tweaks to the idea proposed by AMDIM.
SimCLR uses the same idea as AMDIM but makes 2 tweaks.
A) Use only the last feature map
B) Run that feature map through a projection head and compare the two projected vectors (similar to CPC’s context projection).
As we mentioned earlier, contrastive learning needs negative samples to work. Normally this is done by comparing each image in a batch against the other images in the same batch.
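Putting the two tweaks together, here is a minimal sketch, assuming pooled encoder outputs for the two views of each image: a small MLP projection head maps both views into the comparison space, and every other image in the batch acts as a negative. This is a simplified, hypothetical version of an NT-Xent-style loss, with made-up sizes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

batch, feat_dim, proj_dim, temperature = 16, 2048, 128, 0.5

# (A) pooled last feature map for each of the two views (fake data here)
h1, h2 = torch.randn(batch, feat_dim), torch.randn(batch, feat_dim)

# (B) projection head: a small MLP applied before comparing the two views
projection = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                           nn.Linear(feat_dim, proj_dim))
z1 = F.normalize(projection(h1), dim=1)
z2 = F.normalize(projection(h2), dim=1)

# In-batch negatives: each row's positive is its counterpart in the other
# view; every other image in the batch provides a negative.
logits = z1 @ z2.t() / temperature          # (batch, batch) similarity matrix
labels = torch.arange(batch)
loss = F.cross_entropy(logits, labels)
```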
MoCo does the same thing as AMDIM (with the last feature map only) but keeps a queue of representations from previous batches. The effect is that the number of negative samples used to provide a contrastive signal grows well beyond a single batch.
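A rough sketch of the queue idea, assuming normalized query and key vectors: keys from previous batches sit in a fixed-size FIFO queue and are reused as extra negatives. Names, sizes, and the loss layout are illustrative, and the momentum (key) encoder itself is omitted here.

```python
import torch
import torch.nn.functional as F

batch, dim, queue_size, temperature = 16, 128, 1024, 0.07

queue = F.normalize(torch.randn(queue_size, dim), dim=1)   # negatives from past batches

def moco_style_loss(q, k, queue):
    """q: queries from the main encoder, k: keys from the momentum encoder."""
    q, k = F.normalize(q, dim=1), F.normalize(k, dim=1)
    pos = (q * k).sum(dim=1, keepdim=True)                  # (batch, 1) positive logits
    neg = q @ queue.t()                                     # (batch, queue_size) negatives
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(len(q), dtype=torch.long)          # the positive is at index 0
    return F.cross_entropy(logits, labels)

# After each step the oldest keys are replaced by the newest ones (FIFO).
q, k = torch.randn(batch, dim), torch.randn(batch, dim)
loss = moco_style_loss(q, k, queue)
queue = torch.cat([F.normalize(k, dim=1), queue])[:queue_size]
```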
BYOL uses the same main ideas as AMDIM (with the last feature map only), but with two changes:
- BYOL uses two encoders instead of one. The second encoder is an exact copy of the first, but instead of being updated by gradients on every pass, its weights are updated via a rolling (exponential moving) average (see the sketch after this list).
- BYOL does not use negative samples; instead it relies on the rolling weight updates to provide a contrastive signal during training. However, a recent ablation found that this may not be necessary, and that it is in fact batch normalization that keeps the system from collapsing to trivial solutions.
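A minimal sketch of the second encoder, assuming any PyTorch encoder: the target network starts as an exact copy of the online network, is never updated by backpropagation, and its weights track the online weights through an exponential moving average.

```python
import copy
import torch
import torch.nn as nn

online = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 128))
target = copy.deepcopy(online)            # starts as an exact copy
for p in target.parameters():
    p.requires_grad = False               # never updated by gradients

@torch.no_grad()
def ema_update(online, target, tau=0.99):
    # Rolling (exponential moving) average of the online weights.
    for p_online, p_target in zip(online.parameters(), target.parameters()):
        p_target.mul_(tau).add_(p_online, alpha=1 - tau)

# Called once per training step, after the optimizer updates `online`.
ema_update(online, target)
```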
SwAV frames its representation extraction task as one of “online clustering,” enforcing “consistency between codes from different augmentations of the same image” [reference]. So it’s the same approach as AMDIM (using only the last feature map), but instead of comparing the vectors directly against each other, it computes their similarity against a set of K cluster codes.
In practice, this means that SwAV maintains K clusters, and each encoded vector is compared against those clusters to learn new representations. This work can be viewed as mixing the ideas of AMDIM and Noise as Targets.
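A rough sketch of comparing embeddings against K prototype vectors, assuming normalized embeddings: each view gets a soft assignment (“code”) over the prototypes, and the loss encourages one view to predict the other view’s code. SwAV additionally computes codes with the Sinkhorn-Knopp procedure and normalizes the prototypes, both omitted here; the sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

batch, dim, K, temperature = 16, 128, 300, 0.1

prototypes = nn.Linear(dim, K, bias=False)                  # the K cluster prototypes

z1 = F.normalize(torch.randn(batch, dim), dim=1)            # view 1 embeddings
z2 = F.normalize(torch.randn(batch, dim), dim=1)            # view 2 embeddings

scores1, scores2 = prototypes(z1), prototypes(z2)           # similarity to each prototype

# Soft cluster assignments ("codes") for each view.
codes1 = F.softmax(scores1 / temperature, dim=1).detach()
codes2 = F.softmax(scores2 / temperature, dim=1).detach()

# Swapped prediction: view 1 should predict view 2's code and vice versa.
loss = -0.5 * ((codes2 * F.log_softmax(scores1 / temperature, dim=1)).sum(1).mean()
               + (codes1 * F.log_softmax(scores2 / temperature, dim=1)).sum(1).mean())
```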
Characteristic 3, takeaways
The representation extraction strategy is where these approaches differ the most. However, the changes are very subtle, and without rigorous ablations, it’s hard to tell what actually drives the results.
From our experiments, we found that the CPC and AMDIM strategies have a negligible effect on the results while adding complexity. The primary driver that makes these approaches work is the data augmentation pipeline.