
A Complete Beginner’s Guide to Document Similarity Algorithms

Cosine similarity is the cosine of the angle between two vectors in a multidimensional space, measured from the origin. Vectors separated by a smaller angle are more similar. Vectors separated by a larger angle are more different.

While harder to wrap your head around, cosine similarity solves some problems with Euclidean distance. Namely, its sensitivity to magnitude.

[Figure: number of times an article mentions the words “cooking” and “restaurant”]

In the above drawing, we compare 3 documents based on how many times they contain the words “cooking” and “restaurant”.

Euclidean distance tells us the blog and magazine are more similar than the blog and newspaper. But I think that’s misleading.

The blog and newspaper could have similar content but are distant in a Euclidean sense because the newspaper is longer and contains more words.

In reality, they both mention “restaurant” more than “cooking” and are probably more similar to each other than not. Cosine similarity doesn’t fall into this trap.
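To make the trap concrete, here is a minimal sketch with made-up counts. The numbers and the `euclidean_distance` helper are assumptions for illustration, not values read off the drawing.

```python
import math

# Hypothetical word counts as (cooking, restaurant).
# These are illustrative numbers, not the article's actual data.
blog = (4, 6)
magazine = (7, 3)
newspaper = (20, 35)


def euclidean_distance(a, b):
    """Straight-line distance between two count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


blog_magazine = euclidean_distance(blog, magazine)
blog_newspaper = euclidean_distance(blog, newspaper)

print(f"blog vs magazine:  {blog_magazine:.2f}")   # roughly 4.24
print(f"blog vs newspaper: {blog_newspaper:.2f}")  # roughly 33.12
```

With these counts the blog looks far closer to the magazine, even though the blog and newspaper share the same "more restaurant than cooking" profile. The newspaper is simply longer.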

Let’s work through our above example. We’ll compare documents based on the count of specific words.

Rather than taking the distance between each pair of documents, we’ll now take the cosine of the angle between them, measured from the origin. Now even just eyeballing it, the blog and the newspaper look more similar.

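Note that cosine similarity is not the angle itself, but the cosine of the angle. Cosine falls from 1 at 0 degrees to 0 at 90 degrees, so a smaller angle returns a larger similarity. Word counts are never negative, so the angle between two documents always stays within that range.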
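For reference, this is the standard definition written out. The symbols $A$ and $B$ are my own labels for two documents' word-count vectors, and $n$ is the number of words being counted.

```latex
\text{similarity}(A, B) = \cos\theta
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{i=1}^{n} A_i B_i}
         {\sqrt{\sum_{i=1}^{n} A_i^{2}} \; \sqrt{\sum_{i=1}^{n} B_i^{2}}}
```

Dividing by the two vector lengths is what cancels out magnitude: doubling every count in a document leaves the result unchanged.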

https://en.wikipedia.org/wiki/Cosine_similarity

Let’s implement a function to calculate this ourselves.
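Here is a minimal sketch of such a function using NumPy. The name `cosine_sim` and the word counts are assumptions for illustration, not the article's actual data.

```python
import numpy as np


def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b.

    Assumes neither vector is all zeros; a zero vector would
    cause a division by zero here.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical word counts as [cooking, restaurant].
blog = [4, 6]
magazine = [7, 3]
newspaper = [20, 35]

blog_magazine = cosine_sim(blog, magazine)
blog_newspaper = cosine_sim(blog, newspaper)

print(f"blog vs magazine:  {blog_magazine:.3f}")   # roughly 0.838
print(f"blog vs newspaper: {blog_newspaper:.3f}")  # roughly 0.998
```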

Now we see that the blog and newspaper are indeed more similar to each other.

In production, we’re better off just importing scikit-learn’s more efficient implementation.
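A sketch of what that could look like, using `cosine_similarity` from `sklearn.metrics.pairwise`. The word counts are again made-up numbers for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical word counts. Rows: blog, magazine, newspaper.
# Columns: cooking, restaurant.
counts = np.array([
    [4, 6],
    [7, 3],
    [20, 35],
])

# With a single argument, every row is compared against every other row,
# giving a 3x3 matrix of pairwise similarities.
similarities = cosine_similarity(counts)

blog_magazine = similarities[0, 1]
blog_newspaper = similarities[0, 2]

print(f"blog vs magazine:  {blog_magazine:.3f}")   # roughly 0.838
print(f"blog vs newspaper: {blog_newspaper:.3f}")  # roughly 0.998
```

One call covers all the documents at once, which is handy when there are more than three to compare.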

Same values. Great!