
Accuracy’s unpopular best friend in recommenders

It keeps your customers interested.

In the earlier example, the music recommender was very accurate — it only recommended music from artistes you liked and previously listened to. A model solely focused on accuracy would do very well in offline evaluation (of the latest week of data).

But in the real world, it would suck. It would recommend very similar music daily. Users would get bored soon.

People intrinsically enjoy variety and discovering things such as new artistes, genres, communities, etc. Helping users discover new products keeps them interested.

It’s good for business.

Keeping customer utility high is important. But besides customer utility, you also need to care about assortment health and seller health.

From an e-commerce perspective, poor assortment health means a small percentage of products, categories, or sellers (e.g., 5%) gets a disproportionately large portion of sales and revenue (e.g., 95%). If those top sellers decide to move to a competitor platform, it’s gonna hurt. The same would happen if a top-selling product stops being available (e.g., out of stock, banned, discontinued).

Introducing serendipity (and the associated metrics of diversity and novelty) helps recommend more products from the long tail of the assortment. This distributes sales more evenly, reducing dependence on, and risk from, a minority of products and sellers.

As a plus, you get to expose long-tail and cold-start products to customers. This helps gather training data which you can use to improve your recommender. It’s a virtuous cycle.

The tail becomes bigger and longer, shifting from blue to red (source: Wikimedia commons)

“If you can’t measure it, you can’t improve it” — Peter Drucker

Unfortunately, there’s no industry-wide standard for measuring serendipity, unlike relevance, which has well-established metrics such as normalized discounted cumulative gain (NDCG), mean average precision (MAP), recall, and precision.

To gain a better understanding of how to measure serendipity, I went through 10+ papers on serendipity in recommender systems. Here’s a summary of the key metrics and the various ways to measure them.


Diversity

Diversity measures how narrow or wide the spectrum of recommended products is. A recommender that only recommends the music of one artiste is pretty narrow; one that recommends across multiple artistes is more diverse.

There are two main ways of measuring diversity: based on items and based on users.

Measuring diversity based on items is straightforward. We can do this using the metadata of the recommended items:

  • How many different categories/genres?
  • How many different artistes/authors/sellers?
  • What is the kurtosis (“tailedness”) of the price distribution?
  • How different (i.e., distant) are the product embeddings?
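Here’s a minimal sketch of two of these checks, assuming each recommended item is a dict with metadata fields and an embedding. The field names (`genre`, `artiste`, `emb`) and the choice of mean pairwise cosine distance are illustrative assumptions, not a standard.

```python
from itertools import combinations
import math


def distinct_count(items, key):
    """Number of distinct metadata values (e.g., genres) in a recommended set."""
    return len({item[key] for item in items})


def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def mean_pairwise_distance(embeddings):
    """Average cosine distance over all unordered pairs (needs at least 2 items)."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)


# Hypothetical recommended set with toy 2-d embeddings.
recs = [
    {"genre": "jazz", "artiste": "A", "emb": [1.0, 0.0]},
    {"genre": "jazz", "artiste": "B", "emb": [0.0, 1.0]},
    {"genre": "rock", "artiste": "C", "emb": [1.0, 0.0]},
]
n_genres = distinct_count(recs, "genre")
n_artistes = distinct_count(recs, "artiste")
spread = mean_pairwise_distance([r["emb"] for r in recs])
```

Higher counts and a larger `spread` suggest a more diverse set. Zero vectors and single-item sets are not handled here.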

Another approach is measuring diversity based on existing customers. For each item in the recommended set, who has consumed it? If the items have a relatively large proportion of common users, the recommended items are likely very similar.

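One way to measure this is cosine similarity over the items’ user sets (Ziegler et al., 2004). For two products, it is the number of users they have in common, divided by the square root of the product of each item’s user count.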


This can then be extended to the set of recommended items.
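A rough sketch of the user-based approach, assuming we have the set of user ids that consumed each item. Extending to a set by taking one minus the mean pairwise similarity is one common choice and an assumption here; the function and variable names are made up for illustration.

```python
from itertools import combinations
import math


def user_cosine_similarity(users_i, users_j):
    """|U_i & U_j| / (sqrt(|U_i|) * sqrt(|U_j|)) for two sets of user ids."""
    if not users_i or not users_j:
        return 0.0
    common = len(users_i & users_j)
    return common / (math.sqrt(len(users_i)) * math.sqrt(len(users_j)))


def user_based_diversity(recommended, consumers):
    """1 - mean pairwise similarity over all item pairs in the recommended set."""
    pairs = list(combinations(recommended, 2))
    if not pairs:
        return 0.0
    total = sum(user_cosine_similarity(consumers[a], consumers[b]) for a, b in pairs)
    return 1.0 - total / len(pairs)


# Hypothetical consumption data: item -> users who consumed it.
consumers = {
    "album_a": {"u1", "u2", "u3", "u4"},
    "album_b": {"u1", "u2", "u3", "u4"},
    "album_c": {"u5"},
}
narrow = user_based_diversity(["album_a", "album_b"], consumers)
wider = user_based_diversity(["album_a", "album_b", "album_c"], consumers)
```

Here `album_a` and `album_b` share all their users, so recommending just those two scores zero diversity; adding `album_c`, which has a disjoint audience, raises the score.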



Novelty

Novelty measures how new, original, or unusual the recommendations are for the user.

In general, recommendations will mostly consist of popular items because (i) popular items have more data and (ii) popular items do well in offline and online evaluations.

However, if an item is popular or top-selling, a user has probably been exposed to it already. This could happen via your “top-selling” or “trending” banners, social media, or the user’s relationships (e.g., family, friends, co-workers). Therefore, it makes sense to tune a recommender for novelty by reducing the number of popular items it recommends.

The common way I’ve seen papers measure novelty is to compare a user’s recommended items to the population’s. How often do a user’s recommendations occur in the rest of the population’s recommendations? There are two forms of this measure (Zhang et al., 2011; Vargas & Castells, 2011).

In both forms, the numerator counts the number of users who were recommended the item and the denominator counts the total number of users. Novelty is then either one minus this ratio or the negative log of it. In both cases, if all users were recommended the item, novelty would be zero.
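As a sketch of the two forms, here is my reading of the description above: one minus the share of users recommended the item, and the log of the inverse of that share. The exact formulas in the cited papers may differ in detail, and all names below are hypothetical.

```python
import math
from collections import Counter


def recommendation_counts(recs_by_user):
    """Count, per item, how many users were recommended it."""
    counts = Counter()
    for items in recs_by_user.values():
        counts.update(set(items))  # count each item once per user
    return counts


def novelty_linear(item, counts, n_users):
    """1 - (users recommended the item / all users)."""
    return 1.0 - counts[item] / n_users


def novelty_log(item, counts, n_users):
    """log2(all users / users recommended the item); inf if never recommended."""
    if counts[item] == 0:
        return math.inf
    return math.log2(n_users / counts[item])


# Hypothetical recommendations for four users.
recs_by_user = {
    "u1": ["harry_potter", "x"],
    "u2": ["harry_potter"],
    "u3": ["harry_potter"],
    "u4": ["harry_potter", "y"],
}
counts = recommendation_counts(recs_by_user)
n_users = len(recs_by_user)
popular_linear = novelty_linear("harry_potter", counts, n_users)
rare_linear = novelty_linear("x", counts, n_users)
popular_log = novelty_log("harry_potter", counts, n_users)
rare_log = novelty_log("x", counts, n_users)
```

An item recommended to everyone scores zero under both forms; an item recommended to one user in four scores 0.75 and 2 bits respectively.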

I’m not sure this is the best approach to measuring novelty. Why would it matter to the user what recommendations other people receive?

If a user is the only person recommended a product (e.g., a Harry Potter book), novelty — as measured above — would be close to maximum.

But the rest of the population might already have purchased/read the book (and thus not be recommended it). If everyone around the user has bought, read, or talked about the product — is the recommended product still novel to the user?

A better measure of novelty is to consider the population’s interactions (e.g., purchases, clicks) instead of its recommendations. This reflects how likely it is that the user has already been exposed to the product, based on the population’s historical engagement.
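A small sketch of this interaction-based variant, assuming an event log of `(user, item)` pairs. The log format, the function name, and the use of the log form are all assumptions for illustration.

```python
import math
from collections import defaultdict


def interaction_novelty(events, n_users):
    """Map item -> log2(all users / distinct users who interacted with it)."""
    users_by_item = defaultdict(set)
    for user, item in events:
        users_by_item[item].add(user)  # repeat events by one user count once
    return {
        item: math.log2(n_users / len(users))
        for item, users in users_by_item.items()
    }


# Hypothetical purchase/click log for a population of four users.
events = [
    ("u1", "harry_potter"), ("u2", "harry_potter"),
    ("u3", "harry_potter"), ("u4", "harry_potter"),
    ("u1", "harry_potter"),  # repeat purchase, ignored by the set
    ("u1", "obscure_zine"),
]
scores = interaction_novelty(events, n_users=4)
```

With this data, the book everyone has bought scores zero novelty even if only one user is recommended it, while the item only one user touched scores 2 bits.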

What if we want to measure novelty specific to a user? Most literature refers to that as unexpectedness or surprise.

Unexpectedness (aka surprise)

One measure of unexpectedness is to compare a user’s new recommendations (from an updated recommender) against previous recommendations. This measures “how much surprise are we introducing with this serendipity feature”. It’s useful for tracking incremental improvements on recommendation systems.

Another approach is to compare recommendations against the user’s historical item interactions. This measures “how surprising are these recommendations given what the user previously bought/clicked on”. I believe this is a better way to operationalise surprise.

One way to measure this is via point-wise mutual information (PMI). PMI indicates how similar two items are based on the number of users who have purchased both items and each item separately (similar to measuring diversity based on users).

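p(i) is the probability of any user consuming item i, while p(i, j) is the probability of a user consuming both i and j. When PMI is normalised by -log2 p(i, j), it ranges from -1 to 1, where -1 indicates that the two items are never consumed together, 0 indicates independence, and 1 indicates that they are always consumed together. Lower values mean a recommendation is more unexpected given the user’s history.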
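A minimal sketch of normalised PMI computed from user counts. The -1 to 1 range holds for the normalised form, PMI divided by -log2 p(i, j); the function name and the handling of the degenerate cases are my own assumptions.

```python
import math


def normalised_pmi(n_i, n_j, n_ij, n_users):
    """Normalised PMI from counts of users who consumed i, j, and both."""
    if n_ij == 0:
        return -1.0  # never consumed together
    p_i, p_j, p_ij = n_i / n_users, n_j / n_users, n_ij / n_users
    if p_ij == 1.0:
        # Every user consumed both: PMI and its normaliser are both 0.
        # Returning 0 is a convention chosen for this sketch.
        return 0.0
    pmi = math.log2(p_ij / (p_i * p_j))
    return pmi / -math.log2(p_ij)


never_together = normalised_pmi(n_i=10, n_j=10, n_ij=0, n_users=100)
independent = normalised_pmi(n_i=50, n_j=50, n_ij=25, n_users=100)
always_together = normalised_pmi(n_i=10, n_j=10, n_ij=10, n_users=100)
```

To score a recommended item against a user’s history, one option is to aggregate (e.g., average or take the maximum of) its normalised PMI with each historical item and treat lower values as more surprising.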

Another approach is to use a similarity or distance metric (e.g., cosine similarity). We compute cosine similarity between a user’s recommended items (I) and historical item interactions (H). Lower cosine similarity indicates higher unexpectedness.
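A sketch of the similarity-based version, assuming each item has an embedding vector. Scoring a recommended item as one minus its mean cosine similarity to the history is an assumption; other aggregations, such as the maximum similarity, would also be reasonable.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def unexpectedness(rec_vec, history_vecs):
    """1 - mean cosine similarity between one recommended item and the history H."""
    sims = [cosine_similarity(rec_vec, h) for h in history_vecs]
    return 1.0 - sum(sims) / len(sims)


# Hypothetical 2-d embeddings of items the user previously interacted with.
history = [[1.0, 0.0], [1.0, 0.0]]
familiar = unexpectedness([2.0, 0.0], history)    # same direction as the history
surprising = unexpectedness([0.0, 3.0], history)  # orthogonal to the history
```

The item pointing the same way as the user’s history scores close to zero, while the orthogonal one scores close to one.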


Serendipity

Serendipity is measured as unexpectedness multiplied by relevance, where relevance is 1 if the recommended item is interacted with (e.g., purchase, click) and 0 otherwise. For a recommended item (i), we only consider unexpectedness if the user interacted with i.

To get overall serendipity, we average over all users (U) and all recommended items (I).
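Putting the pieces together, here is a rough sketch that assumes unexpectedness scores have already been computed per user and item (by any of the methods above). The dict-based inputs and function names are hypothetical.

```python
def user_serendipity(recommended, interacted, unexpectedness):
    """Mean of unexpectedness * relevance over one user's recommended items."""
    if not recommended:
        return 0.0
    # Relevance is 1 if the user interacted with the item, else 0,
    # so only interacted items contribute their unexpectedness.
    total = sum(unexpectedness[i] for i in recommended if i in interacted)
    return total / len(recommended)


def overall_serendipity(recs_by_user, interactions_by_user, unexpectedness_by_user):
    """Average the per-user scores over all users with recommendations."""
    scores = [
        user_serendipity(
            recs,
            interactions_by_user.get(user, set()),
            unexpectedness_by_user[user],
        )
        for user, recs in recs_by_user.items()
    ]
    return sum(scores) / len(scores)


# Hypothetical evaluation data for two users.
recs_by_user = {"u1": ["a", "b"], "u2": ["c", "d"]}
interactions_by_user = {"u1": {"a"}, "u2": {"c", "d"}}
unexpectedness_by_user = {
    "u1": {"a": 0.8, "b": 0.5},
    "u2": {"c": 0.2, "d": 0.4},
}
score = overall_serendipity(recs_by_user, interactions_by_user, unexpectedness_by_user)
```

In this example u1 scores 0.4 (only item a was relevant) and u2 scores 0.3, so the overall figure is about 0.35.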

It seems straightforward to implement this in code but in reality, it’s a tricky matter. If your recommender system is deliberately introducing long-tail and cold-start products, you can expect the relevance metric to perform poorly in offline evaluations. Nonetheless, the recommendations might still be useful to customers and perform well in an A/B test.

(In a previous project where I deliberately introduced cold-start products in a product ranking system, the offline evaluation metrics were bad and we expected conversion to drop a bit. To our surprise, we actually saw conversion improve during the A/B test.)

So take offline evaluation metrics of serendipity (and relevance) with a pinch of salt. Use them to compare multiple recommenders, but don’t let them dissuade you from starting an A/B test.