
Adding Error Bars to 5-Star Reviews: A Bayesian Approach

Eq. 2

The first term in the numerator on the right-hand side is the prior (our initial belief about the distribution of the true rating) and the second term is the likelihood. To be conservative, we can use a flat prior for the rating, that is to say, we assume that before seeing any of the reviews it’s equally likely for the socks to have any rating between 1 and 5 stars. The likelihood for each individual review is the binomial distribution shown in Eq. 1 above, so to construct the full likelihood we only need to replace each k with the observed review score and multiply everything together. Remember that there is a one-to-one mapping between the bias of the coin p (see Eq. 1) and the true rating (Eq. 2), and I’m using them interchangeably.
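For reference, the posterior described in this paragraph can be written out as below. This is a reconstruction from the text rather than a copy of the original equation: the symbols N (number of reviews) and k_i (the outcome of review i) are notation introduced here, and the 4 in the binomial assumes each review behaves like 4 coin flips, so that 0 to 4 heads map onto 1 to 5 stars.

```latex
P(p \mid k_1, \dots, k_N)
  = \frac{P(p)\,\prod_{i=1}^{N} P(k_i \mid p)}{P(k_1, \dots, k_N)},
\qquad
P(k_i \mid p) = \binom{4}{k_i}\, p^{k_i} (1 - p)^{4 - k_i}
```

With a flat prior, P(p) is constant on [0, 1], so the shape of the posterior is set entirely by the product of the per-review likelihoods.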

To find the probability distribution of the true rating of each item, I’m going to use PyMC3 to sample their posteriors. The following function takes in an instance of the Sock object and returns the MCMC samples from the posterior in Eq. 2. Going over the details of how PyMC3 works is beyond the scope of this article, but feel free to ask questions in the comment section or post issues on the GitHub repo if anything is confusing.
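To give a feel for what "sampling the posterior" means, here is a minimal sketch that swaps PyMC3 for a hand-rolled random-walk Metropolis sampler using only the standard library. It is not the function from this post. The names `log_posterior` and `sample_posterior`, the plain list of star counts used in place of a Sock object, the example reviews, and the assumed mapping of a review with s stars to s − 1 heads out of 4 flips (so that rating = 1 + 4p) are all illustrative assumptions.

```python
import math
import random


def log_posterior(p, stars):
    """Log of (flat prior x binomial likelihood), up to an additive constant."""
    if not 0.0 < p < 1.0:
        return -math.inf  # outside the support of the flat prior
    total = 0.0
    for s in stars:
        k = s - 1  # assumed mapping: 1..5 stars -> 0..4 heads out of 4 flips
        # The binomial coefficient does not depend on p, so it is dropped.
        total += k * math.log(p) + (4 - k) * math.log(1.0 - p)
    return total


def sample_posterior(stars, n_samples=20000, burn_in=2000, step=0.1, seed=0):
    """Return Metropolis samples of the true rating, 1 + 4p."""
    rng = random.Random(seed)
    p = 0.5
    current = log_posterior(p, stars)
    ratings = []
    for i in range(n_samples + burn_in):
        proposal = p + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, stars)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            p, current = proposal, candidate
        if i >= burn_in:
            ratings.append(1.0 + 4.0 * p)
    return ratings


# Hypothetical reviews, for illustration only.
samples = sample_posterior([4, 5, 4, 5, 5, 3, 4, 5])
posterior_mean = sum(samples) / len(samples)
```

A library like PyMC3 does the same job with far better samplers and diagnostics; the point here is only that the output is a long list of plausible true ratings whose histogram approximates the posterior.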

Note: In the prior section of the above function I’ve written the flat prior for p as a Beta distribution, Beta(1, 1), because the Beta distribution is the conjugate prior to the binomial distribution. This is technically unnecessary since we’re not solving this problem analytically, but I did it anyway so I could write this little note about it!
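Conjugacy also means the posterior can be written down without any sampling: a Beta(1, 1) prior combined with binomial observations gives another Beta distribution. The sketch below shows that shortcut under the same assumption as before, namely that a review with s stars counts as s − 1 heads out of 4 flips. The helper names `beta_posterior` and `rating_mean_std` and the two example reviews are hypothetical.

```python
import math


def beta_posterior(stars, flips=4):
    """Beta(a, b) posterior for p given a flat Beta(1, 1) prior."""
    heads = sum(s - 1 for s in stars)
    tails = flips * len(stars) - heads
    return 1 + heads, 1 + tails


def rating_mean_std(a, b, flips=4):
    """Mean and standard deviation of the rating 1 + flips * p."""
    mean_p = a / (a + b)
    var_p = a * b / ((a + b) ** 2 * (a + b + 1))
    return 1 + flips * mean_p, flips * math.sqrt(var_p)


# Two hypothetical reviews of 5 and 4 stars give Beta(8, 2),
# i.e. a rating of roughly 4.2 with a standard deviation of roughly 0.48.
a, b = beta_posterior([5, 4])
mean, std = rating_mean_std(a, b)
```

This is a handy sanity check on MCMC output: the sampled mean and standard deviation should land close to these closed-form values.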

Now let’s sample the posterior for all the socks and find the mean and standard deviation of the probability distributions.

Now in addition to an average, we have a standard deviation too! We can add this to the original histogram as error bars (I’m being a bit careless here by simply placing one standard deviation on either side of the mean, which is not strictly correct for non-normal data). Here is what it looks like:

As expected, the socks with a larger number of reviews have smaller error bars. Very interesting, but maybe not as informative as we hoped. The inferred rating for the orange socks with only 2 reviews is still larger than the inferred rating of the red one with 100 reviews. Then how do we pick the correct pair?

Let’s check out the KDE of the posterior to see the actual shape of the distribution. This plot shows the posterior probability of the true rating for each pair: our initial belief about the rating (here, a uniform probability over all ratings) updated by the likelihood of the observed user reviews (the binomial likelihood of Eq. 1).

Let’s make a few observations here. Just as a reminder, the (blue, orange, red) socks with the true ratings of (3.2, 4.0, 4.5) had (20, 2, 100) reviews respectively. First, we see that the width of each posterior PDF shrinks as the number of reviews grows (roughly as one over the square root of the number of reviews): the more reviews we have, the less uncertain we are about the final estimate of the true rating. Second, the peaks of the blue and red PDFs are pretty close to the true ratings, but for the orange pair not so much. This makes sense because for the orange socks we only had 2 observations, which do not constrain the rating very much. The long tail of the orange posterior reflects the fact that the model is very uncertain about the final result. But still, it’s giving us the information that we were looking for in this analysis.

What if instead of using the mean of the posteriors (which sits near the 50th percentile) to rank the socks, as we did in the second plot, we use the 5th percentile? This way we are 95% sure that the true rating lies above this number. With this ranking strategy, socks sink lower in the list either if they have lower ratings or if they have a smaller number of reviews, which means a longer tail in the distribution. Using the 5th percentile for our posterior PDFs results in the following ordering:
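The ranking rule itself fits in a few lines. The sketch below is an illustration rather than the code behind the plots in this post: it simulates reviews from the true ratings and review counts quoted above (the seed and the simulation are assumptions, so the numbers will not match the figures), draws from the conjugate Beta posterior with `random.betavariate` instead of running MCMC, and ranks by the 5th percentile. The names `simulate_reviews` and `lower_bound` are hypothetical.

```python
import random

rng = random.Random(42)  # arbitrary seed, for reproducibility only

# True ratings and review counts quoted in the article.
SOCKS = {"blue": (3.2, 20), "orange": (4.0, 2), "red": (4.5, 100)}


def simulate_reviews(true_rating, n_reviews):
    """Each review is 1 + (heads in 4 flips of a coin with bias p)."""
    p = (true_rating - 1) / 4
    return [1 + sum(rng.random() < p for _ in range(4)) for _ in range(n_reviews)]


def lower_bound(stars, q=0.05, n_draws=20000):
    """q-th quantile of the posterior rating under a flat Beta(1, 1) prior."""
    heads = sum(s - 1 for s in stars)
    tails = 4 * len(stars) - heads
    draws = sorted(
        1 + 4 * rng.betavariate(1 + heads, 1 + tails) for _ in range(n_draws)
    )
    return draws[int(q * n_draws)]


scores = {name: lower_bound(simulate_reviews(r, n)) for name, (r, n) in SOCKS.items()}
ranking = sorted(scores, key=scores.get, reverse=True)  # best pair first
```

Changing `q` trades off caution against optimism: a smaller quantile penalizes sparsely reviewed items more heavily, while `q=0.5` ranks by the posterior median and mostly ignores how many reviews there are.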


And that’s it… Now that we’ve added error bars to our socks, we can confidently buy the best pair. And that’s definitely what Reverend Thomas Bayes would have done!

Important Note: Do not attempt to implement this analysis in real life. Socks or other types of undergarments purchased solely based on the results of this analysis might not match with the rest of your outfit.