
# A definitive guide to effect size

## Learn how to correctly calculate and interpret the effect size for your A/B tests!

As a data scientist, you will most likely come across the effect size while working on some kind of A/B test. A typical scenario is that the company wants to make a change to the product (be it a website, mobile app, etc.) and your task is to make sure that the change will, to some degree of certainty, result in better performance in terms of the specified KPI.

This is when hypothesis testing comes into play. However, a statistical test only tells us how strong the evidence is that an effect exists; it says nothing about how large that effect is. By effect, I simply mean a difference: it can be a difference in either direction (a two-sided hypothesis), or a more precise, one-sided hypothesis stating that one sample is better or worse than the other in terms of the given metric. To know how big the effect is, we need to calculate the effect size.

In this article, I will provide a brief theoretical introduction to the effect size and then show some practical examples of how to calculate it in Python.

Additionally, when planning A/B tests, we want to estimate the expected duration of the test. This is connected to the topic of power analysis, which I covered in another article. To quickly summarize: in order to calculate the required sample size, we need to specify three things: the significance level, the power of the test, and the effect size. Keeping the other two constant, the smaller the effect size, the harder it is to detect with a given level of certainty, and so the larger the sample size required for the statistical test.
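To make the relationship between effect size and sample size concrete, here is a minimal sketch. It assumes a two-sided, two-sample comparison of means with equal group sizes and uses the normal approximation n = 2 · ((z₁₋α/₂ + z_power) / d)², where d is a standardized mean difference. The function name `required_n_per_group` and the default values are illustrative assumptions, not something taken from a specific library; a dedicated power-analysis routine based on the t-distribution will give slightly larger numbers.

```python
from math import ceil
from statistics import NormalDist


def required_n_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate per-group sample size for a two-sided, two-sample test.

    Normal approximation: n = 2 * ((z_{1 - alpha/2} + z_{power}) / d) ** 2,
    where d is a standardized mean difference. Hypothetical helper.
    """
    if effect_size == 0:
        raise ValueError("effect_size must be non-zero")
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_power = std_normal.inv_cdf(power)          # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Significance level and power held constant; only the effect size changes.
for d in (0.8, 0.5, 0.2):
    print(f"d = {d}: about {required_n_per_group(d)} observations per group")
```

Halving the effect size roughly quadruples the required sample, because the effect size enters the formula squared.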

In general, there are potentially hundreds of different measures of the effect size, each with its own advantages and drawbacks. In this article, I will present only a selection of the most popular ones. Before we dive deeper into the rabbit hole, note that the measures of the effect size can be grouped into three categories, based on their approach to defining the effect. The groups are:

• Metrics based on the correlation
• Metrics based on differences (for example, between means)
• Metrics for categorical variables

The first two families cover continuous random variables, while the last one is used for categorical/binary features. To give a real-life example, we could apply the first two to a metric such as time spent in an app (in minutes), while the third family could be used for conversion or retention — expressed as a boolean.
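As a rough preview of the three families, the sketch below computes one representative measure from each, using only the standard library. The data are made up for illustration, and the choice of representatives (the point-biserial correlation, Cohen's d with a pooled standard deviation, and the odds ratio) is my assumption about what fits each family best; the helper names are hypothetical as well.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical data: minutes spent in the app for control (a) and variant (b).
minutes_a = [12.0, 15.5, 9.0, 20.0, 14.5, 11.0]
minutes_b = [14.0, 18.5, 13.0, 22.0, 15.5, 16.0]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def cohens_d(x, y):
    """Standardized difference of means (y minus x), using the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * stdev(x) ** 2 + (ny - 1) * stdev(y) ** 2) / (nx + ny - 2)
    return (mean(y) - mean(x)) / sqrt(pooled_var)


def odds_ratio(conv_a, n_a, conv_b, n_b):
    """Odds of converting in group b divided by the odds in group a."""
    return (conv_b / (n_b - conv_b)) / (conv_a / (n_a - conv_a))


# 1) Correlation family: correlate a 0/1 group indicator with the metric
#    (this is the point-biserial correlation).
group = [0] * len(minutes_a) + [1] * len(minutes_b)
r_pb = pearson_r(group, minutes_a + minutes_b)

# 2) Difference family: standardized difference between the group means.
d = cohens_d(minutes_a, minutes_b)

# 3) Categorical family: hypothetical conversions, 120 of 1000 vs. 150 of 1000.
or_ = odds_ratio(120, 1000, 150, 1000)

print(f"point-biserial r = {r_pb:.3f}, Cohen's d = {d:.3f}, odds ratio = {or_:.3f}")
```

The first two measures describe the same continuous metric from two angles, while the odds ratio works directly on the boolean outcome.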

I will describe some of the measures of effect size below, together with the Python implementation.

As the first step, we need to import the required libraries: