A definitive guide to effect size

Learn how to correctly calculate and interpret the effect size for your A/B tests!

Eryk Lewinson

As a data scientist, you will most likely come across the effect size while working on some kind of A/B testing. A possible scenario is that the company wants to make a change to the product (be it a website, mobile app, etc.) and your task is to make sure that the change will — to some degree of certainty — result in better performance in terms of the specified KPI.

This is when hypothesis testing comes into play. However, a statistical test only tells us whether the data provide enough evidence that an effect exists. By effect, I simply mean a difference: it can be a difference in either direction, or a more specific hypothesis stating that one sample is actually better/worse than the other (in terms of the given metric). To know how big the effect is, we need to calculate the effect size.

In this article, I will provide a brief theoretical introduction to the effect size and then show some practical examples of how to calculate it in Python.

Additionally, when planning A/B tests, we want to estimate the expected duration of the test. This is connected to the topic of power analysis, which I covered in another article. To quickly summarize it, in order to calculate the required sample size, we need to specify three things: the significance level, the power of the test, and the effect size. Keeping the other two constant, the smaller the effect size, the harder it is to detect with any certainty, and thus the larger the required sample size for the statistical test.
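To make the relationship between effect size and sample size concrete, here is a rough sketch based on the normal approximation for a two-sided comparison of two means with equal group sizes. The function name `required_n_per_group` and the default values are assumptions made for illustration. A proper power analysis based on the t-distribution will give slightly larger numbers.

```python
from math import ceil
from statistics import NormalDist


def required_n_per_group(d, alpha=0.05, power=0.8):
    """Approximate per-group sample size for a two-sided, two-sample test.

    Uses the normal approximation n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2,
    where d is the standardized difference between the means.
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)


# Smaller effect sizes require many more observations
for d in (0.8, 0.5, 0.2):
    print(f"d = {d}: about {required_n_per_group(d)} observations per group")
```

Halving the effect size roughly quadruples the required sample size, because the effect size enters the formula squared.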

In general, there are potentially hundreds of different measures of the effect size, each one with some advantages and drawbacks. In this article, I will present only a selection of the most popular ones. Before diving deeper into the rabbit hole, note that the measures of the effect size can be grouped into three categories, based on their approach to defining the effect. The groups are:

  • Metrics based on the correlation
  • Metrics based on differences (for example, between means)
  • Metrics for categorical variables

The first two families cover continuous random variables, while the last one is used for categorical/binary features. To give a real-life example, we could apply the first two to a metric such as time spent in an app (in minutes), while the third family could be used for conversion or retention — expressed as a boolean.

I will describe some of the measures of effect size below, together with the Python implementation.

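As the first step, we need to import the required libraries. The snippets in this article assume that numpy is imported as np, scipy.stats as stats, and pingouin as pg.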

1. The correlation family

Before diving into the metrics, we will generate some random, correlated variables coming from the multivariate Normal distribution. They have different means, so we can actually detect some effect, while we keep the variance at 1 for simplicity.

Remember that the more random observations we generate, the more their distribution will resemble the one we specified.
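A minimal sketch of how such data might be generated with NumPy is shown below. The means, the correlation of 0.6, the number of observations, and the seed are assumptions chosen for illustration, not necessarily the values used for the results reported in this article.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed, chosen arbitrarily

# Assumed parameters for illustration
mean = [1.0, 2.0]        # different means
cov = [[1.0, 0.6],
       [0.6, 1.0]]       # unit variances, so the off-diagonal term is the correlation
n_obs = 100_000

x = rng.multivariate_normal(mean, cov, size=n_obs)  # array of shape (n_obs, 2)

print(x.mean(axis=0))                        # close to the specified means
print(np.corrcoef(x, rowvar=False)[0, 1])    # close to the specified correlation
```

Rerunning the sketch with a smaller `n_obs` shows the sample statistics drifting further away from the specified parameters.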

Pearson’s r

This should not come as a surprise, as the name of the family is based on this metric. Pearson’s correlation coefficient measures the degree of linear association between two real-valued variables. The metric is unit-free and is expressed as a number in the range of [-1, 1]. For brevity, I will only describe the interpretation of extreme cases:

  • a value of -1 indicates a perfect negative relationship between variables,
  • a value of 0 indicates no linear relationship,
  • a value of 1 indicates a perfect positive relationship.

As this is one of the most commonly used metrics in general, there are many ways to calculate the correlation coefficient in Python:

  • pearsonr in scipy.stats: in addition to the correlation coefficient, we also receive the p-value of the correlation test. Quoting the documentation: “The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Pearson correlation at least as extreme as the one computed from these datasets.”
stats.pearsonr(x[:,0], x[:,1])
# (0.6023670412294826, 0.0)
  • numpy.corrcoef — returns the correlation matrix, use rowvar to indicate whether the observations of the random variables are stored in rows or columns.
np.corrcoef(x, rowvar=False)
# array([[1. , 0.60236704],
# [0.60236704, 1. ]])
  • the corr method of a pandas DataFrame/Series.
  • pingouin’s corr function — by default it returns the Pearson’s r coefficient (other measures of correlation are also available). In contrast to scipy, the function returns a bit more detailed results of the correlation test. We can also use the one-sided variant of the test.
pg.corr(x[:, 0], x[:, 1])

Coefficient of determination (R²)

The second measure of the effect size in this family is the coefficient of determination, also known as R². It states what proportion of the dependent variable’s variance is explained (predictable) by the independent variable(s). In other words, it is a measure of how well the observed outcomes are replicated by the model.

There are several definitions of the coefficient of determination; however, the most relevant one for us right now is the one connected to Pearson’s r. When using simple linear regression (with one independent variable) with the intercept included, the coefficient of determination is simply the square of Pearson’s r. If there are more independent variables, R² is the square of the coefficient of multiple correlation.

In either of the mentioned cases, the coefficient of determination normally covers the range between 0 and 1. However, if another definition is used, the values can become negative as well. Due to the fact that we square the correlation coefficient, the coefficient of determination does not convey any information about the direction of the correlation.

We can calculate the coefficient of determination by running simple linear regression and inspecting the reported value:

pg.linear_regression(x[:, 0], x[:, 1])

The reported coefficient of determination is close to 0.36, which is the square of the correlation coefficient (0.6).
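The link between R² and Pearson’s r can also be checked by hand. The sketch below fits a simple linear regression with NumPy on synthetic data (the variables `a` and `b` and their parameters are assumptions for illustration) and compares 1 − SS_res / SS_tot with the squared correlation coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=5_000)
b = 0.6 * a + 0.8 * rng.normal(size=5_000)   # correlated with a

# Fit b = intercept + slope * a by ordinary least squares
slope, intercept = np.polyfit(a, b, deg=1)
residuals = b - (intercept + slope * a)

ss_res = np.sum(residuals ** 2)              # unexplained variation
ss_tot = np.sum((b - b.mean()) ** 2)         # total variation
r_squared = 1 - ss_res / ss_tot

r = np.corrcoef(a, b)[0, 1]
print(r_squared, r ** 2)   # the two values agree up to floating-point error
```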

Eta-squared (η²)

The last considered metric in this family is eta-squared. It is the ratio of the variance in the dependent variable explained by a predictor while controlling for other predictors, which makes it similar to r²:

$$\eta^2 = \frac{SS_{\text{effect}}}{SS_{\text{total}}}$$

where SS stands for the sum of squares. η² is a biased estimator of the population’s variance explained by the model, as it only estimates the effect size in the considered sample. This means that eta-squared tends to overestimate the actual effect size, although this bias becomes smaller as the sample size grows. Eta-squared also shares the weakness of r²: each additional variable automatically increases the value of η².

To calculate eta-squared in Python, we can use the pingouin library:

pg.compute_effsize(x[:, 0], x[:, 1], eftype='eta-square')
# 0.057968511053166284

Additionally, the library contains a useful function called convert_effsize, which allows us to convert the effect size measured by Pearson’s r or Cohen’s d into, among others, eta-squared.
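For intuition, the sum-of-squares ratio can also be computed by hand. The sketch below does this for a simple one-way layout with two groups. The data are hypothetical, and this direct calculation is not necessarily what pingouin does internally.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: two groups whose means differ by half a standard deviation
groups = [rng.normal(0.0, 1.0, 500), rng.normal(0.5, 1.0, 500)]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Between-group (effect) and total sums of squares
ss_effect = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_total = np.sum((all_values - grand_mean) ** 2)

eta_squared = ss_effect / ss_total
print(f"eta-squared: {eta_squared:.4f}")
```

With only two groups, this value equals the squared correlation between the group label and the outcome, which is why eta-squared sits in the correlation family.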

2. The “difference” family

Measures in this family express the effect as a standardized difference between two population means:

$$\theta = \frac{\mu_1 - \mu_2}{\sigma}$$

In practice, the population values are not known and have to be estimated from sample statistics. That is why there are multiple methods for calculating the effect size as the difference between means: they differ in terms of which sample statistics they use.

On a side note, this form of the effect size resembles the t-statistic. The difference is that the t-statistic’s denominator divides the standard deviation by the square root of n. Unlike the t-statistic, the effect size aims to estimate a population-level value and is not affected by the sample size.

This family is also known as the “d family”, named after the most common method of estimating the effect size as a difference between means — Cohen’s d.

Before diving into the metrics, we define two random variables coming from the Normal distribution. We use different means and standard deviations to make sure that the variables differ enough to obtain reasonable effect sizes.

Distributions of the two random variables

Cohen’s d

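Cohen’s d measures the difference between the means of two variables. The difference is expressed in terms of the number of standard deviations, hence the division in the formula. Cohen’s d is defined as:

$$d = \frac{\bar{x}_1 - \bar{x}_2}{s}, \qquad s = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$

where \bar{x}_1, \bar{x}_2 are the sample means, n_1, n_2 are the sample sizes, s is the pooled standard deviation, and s_1, s_2 are the standard deviations of the two independent samples.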

Note: Some sources use a different formulation of the pooled standard deviation and do not include the -2 in the denominator.

The most common interpretation of the magnitude of the effect size is as follows:

  • Small Effect Size: d=0.2
  • Medium Effect Size: d=0.5
  • Large Effect Size: d=0.8

Cohen’s d is very frequently used in estimating the required sample size for an A/B test. In general, a lower value of Cohen’s d indicates the necessity of a larger sample size and vice versa.

The easiest way to calculate Cohen’s d in Python is to use the pingouin library:

pg.compute_effsize(x, y, eftype='cohen')
# -0.5661743543595718
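To show what happens under the hood, here is a short sketch of Cohen’s d using the pooled standard deviation with n_1 + n_2 − 2 in the denominator. The helper name `cohens_d` and the parameters of the generated samples are assumptions for illustration, so the printed value will differ from the one above.

```python
import numpy as np


def cohens_d(x, y):
    """Cohen's d using the pooled standard deviation (n1 + n2 - 2 in the denominator)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n1, n2 = len(x), len(y)
    pooled_var = ((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1)) / (n1 + n2 - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)


rng = np.random.default_rng(7)
x = rng.normal(loc=0.0, scale=1.0, size=10_000)   # assumed parameters
y = rng.normal(loc=0.6, scale=1.2, size=10_000)   # assumed parameters
print(f"Cohen's d: {cohens_d(x, y):.4f}")
```

The sign only reflects the order of the arguments: a negative value means that the first sample has the lower mean.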

Glass’ Δ

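Glass’ Δ is defined in the same way as Cohen’s d, except that the difference between the means is divided by the standard deviation of the control group only:

$$\Delta = \frac{\bar{x}_1 - \bar{x}_2}{s_{\text{control}}}$$

The rationale for using only the standard deviation of the control group is that when we compare multiple treatment groups to the same control, all the comparisons share a common denominator.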

pg.compute_effsize(x, y, eftype='glass')
# -0.6664041092152272
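The common-denominator argument is easy to see in a small sketch. Below, a hypothetical `glass_delta` helper scales the mean difference by the control group’s standard deviation only, and two made-up variants are compared against the same control. Which sample plays the role of the control is the caller’s choice here, and a library may pick it differently.

```python
import numpy as np


def glass_delta(treatment, control):
    """Mean difference scaled by the standard deviation of the control group only."""
    treatment = np.asarray(treatment, dtype=float)
    control = np.asarray(control, dtype=float)
    return (treatment.mean() - control.mean()) / control.std(ddof=1)


rng = np.random.default_rng(3)
control = rng.normal(10.0, 2.0, 5_000)        # hypothetical control group
variants = {
    "A": rng.normal(10.5, 2.0, 5_000),        # same spread as the control
    "B": rng.normal(11.0, 4.0, 5_000),        # much larger spread
}

# Both variants are expressed in units of the control group's standard deviation
for name, sample in variants.items():
    print(f"variant {name}: {glass_delta(sample, control):.3f}")
```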

Hedges’ g

Cohen’s d is a biased estimator of the population-level effect size, especially for small samples (n < 20). Hedges’ g corrects for that by multiplying Cohen’s d by a correction factor (based on the gamma function):

$$g = J(n_1 + n_2 - 2)\, d, \qquad J(m) = \frac{\Gamma(m/2)}{\sqrt{m/2}\,\Gamma\left((m - 1)/2\right)} \approx 1 - \frac{3}{4m - 1}$$

pg.compute_effsize(x, y, eftype='hedges')
# -0.5661722311818571

We can see that the difference between Cohen’s d and Hedges’ g is very small. It would be more pronounced for smaller sample sizes.
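To see how quickly the correction vanishes, the sketch below evaluates the gamma-function-based factor J, together with the common approximation 1 − 3 / (4(n_1 + n_2) − 9), for a few group sizes. The function name `hedges_correction` is an assumption, and a given library may implement only one of the two variants.

```python
from math import exp, lgamma, sqrt


def hedges_correction(n1, n2, exact=True):
    """Correction factor J that turns Cohen's d into Hedges' g (g = J * d)."""
    df = n1 + n2 - 2
    if exact:
        # J = Gamma(df / 2) / (sqrt(df / 2) * Gamma((df - 1) / 2)),
        # computed through log-gamma to avoid overflow for large samples
        return exp(lgamma(df / 2) - lgamma((df - 1) / 2)) / sqrt(df / 2)
    return 1 - 3 / (4 * (n1 + n2) - 9)   # common approximation


for n in (5, 10, 50, 1000):
    exact = hedges_correction(n, n)
    approx = hedges_correction(n, n, exact=False)
    print(f"n per group = {n}: exact J = {exact:.5f}, approximate J = {approx:.5f}")
```

For five observations per group the factor shrinks d by roughly 10%, while for a thousand per group it is practically equal to 1.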

3. The categorical family

φ (phi coefficient)

The phi coefficient is a measure of association between two binary variables introduced by Karl Pearson, and is related to the chi-squared statistic of a 2×2 contingency table. In machine learning terms, a contingency table is basically the same as the confusion matrix.

Two binary random variables are positively associated when most of the data falls along the diagonal of the contingency table (think about true positives and true negatives). Conversely, the variables are negatively associated when most of the data falls off the diagonal (think about false positives and false negatives).

As a matter of fact, the Pearson’s correlation coefficient (r) calculated for two binary variables will result in the phi coefficient (we will verify that in Python). However, the range of the phi coefficient is different from the correlation coefficient, especially when at least one of the variables takes more than two values.

What is more, in machine learning we see the increasing popularity of the Matthews correlation coefficient as a measure of evaluating the performance of classification models. In fact, the MCC is nothing other than Pearson’s phi coefficient.

Phi/MCC considers all the elements of the confusion matrix/contingency table, which is why it is considered a balanced evaluation metric that can also be used in cases of class imbalance.

Running the code results in the following output:

Phi coefficient: 0.000944
Matthews Correlation: 0.000944
Pearson's r: 0.000944
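A comparison of this kind could be set up as in the sketch below. The two binary variables are hypothetical (the second one agrees with the first about 70% of the time), so the numbers will differ from the output above. The phi coefficient is computed directly from the 2×2 counts, using the same expression that defines the MCC, and then compared with Pearson’s r.

```python
import numpy as np

rng = np.random.default_rng(11)
# Two hypothetical binary variables; b agrees with a about 70% of the time
a = rng.integers(0, 2, size=10_000)
b = np.where(rng.random(10_000) < 0.7, a, 1 - a)

# Counts of the 2x2 contingency table
n11 = np.sum((a == 1) & (b == 1))
n10 = np.sum((a == 1) & (b == 0))
n01 = np.sum((a == 0) & (b == 1))
n00 = np.sum((a == 0) & (b == 0))

# Phi coefficient (the same expression as the Matthews correlation coefficient)
phi = (n11 * n00 - n10 * n01) / np.sqrt(
    float(n11 + n10) * float(n01 + n00) * float(n11 + n01) * float(n10 + n00)
)

pearson_r = np.corrcoef(a, b)[0, 1]
print(f"Phi coefficient: {phi:.6f}")
print(f"Pearson's r: {pearson_r:.6f}")
```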

Cramér’s V

Cramér’s V is another measure of association between categorical variables (not restricted to the binary case). It is defined as:

$$V = \sqrt{\frac{\varphi^2}{\min(k - 1, r - 1)}} = \sqrt{\frac{\chi^2 / n}{\min(k - 1, r - 1)}}$$

where k and r stand for the number of columns and rows in the contingency table, n is the total number of observations, χ² is the chi-squared statistic, and φ is the phi coefficient as calculated above.

Cramér’s V takes a value between 0 (no association between the variables) and 1 (complete association). Note that for the case of a 2×2 contingency table (two binary variables), Cramér’s V is equal to the phi coefficient, as we will soon see in practice.

The most common interpretation of the magnitude of the Cramér’s V is as follows:

  • Small Effect Size: V ≤ 0.2
  • Medium Effect Size: 0.2 < V ≤ 0.6
  • Large Effect Size: 0.6 < V

Cramer's V: 0.000944

We have indeed obtained the same value as in the case of the phi coefficient.
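For tables larger than 2×2, Cramér’s V can be computed from the chi-squared statistic, using φ² = χ² / n. The sketch below does this with plain NumPy and without a continuity correction. The helper name `cramers_v` and the example table are assumptions for illustration.

```python
import numpy as np


def cramers_v(table):
    """Cramér's V from a contingency table of counts (no continuity correction)."""
    observed = np.asarray(table, dtype=float)
    n = observed.sum()
    # Expected counts under independence of rows and columns
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = np.sum((observed - expected) ** 2 / expected)
    r, k = observed.shape
    return np.sqrt((chi2 / n) / min(k - 1, r - 1))


# Hypothetical 2x3 table: rows are groups, columns are outcome categories
print(f"Cramer's V: {cramers_v([[30, 50, 20], [20, 40, 40]]):.4f}")
```

Applied to a 2×2 table, the function returns the absolute value of the phi coefficient.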

Cohen’s w

Cohen suggested another measure of the effect size, which “increases with the degree of discrepancy between the distribution specified by the alternate hypothesis and that which represents the null hypothesis” (for more details, see page 216 in [1]). In this case, we are dealing with proportions (so fractions of all observations), in contrast to the contingency tables for the previous metrics. Cohen’s w is defined as:

$$w = \sqrt{\sum_{i=1}^{m} \frac{(p_{1i} - p_{0i})^2}{p_{0i}}}$$

where:

  • p_{0i} — the proportion in cell i under the null hypothesis,
  • p_{1i} — the proportion in cell i under the alternative hypothesis,
  • m — number of cells.

The effect size measured by Cohen’s w is considered small for values close to 0.1, medium for around 0.3, and large for around 0.5.

Cohen's w: 0.173820
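As a worked example of the definition, the sketch below computes Cohen’s w for a hypothetical conversion rate of 10% under the null hypothesis and 13% under the alternative. The helper name `cohens_w` and the proportions are assumptions, so the result differs from the output above.

```python
import numpy as np


def cohens_w(p_null, p_alt):
    """Cohen's w from cell proportions under the null and alternative hypotheses."""
    p0 = np.asarray(p_null, dtype=float)
    p1 = np.asarray(p_alt, dtype=float)
    return np.sqrt(np.sum((p1 - p0) ** 2 / p0))


# Cells: (converted, not converted); each set of proportions sums to 1
w = cohens_w([0.10, 0.90], [0.13, 0.87])
print(f"Cohen's w: {w:.4f}")
```

Here the two cells contribute 0.03² / 0.1 = 0.009 and 0.03² / 0.9 = 0.001, so w = √0.01 = 0.1, a small effect according to the thresholds above.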

Cohen’s h

Another measure used for comparing proportions from two independent samples is Cohen’s h, defined as follows:

$$h = 2\arcsin\sqrt{p_1} - 2\arcsin\sqrt{p_2}$$

where p_1 and p_2 stand for the proportions of the positive cases in the first and second sample, respectively. To assess the magnitude of the effect size, the author suggests the same range of indicative values as in the case of Cohen’s d.

Cohen's h: 0.174943
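Cohen’s h is simple enough to compute with the standard library alone. The sketch below uses the arcsine transformation of the two proportions. The helper name `cohens_h` and the example proportions are assumptions for illustration.

```python
from math import asin, sqrt


def cohens_h(p1, p2):
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))


# The same absolute difference of 5 percentage points gives different values of h
print(f"50% vs 45%: h = {cohens_h(0.50, 0.45):.4f}")
print(f"10% vs  5%: h = {cohens_h(0.10, 0.05):.4f}")
```

Because of the arcsine transformation, a given absolute difference counts for more near 0 or 1 than around 0.5.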

Odds Ratio

The odds ratio expresses the effect size as how many times higher or lower the odds of an event are in the treatment group than in the control group, where the odds of an event with probability p are p / (1 − p).

Odds Ratio: 1.374506

The odds of an event (for example conversion) happening are ~1.37 times higher in the x group than in the y one, which is in line with the probabilities provided while generating the data.
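A calculation along these lines can be sketched as follows. The conversion probabilities of 12% and 10% are assumptions for illustration, not the ones used to obtain the value above.

```python
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical conversion flags; the probabilities are assumptions
x = rng.random(50_000) < 0.12   # treatment group
y = rng.random(50_000) < 0.10   # control group


def odds(sample):
    """Odds of the event: p / (1 - p), with p estimated as the sample mean."""
    p = np.mean(sample)
    return p / (1 - p)


odds_ratio = odds(x) / odds(y)
print(f"Odds Ratio: {odds_ratio:.4f}")
```

With these probabilities, the population odds ratio is (0.12 / 0.88) / (0.10 / 0.90) ≈ 1.23, and the sample estimate should land close to it.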

BONUS: Common language effect size

The common language effect size (CLES), described by McGraw and Wong [2], is the probability that a score sampled at random from one distribution will be greater than a score sampled from some other distribution.

To make the description as clear as possible, I will paraphrase the example mentioned in the paper. Imagine that we have a sample of heights of adult men and women, and the CLES is 0.8. This would mean that in 80% of randomly selected pairs, the man will be taller than the woman. Or to put it differently, in 8 out of 10 blind dates, the man will be taller than the woman.

The distribution of the randomly generated heights
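The CLES can be estimated by brute force: compare every observation from one sample with every observation from the other, and take the share of pairs in which the first is larger. The sketch below does this for hypothetical height data. The means and standard deviations are assumptions, not the values behind the figure.

```python
import numpy as np

rng = np.random.default_rng(21)
# Hypothetical height distributions in centimetres (parameters are assumptions)
men = rng.normal(178, 7, size=2_000)
women = rng.normal(165, 7, size=2_000)

# Compare every man with every woman and count how often the man is taller
cles = np.mean(men[:, None] > women[None, :])
print(f"CLES: {cles:.3f}")
```

The pairwise comparison needs memory proportional to the product of the two sample sizes, so for large samples it may be preferable to compare randomly drawn pairs instead.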

You can find the code used for this article on my GitHub. As always, any constructive feedback is welcome. You can reach out to me on Twitter or in the comments.

[2] McGraw, K. O., & Wong, S. P. (1992). A common language effect size statistic. Psychological Bulletin, 111(2), 361–365. https://doi.org/10.1037/0033-2909.111.2.361