A practical example of using the normal-normal model
Maybe you’re an investor trying to decide whether a stock is worth investing in. Maybe you’ve only recently heard of Bayesian inference and want to get a sense of how it can be applied in the real world. Maybe you’re a seasoned analyst who stumbled upon this article and found the title interesting. Regardless of where you come from, I thank you for giving this piece a read. I’m going to talk about the normal-normal model, one of the foundational models in Bayesian statistics, and how it can be used to estimate the growth rate of a company’s revenue. That estimate can then be used to decide whether or not the company is a worthwhile investment.
The first objective of this piece is to demonstrate how the normal-normal model can be used to incorporate a subjective overlay into data analysis. The second is to provide some intuition behind the normal-normal model and Bayesian inference in general without getting too bogged down in the mechanics. I’ll say it here and again at the end of the article, but this piece does not constitute investment advice. It is meant to be educational.
With that disclaimer out of the way, let’s get to it!
Financial modeling generally refers to projecting fundamental values for a company in order to arrive at a fair price estimate for the company’s stock. Some of the most common metrics used to arrive at valuations are revenue, earnings, and cash flow. The company we’re going to look at is MongoDB, a software services company. It began trading publicly back in 2017, and its revenue growth has been tremendous.
Given how young the company is and how it’s in a growth-oriented phase of its existence, it’s reasonable to focus on revenue in order to value the company. Data in the company’s 10-K filings, the annual financial reports, shows revenue numbers on a quarterly basis starting in fiscal 2016. Annual numbers are present from the year 2014. To give us more data than the six annual numbers (which translate into five growth numbers), I’ve computed rolling one-year revenue growth on a quarterly basis. That data is shown below.
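For readers who want to see the mechanics, here's a minimal sketch of one way to compute rolling one-year growth from quarterly revenue. It assumes "rolling one-year growth" means each quarter compared with the same quarter a year earlier; the revenue figures and the helper name `rolling_yoy_growth` are made up for illustration and are not MongoDB's reported numbers.

```python
# Hypothetical quarterly revenue (in $ millions), oldest first.
# Placeholder values only, NOT MongoDB's reported figures.
quarterly_revenue = [20.0, 22.0, 25.0, 28.0, 32.0, 36.0, 41.0, 46.0]


def rolling_yoy_growth(revenue, periods_per_year=4):
    """Growth of each quarter versus the same quarter one year earlier."""
    return [
        revenue[i] / revenue[i - periods_per_year] - 1.0
        for i in range(periods_per_year, len(revenue))
    ]


growth = rolling_yoy_growth(quarterly_revenue)
for g in growth:
    print(f"{g:.1%}")
```

Eight quarters of revenue produce four growth observations, which is why quarterly data gives a larger sample than year-end data alone.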
Closer to the end of this piece, I’ll compare the results of our analysis using year-end data versus quarterly data. (Although I haven’t run a formal analysis, I assume there’s a degree of serial correlation in the quarterly data. This won’t matter in terms of explaining the concepts of the normal-normal model, but it is certainly something to be mindful of in practice.)
A common way to project revenue for a company is to use the average historical revenue growth rate over a certain amount of time. For companies with many years of data, this isn’t necessarily a bad practice, especially if the growth rates follow a normal distribution. Given how little sample data we have and the histogram of the data which I’ll plot below, we may feel that using the sample mean in this case is unwise.
Bayesian inference is particularly useful in situations where our sample size is small and we hold a subjective belief that our sample data does not appropriately represent what a larger sample would look like.
To conduct Bayesian inference, we’ll need a prior distribution and a sampling model. Before defining those distributions in our context, I’ll go over some of the basics of Bayesian inference and how the prior distribution and sampling model come into play. Feel free to skip this section if you’re familiar with Bayes’ theorem and how it applies to distributions.
In its simplest form, Bayes’ theorem is defined as

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

which is equivalent to

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
This is all well and good if we have neatly defined probabilities to use, but distributions complicate the process a little.
First, let’s substitute A with θ and B with Y. In this case, Y refers to the points in our sample data, and θ refers to the true average growth rate in revenue for MongoDB. Re-writing the second form of the formula with our substitutions, we have

$$P(\theta \mid Y) = \frac{P(Y \mid \theta)\,P(\theta)}{P(Y)}$$
In words, the distribution we’re trying to model is the distribution of the average revenue growth rate GIVEN our sample growth rates, P(θ|Y). To get there, we will use our sample data and a little bit of judgement to define the sampling distribution P(Y|θ). We will also need a prior distribution P(θ) for our average growth rate and the marginal distribution of our data P(Y). The onus is on us to define both the sampling distribution and the prior distribution for θ. Once we have them, the correct way to obtain P(Y) would be to solve the integral below:

$$P(Y) = \int P(Y \mid \theta)\,P(\theta)\,d\theta$$
In practice, this may be difficult to do, but we can use a shortcut. P(Y) averages over every possible value of θ, so it doesn’t depend on θ at all: once our data is observed, it’s just a number. Its only job is to act as a normalizing constant that makes the area under the posterior distribution equal to 1 (the sum of all probabilities for an event equals 1). Rather than solve for this normalizing constant, we can instead say

$$P(\theta \mid Y) \propto P(Y \mid \theta)\,P(\theta)$$

where ∝ stands for “is proportional to.” In other words, we don’t need to worry about P(Y). With one task eliminated, we only have to define our sampling and prior distributions.
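To make the proportionality idea concrete, here's a rough sketch that evaluates likelihood times prior on a grid of θ values and then rescales so the area is 1. The rescaling step is all that P(Y) does. The sample values, the sampling standard deviation, and the grid bounds are invented for illustration; only the prior mean of 0.045 and prior standard deviation of 0.10 echo the values chosen later in this piece.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


# Hypothetical inputs, for illustration only.
y = [0.55, 0.65, 0.60]   # made-up sample growth rates
sigma = 0.15             # sampling standard deviation, assumed known
mu0, tau0 = 0.045, 0.10  # prior mean and prior standard deviation

# Evenly spaced grid of candidate values for theta, from -0.5 to 1.0.
step = 0.001
grid = [-0.5 + i * step for i in range(1501)]

# Unnormalized posterior: likelihood times prior at each grid point.
unnorm = []
for theta in grid:
    likelihood = math.prod(normal_pdf(yi, theta, sigma) for yi in y)
    unnorm.append(likelihood * normal_pdf(theta, mu0, tau0))

# Dividing by the total area plays the role of P(Y).
total = sum(unnorm) * step
posterior = [u / total for u in unnorm]

grid_mean = sum(t * p for t, p in zip(grid, posterior)) * step
print(f"approximate posterior mean: {grid_mean:.4f}")
```

The shape of the curve is fixed before the division happens; normalizing only changes the vertical scale, which is why the constant can be ignored when working out the form of the posterior.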
(Note: technically, Y is also conditional on the sampling variance σ². In this case, we are going to assume that the variance is known and constant, so we can omit it from the notation.)
We’re going to use a normal model for our sampling distribution. Having looked at the histogram for our data, one may think that there are distributions available to us that better represent the data. I like the normal distribution in this case because it is continuous and has support along all real numbers (revenue growth could theoretically be negative or positive).
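To define this sampling model, we compute the mean and variance for this data set. The sample variance serves as the (assumed known) variance of the sampling model, while the sample mean summarizes the data we’ll use to update our belief about θ. The form this will take is

$$Y_1, \ldots, Y_n \mid \theta, \sigma^2 \sim \text{i.i.d. } N(\theta, \sigma^2)$$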
where the first term represents the unknown true average growth rate for MongoDB’s revenue and the second term represents the variance of the growth rates; we will treat this variance as known. We could just as easily assume that we know our mean but not our variance or that we know neither; all three classes of situations are well-documented and have substantial literature regarding how to work them. The normal-normal model applies to the situation with known variance and unknown mean, hence why we are making our current assumptions.
Next, we need to define a prior distribution for θ. For the same reasons that we’re using a normal distribution for the sampling model (continuous, support along positive and negative values), we’re going to use a normal distribution as our prior. We need to define a mean and a variance for the variable θ. We’ll define this distribution as

$$\theta \sim N(\mu_0, \tau_0^2)$$
where the first term is the prior mean and the second term is the prior variance. There is significant literature dedicated to selecting priors; the main focus of this piece is how to apply the normal-normal model, so I didn’t put extensive effort in defining my prior distribution.
To select a value for the prior mean, I looked at the average growth rate of sales for the S&P 500 index over the last 19 years (multpl.com) and then multiplied it by the β of MongoDB. In the world of equities, β refers to the covariance of an individual stock’s returns with the returns of a broader basket of stocks (often called an index) divided by the variance of the index returns. MongoDB has a β of about 1.26 according to Seeking Alpha, a research site with news, data, and analyses of many stocks. Whenever we see a β > 1, we can assume that the stock we are looking at is more volatile than the index it is being compared to; for this reason, I multiply the revenue growth of the index by β. Other approaches could involve looking at slightly older companies in the software services industry or similar-age companies across industries. No method is perfect, and all are viable.
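As a small illustration of the β definition above (covariance with the index divided by the variance of the index), here's a sketch using invented return series. The returns and the index sales growth figure are hypothetical; the only number taken from the text is the idea of scaling index growth by β to get a prior mean.

```python
# Made-up periodic returns, for illustration only.
stock_returns = [0.04, -0.02, 0.06, 0.01, -0.03]
index_returns = [0.03, -0.01, 0.04, 0.01, -0.02]


def mean(values):
    return sum(values) / len(values)


def beta(stock, index):
    """Sample covariance of stock and index returns over index variance."""
    s_bar, i_bar = mean(stock), mean(index)
    cov = sum((s - s_bar) * (i - i_bar) for s, i in zip(stock, index))
    var = sum((i - i_bar) ** 2 for i in index)
    # The (n - 1) divisors cancel in the ratio, so they are omitted.
    return cov / var


b = beta(stock_returns, index_returns)

# Hypothetical long-run index sales growth, scaled by beta to form a prior mean.
index_sales_growth = 0.03
prior_mean = b * index_sales_growth
print(f"beta = {b:.2f}, prior mean = {prior_mean:.3f}")
```

In practice a published β (such as the 1.26 figure quoted above) would be used directly instead of being re-estimated from a handful of returns.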
The next parameter we have to assign is the prior variance. Just to be clear, this is not what we presume is the variance in growth rates, but the presumed variance of the AVERAGE growth rate; this prior variance is meant to reflect our certainty in the accuracy of the prior mean. If we had full confidence that this was the correct mean to use, we could set our variance effectively equal to 0 (for computation purposes, we can’t actually use 0, but we can use a very small number such as .00001). On the other hand, if we have very little confidence in our estimate, we can use a large variance to indicate this level of uncertainty. In this case, where our prior mean is about 4.5%, I don’t have a strong opinion on how confident I am in this estimate. To define my distribution, I’ll use a standard deviation of 10%. With this, I’m effectively stating that I’m 95% confident that the true value for θ lies between -15.5% and 24.5% (4.5% ± 2 standard deviations).

This estimate may seem highly conservative given that MongoDB’s average growth rate has been about 61%, but this is exactly why Bayesian inference is powerful. MongoDB has spent the majority of its time trading in a bull market that was particularly favorable for software names. The prior distribution reflects data from multiple market cycles and consequently multiple phases of growth and contraction. Between the possibility of economic contraction, the chance MongoDB doesn’t execute its strategy effectively, and revenue growth slowing simply due to scale, I’m holding the subjective belief that MongoDB’s true average growth rate is less than what the sample data suggests. The prior distribution I’ve selected represents that belief. Now, we can study the output of our analysis.
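The interval implied by the prior can be checked in a couple of lines. This sketch uses the prior mean (4.5%) and standard deviation (10%) from the text and compares the "± 2 standard deviations" rule of thumb with the exact central 95% interval from Python's standard-library `NormalDist`.

```python
from statistics import NormalDist

prior_mean = 0.045  # prior mean from the text
prior_sd = 0.10     # prior standard deviation from the text

# Rule of thumb used in the text: mean +/- 2 standard deviations.
rough = (prior_mean - 2 * prior_sd, prior_mean + 2 * prior_sd)

# Exact central 95% interval from the normal quantile function.
prior = NormalDist(mu=prior_mean, sigma=prior_sd)
exact = (prior.inv_cdf(0.025), prior.inv_cdf(0.975))

print(f"rough: {rough[0]:.1%} to {rough[1]:.1%}")
print(f"exact: {exact[0]:.1%} to {exact[1]:.1%}")
```

The two intervals differ only slightly because the exact multiplier is about 1.96 instead of 2.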
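To recap, here are the forms for our two models:

$$Y_1, \ldots, Y_n \mid \theta, \sigma^2 \sim \text{i.i.d. } N(\theta, \sigma^2) \quad \text{(sampling model)}$$

$$\theta \sim N(\mu_0, \tau_0^2) \quad \text{(prior, with } \mu_0 = 0.045 \text{ and } \tau_0 = 0.10\text{)}$$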
Great, let’s move on to our analysis!
I’ll focus more on the intuition offered by these forms rather than walk through a derivation by hand. Anyone truly interested in using the normal-normal model should study the derivation of the above parameters. Wikipedia has some good documentation, and most introductory textbooks to Bayesian statistics cover the derivations in detail.
When we have a normal distribution for our sampling model as well as a normal prior distribution on the unknown mean θ, the resulting posterior distribution is proportional to the product of two normal densities. The power of the normal-normal model is that this product is also a normal distribution, albeit with updated parameters. In Bayesian jargon, a normal prior is a conjugate prior for a normal sampling model with known variance, meaning that the prior and its resulting posterior distribution have the same form. The fact that our posterior distribution is a normal distribution may not seem like that big of a deal, but depending on the data we’re trying to model and the parameters we’re trying to estimate, there are many instances where our posterior does not take such a familiar form. Because this posterior distribution is well-defined, we can sample from it directly and consequently compute summary statistics on it easily.
The notations and re-parametrizations below are from Chapter 5 in Peter Hoff’s textbook, “A First Course in Bayesian Statistical Methods,” the book I used in my first undergraduate Bayesian statistics course and the book I’ve been studying in recent times.
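Our posterior distribution takes the form

$$\theta \mid y_1, \ldots, y_n, \sigma^2 \sim N(\mu_n, \tau_n^2)$$

where the first term refers to the posterior mean and the second term refers to the posterior variance. The formulas to calculate these updated parameters are

$$\mu_n = \frac{\frac{1}{\tau_0^2}\,\mu_0 + \frac{n}{\sigma^2}\,\bar{y}}{\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}}, \qquad \tau_n^2 = \frac{1}{\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}}$$

Here μ₀ and τ₀² are the prior mean and variance, σ² is the sampling variance, ȳ is the sample mean, and n is the sample size.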
These formulas may look somewhat intimidating, but hopefully you see some similarities between them. A common practice and a particularly helpful one for gaining intuition about these formulas is to look at the formulas in terms of precision rather than variance. Precision is the inverse of variance.
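In this case, we have three relevant precisions to observe: the prior precision, the sampling precision, and the posterior precision (the inverses of the prior variance τ₀², the sampling variance σ², and the posterior variance τₙ²):

$$\tilde{\tau}_0^2 = \frac{1}{\tau_0^2}, \qquad \tilde{\sigma}^2 = \frac{1}{\sigma^2}, \qquad \tilde{\tau}_n^2 = \frac{1}{\tau_n^2}$$

If we invert the posterior variance formula to calculate posterior precision, we see that the posterior precision in terms of variances is

$$\frac{1}{\tau_n^2} = \frac{1}{\tau_0^2} + \frac{n}{\sigma^2}$$

This can be written in terms of precisions as

$$\tilde{\tau}_n^2 = \tilde{\tau}_0^2 + n\,\tilde{\sigma}^2$$

In this form we can clearly see that the posterior precision is the sum of the prior precision and the sample precision multiplied by the sample size. We can also re-write the posterior mean in terms of precisions:

$$\mu_n = \frac{\tilde{\tau}_0^2}{\tilde{\tau}_0^2 + n\,\tilde{\sigma}^2}\,\mu_0 + \frac{n\,\tilde{\sigma}^2}{\tilde{\tau}_0^2 + n\,\tilde{\sigma}^2}\,\bar{y}$$

Here, we can clearly see that the posterior mean is a weighted average of the prior mean μ₀ and the sample mean ȳ, with each weight equal to that source’s share of the total precision.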
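The precision form translates almost line for line into code. Below is a minimal sketch of the update; the function name `normal_normal_posterior` and the sample inputs (sample mean 0.60, sample variance 0.04, n = 16) are hypothetical and are not the figures from the MongoDB data. Only the prior mean of 0.045 and prior variance of 0.01 (a 10% standard deviation) come from the text.

```python
def normal_normal_posterior(prior_mean, prior_var, sample_mean, sample_var, n):
    """Posterior mean and variance for a normal mean with known variance."""
    prior_precision = 1.0 / prior_var
    data_precision = n / sample_var  # sampling precision times sample size
    post_precision = prior_precision + data_precision

    # Weighted average: each weight is a share of the total precision.
    weight_prior = prior_precision / post_precision
    post_mean = weight_prior * prior_mean + (1.0 - weight_prior) * sample_mean
    post_var = 1.0 / post_precision
    return post_mean, post_var


# Prior from the text; sample inputs are made up for illustration.
post_mean, post_var = normal_normal_posterior(
    prior_mean=0.045, prior_var=0.01,
    sample_mean=0.60, sample_var=0.04, n=16,
)
print(f"posterior mean = {post_mean:.3f}, posterior variance = {post_var:.4f}")
```

With these made-up inputs the prior precision is 100 and the data precision is 400, so the prior carries 20% of the weight and the sample mean carries 80%.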
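For our data, the posterior parameters are a posterior mean of about 0.527 and a posterior standard deviation of about 0.0391 (a posterior variance of roughly 0.0015):

$$\mu_n \approx 0.527, \qquad \tau_n \approx 0.0391$$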
And there we have them — our updated parameters. Our posterior estimate for the average growth rate is about 52.7% — a decent bit lower than our sample average, but not overwhelmingly lower. We’ve taken a subjective belief, represented that belief with a distribution, and used that distribution to augment our analysis. Hooray! This is the power of Bayesian inference. As long as we can define our beliefs, we can incorporate them in a rigorous way in our analysis. Let’s talk a little more about what we have and also what we don’t have.
With our posterior standard deviation, we can compute a credible interval for our estimate. For those new to Bayesian statistics, a credible interval is not the same thing as a confidence interval even though they are computed in a similar manner. Our 95% credible interval for the average growth rate is 0.527 ± 2 × 0.0391, which leads to endpoints of 44.88% and 60.52%. With this credible interval, we’re making the statement that we’re 95% sure that the true average growth rate θ falls within the interval. Even at this point, we don’t treat this updated mean as a known entity. Furthermore, we are not saying that 52.7% is our forecast for revenue growth rate over the next rolling one-year period. If we wanted to make a forecast within this framework, we’d use the posterior predictive distribution. Since that is a separate topic, I won’t touch on it here, but the process of deriving that distribution is similar to deriving the posterior distribution.
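Because the posterior is a plain normal distribution, the interval can be reproduced directly. This sketch takes the posterior mean (0.527) and standard deviation (0.0391) from the text and shows three routes: the ± 2 standard deviation rule, the exact normal quantiles, and empirical quantiles from random draws. The draw count and seed are arbitrary choices.

```python
from statistics import NormalDist, quantiles

post_mean, post_sd = 0.527, 0.0391  # posterior parameters from the text
posterior = NormalDist(mu=post_mean, sigma=post_sd)

# Rule of thumb used in the text.
rough = (post_mean - 2 * post_sd, post_mean + 2 * post_sd)

# Exact central 95% interval.
exact = (posterior.inv_cdf(0.025), posterior.inv_cdf(0.975))

# Sampling route: draw from the posterior and take empirical quantiles.
draws = posterior.samples(10_000, seed=1)
cuts = quantiles(draws, n=40)      # 39 cut points at 2.5% steps
sampled = (cuts[0], cuts[-1])      # 2.5th and 97.5th percentiles

for name, (lo, hi) in [("rough", rough), ("exact", exact), ("sampled", sampled)]:
    print(f"{name}: {lo:.2%} to {hi:.2%}")
```

The sampling route is overkill for a normal posterior, but it is the same procedure one would use when the posterior has no convenient closed form.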
Two key implications should be noted from this analysis: the first is that as sample size grows larger, the posterior mean and posterior variance are more and more determined by the sample data. I’m not going to state that there’s an explicit cutoff, but at some amount of data, adding a prior doesn’t move the needle much all else equal. Intuitively, this is reasonable. If you have rich enough sampling data, the sampling data likely represents the actual structure in the data, and you may not see the need to utilize a prior distribution.
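To see the first implication numerically, here's a small sketch that holds everything fixed except the sample size. The sample mean (0.60) and sample variance (0.04) are hypothetical stand-ins; the prior mean of 0.045 and prior variance of 0.01 are the ones chosen earlier.

```python
def posterior_mean(prior_mean, prior_var, sample_mean, sample_var, n):
    """Precision-weighted average of the prior mean and the sample mean."""
    prior_precision = 1.0 / prior_var
    data_precision = n / sample_var
    return (prior_precision * prior_mean + data_precision * sample_mean) / (
        prior_precision + data_precision
    )


sample_sizes = [1, 5, 20, 100, 1000]
means = [
    posterior_mean(0.045, 0.01, sample_mean=0.60, sample_var=0.04, n=n)
    for n in sample_sizes
]

for n, m in zip(sample_sizes, means):
    print(f"n = {n:>4}: posterior mean = {m:.3f}")
```

As n grows, the data precision term dominates the weighted average and the posterior mean drifts toward the sample mean, regardless of where the prior started.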
To emphasize the first point, we can re-run our analysis using strictly the year-end data which would leave us with a sample size of five data points. Using the same prior distribution, our new sampling mean and variance are about 59.8% and .012 (or 11.1% standard deviation), and our posterior mean and variance are 23% and .0019 (or 4.45% standard deviation). This posterior estimate for the mean is much lower than what we saw in our first iteration; with our sample size cut significantly, the prior plays a much heavier role in the output. The standard deviation didn’t change as much, but we can see that it’s larger even though our sampling standard deviation was smaller the second time around. We have a much lower estimate, and we have slightly less confidence in the estimate (wider credible interval).
The second implication of our analysis is that the smaller the prior variance, the greater the prior precision and the greater impact it has on both the posterior mean and posterior variance. The more confidence we have in our prior, the more it will affect our posterior estimates. To illustrate this point, I re-ran our original analysis with different values for the prior variance. The values for the prior mean are all .045, and the sampling mean and variance come from our rolling revenue data. The table below shows the results of this experiment.
I’ll also plot the distributions.
Notice how much closer to the prior mean our posterior distribution with prior variance set to .05 is. As we increase our prior variance (effectively signifying less confidence in the prior mean), the center of our posterior distribution moves closer to the sample mean. Also, while the magnitude of the changes in the posterior variances may not appear that great in the table, from the distribution plots above, we can see how the distributions get progressively wider; in other words, the credible interval for the true value of average growth widens.
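The same kind of experiment can be sketched in a few lines. The prior variances, sample mean, sample variance, and sample size below are hypothetical and are not the values behind the results discussed above; only the prior mean of 0.045 is taken from the text.

```python
import math


def posterior_params(prior_mean, prior_var, sample_mean, sample_var, n):
    """Posterior mean and standard deviation under the normal-normal model."""
    prior_precision = 1.0 / prior_var
    data_precision = n / sample_var
    post_precision = prior_precision + data_precision
    post_mean = (
        prior_precision * prior_mean + data_precision * sample_mean
    ) / post_precision
    return post_mean, math.sqrt(1.0 / post_precision)


prior_variances = [0.0025, 0.01, 0.04, 0.16]  # made-up values to sweep over
results = [
    posterior_params(0.045, pv, sample_mean=0.60, sample_var=0.04, n=16)
    for pv in prior_variances
]

for pv, (m, sd) in zip(prior_variances, results):
    print(f"prior variance {pv:<6}: posterior mean = {m:.3f}, posterior sd = {sd:.4f}")
```

Loosening the prior both moves the posterior mean toward the sample mean and widens the posterior, which matches the pattern described above.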
Just to recap, we were analyzing a young company and wanted to estimate the true growth rate of its revenue. Given the small amount of sample data we had and a subjective belief that the average growth rate will be less than what the sample data suggests, we used Bayesian inference to augment our analysis. We defined a sampling model for our data, defined a prior for the average growth rate that reflected our subjective view, and utilized the normal-normal model to arrive at a posterior estimate and interval for the company’s average growth rate. I hope you found this brief introduction to Bayesian inference as well as the analysis of the results useful. I don’t recommend using the specific numbers in this piece for any valuation of MongoDB, but hopefully you can apply the concepts to your own analysis. I’m attaching a link to the GitHub repository for the code; nothing is particularly complicated, but I’ll share it in the spirit of transparency and reproducibility.
Lastly, I want to thank the friends and family members who took time to read my drafts and provide feedback throughout the process. As this is my first time writing about a project in this manner, their support is especially appreciated. Thanks, and take care!
The thoughts and views expressed in this report are mine alone and do not necessarily reflect the views of my firm. This report is intended to be educational in nature and should not be construed as individual investment advice nor as a recommendation to buy, sell, or hold any security or to adopt any investment strategy.
Hoff, Peter D. A First Course in Bayesian Statistical Methods. Springer, 2009. Print.