Going back to our example, you could imagine a model that has as many clusters as there are data points. See, no outliers!
But that wouldn’t be a very useful model.
All models are wrong, but some are useful.
— George E. P. Box
We have to balance the maximum likelihood of our model, L, against the number of model parameters, k. We seek the model with the fewest parameters that still does a good job explaining the data. So we introduce a penalty for the number of model parameters.
We are now most of the way to the Bayesian Information Criterion (BIC).
The BIC balances the number of model parameters k and number of data points n against the maximum likelihood function, L. We seek to find the number of model parameters k that minimizes the BIC.
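Writing L̂ for the maximized value of the likelihood function, the criterion is

$$\mathrm{BIC} = k \ln(n) - 2 \ln(\hat{L})$$

The first term grows with the number of parameters, penalizing complexity; the second term shrinks as the model fits the data better. The model with the lowest BIC strikes the best balance between the two.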
This form of the BIC derives from a 1978 paper by Gideon Schwarz. The derivation can be hard to follow, so we won’t go into it here.
Computing the maximum likelihood function is the hard part, but closed-form expressions exist for most common models. In linear regression, for example, the maximized log-likelihood is a simple function of the mean squared error.
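As a sketch of that linear regression case: assuming Gaussian residuals, the maximized log-likelihood depends on the data only through the residual sum of squares, so the BIC can be computed directly from the model's predictions. The helper name `linreg_bic` here is just for illustration.

```python
import numpy as np

def linreg_bic(y, y_pred, k):
    """BIC for a linear model with k parameters, assuming Gaussian
    residuals whose variance is estimated by maximum likelihood."""
    n = len(y)
    rss = np.sum((y - y_pred) ** 2)
    sigma2 = rss / n  # MLE of the noise variance
    # Maximized Gaussian log-likelihood in closed form
    log_l = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return k * np.log(n) - 2 * log_l
```

Note that for a fixed fit quality, the BIC grows with k: a model with more parameters must earn them by reducing the residuals.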
Standard machine learning libraries will usually compute the likelihood function for you, so don’t despair.
Let’s finish our example of data clustering. I want to cluster the data using a Gaussian mixture model and determine the best number of clusters to choose. In Python, using the scikit-learn library, here’s how:
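A minimal sketch of that workflow, using synthetic data with three well-separated blobs as a stand-in for the article's dataset: scikit-learn's `GaussianMixture` exposes a `bic` method, so we can fit one mixture per candidate cluster count and keep the one with the lowest score.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Illustrative data: three well-separated 2-D Gaussian blobs
# (a stand-in for the article's dataset).
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=center, scale=0.5, size=(100, 2))
    for center in [(-5, 0), (0, 5), (5, 0)]
])

# Fit a Gaussian mixture for each candidate number of clusters
# and record the BIC; lower is better.
candidates = range(1, 9)
bics = [
    GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
    for k in candidates
]
best_k = candidates[int(np.argmin(bics))]
print(best_k)
```

Plotting `bics` against `candidates` reproduces the curve discussed below.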
Plotting the BIC for different values of k, we can see how the BIC is minimized for 3 clusters.
The BIC agrees with our initial visual estimation. It also tells us that a larger number of clusters would fit the data fairly well, but at the cost of introducing more parameters.
You can always find a model that will fit your data, but that does not make it a great model. Following the principle of Occam’s razor, we should always choose the model that makes the fewest assumptions. In machine learning, an overfit model performs poorly in the wild.
With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.
— John von Neumann
Using the Bayesian Information Criterion, you can find the simplest possible model that still works well. Hopefully this article has given you an intuitive feeling for how it works.
G. E. Schwarz, “Estimating the Dimension of a Model” (1978), Annals of Statistics, 6 (2): 461–464