Naive Bayes starts with just counting and then moves to probability
Naive Bayes classification is one of the simplest and most popular algorithms in data mining and machine learning (listed among the top 10 algorithms in the CRC Press reference below). The basic idea behind Naive Bayes classification is very simple.
Let’s say we have books of two categories: one category is Sports and the other is Machine Learning. I count the frequency of the word “match” (Attribute 1) and the frequency of the word “algorithm” (Attribute 2). Let’s assume I have a total of 6 books from each of these two categories, and the word counts across the six books look like the figure below.
We can see clearly that the word ‘algorithm’ appears more in Machine Learning books and the word ‘match’ appears more in Sports books. Armed with this knowledge, suppose I have a book whose category is unknown. If Attribute 1 has a value of 2 and Attribute 2 has a value of 10, we can say the book belongs to the Machine Learning category.
Basically we want to find out which category is more likely, given attribute 1 and attribute 2 values.
This count-based approach works fine for a small number of categories and a small number of words. The same intuition is followed more elegantly using conditional probability.
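As a quick illustration of the counting intuition, here is a minimal sketch in Python. The word counts are hypothetical stand-ins for the figure above, and `classify` simply scores each category by its relative word frequencies.

```python
# Hypothetical word counts across the 6 books of each category,
# standing in for the figure in the text.
counts = {
    "Sports": {"match": 51, "algorithm": 3},
    "ML":     {"match": 6,  "algorithm": 58},
}

def classify(book_counts):
    # Score a category by weighting the unknown book's word counts
    # with the category's relative word frequencies.
    def score(cat):
        total = sum(counts[cat].values())
        return sum(n * counts[cat][w] / total for w, n in book_counts.items())
    return max(counts, key=score)

print(classify({"match": 2, "algorithm": 10}))  # the unknown book -> ML
```

With these counts, the heavy use of ‘algorithm’ dominates the score and the unknown book lands in the ML category.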
Conditional probability is again best understood with an example. Consider rolling a fair six-sided die, and define two events:
Event A: The face value is odd | Event B: The face value is less than 4
P(A) = 3/6 (favourable cases 1, 3, 5; total cases 1, 2, 3, 4, 5, 6), and similarly P(B) = 3/6 (favourable cases 1, 2, 3; total cases 1, 2, 3, 4, 5, 6). An example of a conditional probability is: what is the probability of getting an odd number (A) given that the number is less than 4 (B)? To find this, we first find the intersection of events A and B, and then divide by the number of cases in event B. More formally, this is given by the equation P(A|B) = P(A ∩ B) / P(B).
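The dice example can be checked directly by enumerating the outcomes; here is a small sketch using exact fractions.

```python
from fractions import Fraction

faces = [1, 2, 3, 4, 5, 6]            # a fair six-sided die
A = {f for f in faces if f % 2 == 1}  # odd face value: {1, 3, 5}
B = {f for f in faces if f < 4}       # face value less than 4: {1, 2, 3}

p_B = Fraction(len(B), len(faces))            # P(B) = 3/6
p_A_and_B = Fraction(len(A & B), len(faces))  # P(A ∩ B) = 2/6, cases {1, 3}
p_A_given_B = p_A_and_B / p_B                 # P(A|B) = P(A ∩ B) / P(B)
print(p_A_given_B)  # 2/3
```

Knowing the roll is below 4 leaves three equally likely faces, two of which are odd, so the conditional probability rises from 1/2 to 2/3.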
P(A|B) is the conditional probability and is read as “the probability of A given B”. This equation forms the central tenet. Let’s now go back to our book category problem and find the category of the book more formally.
Let’s use the following notation: Book = ML is Event A, Book = Sports is Event B, and “Attribute 1 = 2 and Attribute 2 = 10” is Event C. Event C is a joint event, and we will come back to it in a short while.
Hence the problem becomes this: we calculate P(A|C) and P(B|C). Let’s say the first has a value of 0.01 and the second 0.05; then our conclusion is that the book belongs to the second class. This is a Bayesian classifier; naive Bayes additionally assumes that the attributes are independent given the class. Hence:
P(Attribute 1 = 2 and Attribute 2 = 10 | Class) = P(Attribute 1 = 2 | Class) × P(Attribute 2 = 10 | Class). Let’s call these conditions x1 and x2 respectively.
Hence, using the likelihood and the prior, we calculate the posterior probability: P(A|C) = P(C|A) × P(A) / P(C). And because we assume the attributes are independent given the class, the likelihood P(C|A) expands as a product of the individual attribute probabilities.
The above equation is shown for two attributes; however, it can be extended to more. So for our specific scenario, the equation becomes P(Book=ML | x1, x2) = P(x1 | Book=ML) × P(x2 | Book=ML) × P(Book=ML) / P(x1, x2). It is shown only for Book = ’ML’; it is done similarly for Book = ’Sports’.
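To make the posterior calculation concrete, here is a small numeric sketch. All the probabilities below are hypothetical stand-ins for values that would be estimated from the twelve training books.

```python
# Hypothetical prior and per-class attribute likelihoods (assumed values).
p_ml = 0.5                   # P(Book = ML): 6 of the 12 books
p_x1_given_ml = 0.10         # P(Attribute 1 = 2 | ML), assumed
p_x2_given_ml = 0.30         # P(Attribute 2 = 10 | ML), assumed

p_sports = 0.5               # P(Book = Sports)
p_x1_given_sports = 0.25     # P(Attribute 1 = 2 | Sports), assumed
p_x2_given_sports = 0.02     # P(Attribute 2 = 10 | Sports), assumed

# Unnormalised posteriors: prior times likelihood, with the likelihood
# factored into a product because the attributes are assumed independent.
score_ml = p_ml * p_x1_given_ml * p_x2_given_ml
score_sports = p_sports * p_x1_given_sports * p_x2_given_sports

# Normalise by the evidence P(x1, x2) so the two posteriors sum to 1.
evidence = score_ml + score_sports
print("P(ML | x1, x2)     =", score_ml / evidence)
print("P(Sports | x1, x2) =", score_sports / evidence)
```

Whichever class has the larger posterior wins; with these assumed numbers, ML comes out ahead. Note that the evidence P(x1, x2) is the same for both classes, so for classification alone the unnormalised scores are enough to compare.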
Let’s use the famous Flu dataset for naive Bayes and import it; you can change the path to your copy. You can download the data from here.
Encoding the Data:
We store the columns in different variables and encode them:
import pandas as pd
from sklearn import preprocessing

# Collecting the variables; adjust the path to your copy of the data
nbflu = pd.read_csv('flu.csv')
x1, x2, x3, x4 = (nbflu.iloc[:, i] for i in range(4))
y = nbflu.iloc[:, 4]

# Encoding the categorical variables
le = preprocessing.LabelEncoder()
x1, x2, x3, x4 = (le.fit_transform(c) for c in (x1, x2, x3, x4))
y = le.fit_transform(y)

# Getting the encoded data into a DataFrame
X = pd.DataFrame(list(zip(x1, x2, x3, x4)))
Fitting the Model:
In this step, we first train the model and then predict for a patient:
from sklearn.naive_bayes import CategoricalNB

model = CategoricalNB()
# Train the model using the training sets
model.fit(X, y)
# Predict for a patient encoded as [1, 0, 0, 1]
predicted = model.predict([[1, 0, 0, 1]])
print(predicted, model.predict_proba([[1, 0, 0, 1]]))
Predicted Value: 
The output tells us that the probability of not Flu is 0.31 and the probability of Flu is 0.69; hence the conclusion is Flu.
Naive Bayes works very well as a baseline classifier: it is fast, it can be trained on a small number of examples, and it copes with noisy data. Its main limitation is the assumption that the attributes are independent.
Wu X, Kumar V, editors. The Top Ten Algorithms in Data Mining. CRC Press; 2009.