Bayes Theorem and Naive Bayes | alexhwoods

One thing I love to do to practicemachine learning is code an algorithm myself. I mean without external packages, just trying toimplement all the math and everything else myself.

Naive Bayes is a great algorithm for this, because it’s relatively conceptually easy, and it gets pretty decent results. But before we can talk about Naive Bayes, we have to talk about conditional probability.

Bayes Theorem

To understand most statistics, we need to understand conditional probability. Bayes theorem states that

where P(A|B) is the probability that A is true,giventhat B is true. Most analysis these days is Bayesian, as it is a way to inject practical subjectivity into a model.

Naive Bayes

Naive Bayes is an algorithm using basic probability in order to classify. Lets define a few probabilities in the context of an example so its more clear.In this example we are classifying students as accepted or rejected to some school (the data is from UCLA!).

The prior – The probability of the class. In this case, if A = accepted, then

where N_A = number of accepted students in the data, and N = number of total students in the data.

Likelihood – This is a conditional probability. If we assume some student (some vector x) is accepted, then we can calculate the probability of him having a gpa as low or as high as he did, given he was accepted.

If we have to calculate this for real values, we use a Normal distribution. If we are doing it for binaryvalues, a Bernoulli distribution, and for categoricalmultinoulli. We can use different distributions for different variables, each within our model.

Posterior – The probability of the prior and the likelihood, normalized. In math this looks like the prior x the likelihood, divided by the probability of the vector x. When were writing a Naive Bayes algorithm however, we dont care about the normalizing, or the dividing by the probability of vector x. This is because it is constant and has no impact no our classification. So the denominator in both of these equationsignore it.

When calculating the likelihood, we have to do it for each feature. We can simply multiply the probabilities, because the model assumes conditional indepence, a strong, likely false assumption. This is why the algorithm is called “naive”.

Example

Which students get accepted to UCLA?

We have two classes – accepted and rejected. We have three variables gre score, gpa, and rank. We can use each of these to write a classifier.

Notes –

Im using R. I chose it over Python for this one because it felt so statistical, but you could just as easily do this in Python.

train is the training data, which I randomly sampled by class.

Prior

First we need to calculate the prior probabilities for each class. My favorite way to do this is a subset. If you want to save memory, compress this into two lines.

note – sorry it’s a picture and not code! I really dislike WordPress for code snippets.

Likelihood

I go on to test this function and make sure it works properly. It does.

Posterior

And lastly, we need to predict for each value in the test data! We do this simply by taking the maximum posterior probability.

One last thing – Its been a while since Ive posted; Ive been really busy with school. Ill try to write more in the next few months! :)