Thursday, 21 May 2015

Naive Bayes (Sentiment Polarity dataset) PHP Implementation


Data Mining Classification Algorithm

Naive Bayes Coding in PHP


Naive Bayes

Bayesian classifiers are based around the Bayes rule, a way of looking at conditional probabilities that allows you to flip the condition around in a convenient way. A conditional probability is the probability that event X will occur, given the evidence Y. That is normally written P(X | Y). The Bayes rule allows us to determine this probability when all we have is the probability of the reverse condition, and of the two components individually: P(X | Y) = P(X)P(Y | X) / P(Y). This restatement can be very helpful when we're trying to estimate the probability of something based on examples of it occurring.
In this case, we're trying to estimate the probability that a document is positive or negative, given its contents. We can restate that so it's in terms of the probability of that document occurring, given that it has been predetermined to be positive or negative. This is convenient, because we have examples of positive and negative opinions from our data set above.
The thing that makes this a "naive" Bayesian process is that we make a big assumption about how we can calculate the probability of the document occurring: that it is equal to the product of the probabilities of each word within it occurring. This implies that there is no link between one word and another. This independence assumption is clearly not true: there are lots of words which occur together far more often than independence would suggest, but this convenient fiction massively simplifies things for us, and makes it straightforward to build a classifier.
We can estimate the probability of a word occurring given a positive or negative sentiment by looking through a series of examples of positive and negative sentiments and counting how often it occurs in each class. This is what makes this supervised learning - the requirement for pre-classified examples to train on.
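To make that concrete, here is a minimal sketch of the training step in PHP. The class name, property names ($tokenCounts, $totalTokens, $docCounts, $vocabulary) and the simple tokeniser are illustrative assumptions, not the original code:

<?php
// Illustrative sketch of training: count how often each token appears in each class.
class SentimentTrainer
{
    public $tokenCounts = array();   // [class][token] => occurrences of token in class
    public $totalTokens = array();   // [class] => total tokens seen in class
    public $docCounts   = array();   // [class] => number of training documents
    public $vocabulary  = array();   // token => true, for every distinct token seen

    public function train($document, $class)
    {
        // Lowercase and split on non-word characters to get tokens.
        $tokens = preg_split('/\W+/', strtolower($document), -1, PREG_SPLIT_NO_EMPTY);

        if (!isset($this->docCounts[$class])) {
            $this->docCounts[$class]   = 0;
            $this->totalTokens[$class] = 0;
            $this->tokenCounts[$class] = array();
        }
        $this->docCounts[$class]++;

        foreach ($tokens as $token) {
            if (!isset($this->tokenCounts[$class][$token])) {
                $this->tokenCounts[$class][$token] = 0;
            }
            $this->tokenCounts[$class][$token]++;
            $this->totalTokens[$class]++;
            $this->vocabulary[$token] = true;
        }
    }
}

Training is then just a matter of calling something like $trainer->train($document, 'positive') or $trainer->train($document, 'negative') for each pre-classified example in the data set.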
So, our initial formula looks like this:
P(sentiment | sentence) = P(sentiment)P(sentence | sentiment) / P(sentence)
We can drop the dividing P(sentence), as it's the same for both classes, and we just want to rank them rather than calculate a precise probability. We can use the independence assumption to let us treat P(sentence | sentiment) as the product of P(token | sentiment) across all the tokens in the sentence. So, we estimate P(token | sentiment) as
(count(this token in class) + 1) / (count(all tokens in class) + count(all distinct tokens))
The extra 1 and count of all distinct tokens is called 'add one' or Laplace smoothing, and stops a 0 finding its way into the multiplications. If we didn't have it, any sentence with an unseen token in it would score zero. We have implemented the above in PHP, in the classify function described below.
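What follows is a hedged sketch of such a classify function, built on the counting structures assumed in the training sketch above rather than the original class. It sums log probabilities instead of multiplying the raw ones, which keeps the product of many small numbers from underflowing to zero while preserving the ranking between classes:

<?php
// Illustrative sketch of classification using the counts built during training.
function classify($document, $tokenCounts, $totalTokens, $docCounts, $vocabulary)
{
    $tokens    = preg_split('/\W+/', strtolower($document), -1, PREG_SPLIT_NO_EMPTY);
    $totalDocs = array_sum($docCounts);
    $vocabSize = count($vocabulary);

    $bestClass = null;
    $bestScore = -INF;

    foreach ($docCounts as $class => $docCount) {
        // P(sentiment): the prior, estimated from the share of training documents.
        $score = log($docCount / $totalDocs);

        foreach ($tokens as $token) {
            $count = isset($tokenCounts[$class][$token]) ? $tokenCounts[$class][$token] : 0;
            // Laplace-smoothed P(token | sentiment):
            // (count(token in class) + 1) / (count(all tokens in class) + count(all distinct tokens))
            $score += log(($count + 1) / ($totalTokens[$class] + $vocabSize));
        }

        if ($score > $bestScore) {
            $bestScore = $score;
            $bestClass = $class;
        }
    }

    return $bestClass;
}

The highest-scoring class wins. Because we dropped the shared P(sentence) denominator, the scores rank the classes rather than giving true probabilities.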