Filtering Spam and Classifying Text Using the Naive Bayes Classifier

Implementation of the Naive Bayes classifier for filtering spam emails: what the Naive Bayes classifier and Bayes' theorem are, and how to classify text using the Naive Bayes classifier


Category: Machine Learning Tags: Python, Python 3

Naive Bayes Classifier Code Files

Introduction

    If you have ever looked at the spam folder of your Gmail account, you might have wondered how Gmail decides which emails go to spam and which to the inbox. In this article, we will learn how to classify a document, using spam filtering as an example.

Features: When working with text, we call the individual words features. Some features are more likely to appear in spam than in non-spam.

Category: A category represents a classification. In this article we will use two categories, ‘good’ and ‘bad’. A document may fall into one of these two; otherwise it will be categorized as ‘unknown’.

Implementation

     Let’s create a class called classifier with some methods, as shown below:

import re

class classifier:
    def __init__(self, categoryThreshold):
        #Holds features like {'money': {'good':1,'bad':2}}
        self.featureCategoryCount = {}
        #Holds category like {'good':5,'bad':8}
        self.categoryCount = {}
        #Holds category threshold like {'good':1,'bad':3}
        self.categoryThreshold = categoryThreshold


    #Breaks document and returns features
    def getFeatures(self, text):
        #Split on runs of non-word characters; keep words of reasonable length
        splitter = re.compile(r'\W+')
        words = [word.lower() for word in splitter.split(text) if 2 < len(word) < 20]

        return dict([(word, 1) for word in words])

    #Adds new feature
    def addFeature(self, feature, category):
        self.featureCategoryCount.setdefault(feature, {})
        self.featureCategoryCount[feature].setdefault(category, 0)
        #Incrementing feature's category value
        self.featureCategoryCount[feature][category] += 1

    #Adds new category
    def addCategory(self, category):
        self.categoryCount.setdefault(category, 0)
        #Incrementing category's value
        self.categoryCount[category] += 1

    #trains our algorithm
    def trainClassifier(self, text, category):
        features = self.getFeatures(text)
        for feature in features:
            self.addFeature(feature, category)

        self.addCategory(category)
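
Before moving on, the tokenizer inside getFeatures can be exercised on its own. Below is a minimal standalone sketch of the same splitting logic (the function name get_features here is just for illustration):

```python
import re

def get_features(text):
    # Split on runs of non-word characters and keep words of reasonable length
    splitter = re.compile(r'\W+')
    words = [w.lower() for w in splitter.split(text) if 2 < len(w) < 20]
    # Each surviving word becomes a feature with a presence flag of 1
    return dict((w, 1) for w in words)

print(get_features('You won a FREE quiz prize'))
```

Note that short words such as ‘a’ are dropped by the length filter, and repeated words collapse into a single feature key.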

Above we have three dictionaries initialized in the constructor. The first maintains each feature's good and bad counts; for a feature ‘money’ it might be:

{'money': {'good':1,'bad':2}}

The second maintains the overall count of good and bad documents, like:

{'good':5,'bad':8}

And the third maintains the thresholds, like:

{'good':1,'bad':3}

The threshold specifies how many times larger a category's probability must be than every other category's for the document to be assigned to it. Suppose the probability of good is 2 and of bad is 7: bad is more than 3 times good, so the document will be categorized as bad (because the threshold of bad is 3). If good is simply greater than bad, the document will be categorized as good (because the threshold of good is 1). If bad is greater than good but not at least three times greater, the document will be categorized as unknown.

We do this because an important email should never be filtered as spam; if some spam slips through as non-spam, that is an acceptable cost.
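
The decision rule described above can be sketched in isolation. The following is a minimal sketch, with the per-category scores and thresholds passed in as plain dicts (the function name pick_category is an assumption for illustration):

```python
def pick_category(scores, thresholds, default='unknown'):
    # The winner is the category with the highest score
    best = max(scores, key=scores.get)
    # Any rival whose score, scaled by the winner's threshold,
    # beats the winner makes the result ambiguous
    for cat, score in scores.items():
        if cat != best and score * thresholds[best] > scores[best]:
            return default
    return best

thresholds = {'good': 1, 'bad': 3}
print(pick_category({'good': 2, 'bad': 7}, thresholds))  # 'bad': 7 beats 2*3
print(pick_category({'good': 3, 'bad': 7}, thresholds))  # 'unknown': 7 < 3*3
```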

Our class has several methods so far, one of which trains the classifier by adding features and categories internally. Now we will create a method to calculate the probability of a feature within a category:

#Calculates probability Pr(feature|category)
def simpleProbability(self, feature, category):
    if feature not in self.featureCategoryCount or \
        category not in self.featureCategoryCount[feature] or \
        self.featureCategoryCount[feature][category] == 0:
        return 0
    else:
        # Probability = Number of favorable outcomes/Total number of possible outcomes
        return float(self.featureCategoryCount[feature][category]) / float(self.categoryCount[category])

We used formula:

Probability = Number of favorable outcomes/Total number of possible outcomes
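
As a quick worked example, using the sample dictionaries from earlier (where ‘money’ was counted twice in bad and bad has 8 documents), the formula gives 2/8 = 0.25. A minimal standalone sketch of the same calculation:

```python
def simple_probability(feature_category_count, category_count, feature, category):
    # Pr(feature|category) = times feature was seen in category / documents in category
    count = feature_category_count.get(feature, {}).get(category, 0)
    if count == 0:
        return 0.0
    return count / category_count[category]

fcc = {'money': {'good': 1, 'bad': 2}}
cc = {'good': 5, 'bad': 8}
print(simple_probability(fcc, cc, 'money', 'bad'))  # 2/8 = 0.25
```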

But the method above might not give a good probability initially, because we will not yet have trained the algorithm with a reasonable amount of data. We can address this by assuming an initial probability; 0.5 is a good number to start with, and we can then calculate a weighted probability:

#Calculates weighted probability
def weightedProbability(self, feature, category, weight=1.0, assumedProbability=0.5):
    probability = self.simpleProbability(feature, category)

    #Number of times the feature appeared across all categories
    total = sum(self.featureCategoryCount.get(feature, {}).get(cat, 0)
                for cat in self.categoryCount)

    return float(weight * assumedProbability + total * probability) / float(weight + total)

Weighted probability formula:

Weighted Probability = (weight*assumedProbability + total*probability)/(weight + total)

Let’s test the method above for the feature ‘quiz’:

threshold = {'good':1,'bad':3}
obj = classifier(threshold)
#Training algorithm
obj.trainClassifier('You won 100000$', 'bad')
obj.trainClassifier('Your credit card worth limit 5 lacs has been dispatched', 'bad')
obj.trainClassifier('Lifetime Free Membership', 'bad')
obj.trainClassifier('An over-due inheritance claim!!!', 'bad')
obj.trainClassifier('Get instant personal loan with zero paper work', 'bad')
obj.trainClassifier('Security alert', 'good')
obj.trainClassifier('Search job', 'good')
obj.trainClassifier('Latest news', 'good')
obj.trainClassifier('Javascript Framework Challenge', 'good')
obj.trainClassifier('Save 30% on new orders', 'good')
obj.trainClassifier('You won quiz', 'good')

print(obj.weightedProbability('quiz', 'good'))

Output:

0.3333333333333333
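
That figure checks out by hand: ‘quiz’ appears once overall, and there are six documents in the good category, so the simple probability is 1/6 and the weighted probability works out to one third:

```python
weight, assumedProbability = 1.0, 0.5
simple = 1 / 6   # 'quiz' seen in 1 of the 6 good documents
total = 1        # 'quiz' appeared once across all categories
weighted = (weight * assumedProbability + total * simple) / (weight + total)
print(weighted)  # roughly 0.3333
```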

We can use the method above to calculate the probability of a single feature, but we want the probability of the whole document. To get it, we multiply the probabilities of all features in the document (this independence assumption is what makes the classifier "naive"):

#Calculates document|category probability Pr(document|category)
def documentProbability(self, text, category):
    features = self.getFeatures(text)

    docProbability = 1
    #Multiply all feature probability in a category
    for feature in features:
        docProbability *= self.weightedProbability(feature, category)

    return docProbability
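
One practical caveat, not handled in the code above: multiplying many small probabilities can underflow to zero for long documents. A common workaround (an assumption here, not part of the original article) is to sum log probabilities instead, which preserves the ordering between categories:

```python
import math

def log_document_probability(feature_probs):
    # The sum of logs is monotonic in the product, so comparing
    # categories by log score gives the same ranking without underflow
    return sum(math.log(p) for p in feature_probs if p > 0)

print(log_document_probability([0.5, 0.25, 0.1]))  # log(0.0125)
```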

The documentProbability method gives the conditional probability Pr(document | category), but we need Pr(category | document) to find the category a document fits best. We will use Bayes' theorem:

Pr(A | B) = Pr(B | A) * Pr(A) / Pr(B)

So

Pr(category | document) = Pr(document | category) * Pr(category) / Pr(document)

We will ignore Pr(document) here because it is constant across all categories; since we only compare categories with each other, a constant factor can be dropped.

#Calculates Pr(category|document) = Pr(document|category) * Pr(category)/Pr(document)
def categoryProbability(self, text, category):
    docProb = self.documentProbability(text, category)
    #Pr(category) = documents in category / total documents
    catProb = float(self.categoryCount.get(category, 0)) / \
              float(sum(self.categoryCount.values()))
    #Ignoring Pr(document) since it is constant across categories for the same document
    return docProb * catProb

Now we will create a classify method, which returns the category of a document:

#Classifies a document
def classify(self, text, default='unknown'):
    probabilities = {}
    maxProb = 0
    bestCat = None
    for cat in self.categoryCount.keys():
        probabilities[cat] = self.categoryProbability(text, cat)
        if probabilities[cat] > maxProb:
            maxProb = probabilities[cat]
            bestCat = cat

    #No category scored above zero
    if bestCat is None:
        return default

    #Any rival close enough (per the winner's threshold) makes the result unknown
    for cat in probabilities.keys():
        if cat != bestCat:
            if probabilities[cat] * self.categoryThreshold[bestCat] > probabilities[bestCat]:
                return default
    return bestCat

Now let’s run this method:

threshold = {'good':1,'bad':3}
obj = classifier(threshold)
#Training algorithm
obj.trainClassifier('You won 100000$', 'bad')
obj.trainClassifier('Your credit card worth limit 5 lacs has been dispatched', 'bad')
obj.trainClassifier('Lifetime Free Membership', 'bad')
obj.trainClassifier('An over-due inheritance claim!!!', 'bad')
obj.trainClassifier('Get instant personal loan with zero paper work', 'bad')
obj.trainClassifier('Security alert', 'good')
obj.trainClassifier('Search job', 'good')
obj.trainClassifier('Latest news', 'good')
obj.trainClassifier('Javascript Framework Challenge', 'good')
obj.trainClassifier('Save 30% on new orders', 'good')
obj.trainClassifier('You won quiz', 'good')

print(obj.classify('You won 20000$'))
print(obj.classify('credit card offer'))
print(obj.classify('personal loan'))
print(obj.classify('Get instant personal loan with zero paper work'))

Output:

good

unknown

unknown

bad

Conclusion

    We have seen that the output categorizes the documents; the more we train the algorithm, the more its filtering accuracy will increase. The ‘unknown’ category can be treated as good initially, and the output can be fed back to train the algorithm on real data, so accuracy improves with the emails it receives.


Last modified on 21 October 2018
Nikhil Joshi


Reference:

Programming Collective Intelligence by Toby Segaran, published by O'Reilly
