# How does a Spam Classifier Work?

Did you ever wonder how the **spam classifier** in your email works? How does it know whether an email is spam or not?

One popular technique is called **Naive Bayes**, which is an example of a *Bayesian method*, so let’s learn more about how that works.

### Bayes’ Theorem

The probability of A given B is equal to the overall probability of A times the probability of B given A, divided by the overall probability of B.
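Written out as a formula, with $A$ and $B$ standing for any two events and assuming $P(B) > 0$, the statement above reads:

```latex
P(A \mid B) = \frac{P(A)\, P(B \mid A)}{P(B)}
```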

We can actually build a spam classifier with that: an algorithm that analyzes a set of known spam emails and a set of known non-spam emails, and *trains a model to predict* whether new emails are spam or not.

#### Example of a Spam Classifier

When people promise that you **“won”** stuff, it’s probably spam. So let’s work out the probability of an email being spam, given that it contains the word “won”. It works out to the overall probability of a message being spam, times the probability of containing the word “won” given that it’s spam, over the overall probability of containing the word “won”.

Now the numerator can just be thought of as the probability of a message being spam and containing the word “won”. But that’s a little bit different from what we’re looking for, because those are the odds out of the complete data set and not just the odds within messages that contain the word “won”.

The denominator is just the overall probability of a message containing the word “won”. Dividing by it narrows the result down to messages that contain “won”, which is what we are looking for.
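To make the arithmetic concrete, here is a minimal sketch of that calculation. The counts are made up purely for illustration and do not come from any real data set.

```python
# Hypothetical counts, invented for illustration only.
total_emails = 1000
spam_emails = 300
spam_with_won = 120   # spam emails that contain "won"
ham_with_won = 30     # non-spam emails that contain "won"

# The three pieces of Bayes' theorem.
p_spam = spam_emails / total_emails                     # P(spam)
p_won_given_spam = spam_with_won / spam_emails          # P("won" | spam)
p_won = (spam_with_won + ham_with_won) / total_emails   # P("won")

# P(spam | "won") = P(spam) * P("won" | spam) / P("won")
p_spam_given_won = p_spam * p_won_given_spam / p_won
print(round(p_spam_given_won, 2))  # prints 0.8
```

The result matches counting directly: of the 150 emails containing “won”, 120 are spam.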

#### What about words other than “Won”?

Our spam classifier should know about more than just the word **“Won”**. It should automatically pick up every word in the message and figure out how much each one contributes to the likelihood of the email being spam.

So what we can do is *train our model on every word* that we encounter during training. Then, when we go through all the words in a new email, we *multiply together the spam probabilities* for each word, and we get the overall probability of that email being spam.
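The idea can be sketched from scratch in a few lines. This is only an illustration under some assumptions of my own: the four training messages are invented, the names `log_score` and `classify` are hypothetical helpers, and two common refinements not mentioned above are added, namely add-one smoothing (so an unseen word does not zero out the product) and summing logarithms instead of multiplying raw probabilities (to avoid very small numbers).

```python
import math
from collections import Counter

# Tiny made-up training set, for illustration only.
training = [
    ("you won a free prize", "spam"),
    ("claim your free money now", "spam"),
    ("are we meeting for lunch today", "ham"),
    ("see you at the meeting", "ham"),
]

# Count how often each word appears in each class.
word_counts = {"spam": Counter(), "ham": Counter()}
message_counts = Counter()
for text, label in training:
    message_counts[label] += 1
    word_counts[label].update(text.split())

vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])

def log_score(text, label):
    """log P(label) plus the sum of log P(word | label), with add-one smoothing."""
    total_words = sum(word_counts[label].values())
    score = math.log(message_counts[label] / sum(message_counts.values()))
    for word in text.split():
        if word not in vocabulary:
            continue  # ignore words never seen during training
        count = word_counts[label][word]
        score += math.log((count + 1) / (total_words + len(vocabulary)))
    return score

def classify(text):
    # Pick the class with the higher score.
    return max(("spam", "ham"), key=lambda label: log_score(text, label))

print(classify("you won free money"))   # prints spam
print(classify("lunch meeting today"))  # prints ham
```

Adding the logs is equivalent to multiplying the per-word probabilities, so the class with the higher sum is also the class with the higher product.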

### Implementation of Spam Classifier

1. **Importing Libraries**

import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.naive_bayes import MultinomialNB, GaussianNB

from sklearn import svm

from sklearn.model_selection import GridSearchCV

2. **Read Data and assign it to Variables**

dataframe = pd.read_csv("spam.csv")

x = dataframe["EmailText"]

y = dataframe["Label"]

print(dataframe.describe())

3. **Split Data into Train and Test Sections**

x_train,y_train = x[0:4457],y[0:4457]

x_test,y_test = x[4457:],y[4457:]

4. **Using CountVectorizer to turn the text into word-count features**

cv = CountVectorizer()

features = cv.fit_transform(x_train)

5. **Building a Model** (a support vector classifier tuned with grid search)

tuned_parameters = {'kernel': ['rbf','linear'], 'gamma': [1e-3, 1e-4],'C': [1, 10, 100, 1000]}

model = GridSearchCV(svm.SVC(), tuned_parameters)

model.fit(features,y_train)

6. **Checking the Best Parameters**

print(model.best_params_)

7. **Evaluating Model**

print(model.score(cv.transform(x_test),y_test))


### Use of Naïve Bayes in Spam Classifier

Naive Bayes is a **classification** technique based on **Bayes**’ Theorem. It’s called naive because we’re *assuming that there are no relationships between the words* themselves.

We’re just *looking at each word in isolation*, individually within a message, and combining each word’s contribution to the probability of the message being spam or not.

This sounds tough, but scikit-learn in Python actually makes this pretty easy to do.
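As a rough sketch of what that could look like, the snippet below pairs `CountVectorizer` with scikit-learn’s `MultinomialNB`. The training messages and labels are invented for illustration, so treat this as a toy example rather than a tuned model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up training set, for illustration only.
train_texts = [
    "you won a free prize",
    "claim your free money now",
    "are we meeting for lunch today",
    "see you at the meeting",
]
train_labels = ["spam", "spam", "ham", "ham"]

# Turn each message into a vector of word counts.
vectorizer = CountVectorizer()
train_counts = vectorizer.fit_transform(train_texts)

# Fit a multinomial Naive Bayes model on those counts.
classifier = MultinomialNB()
classifier.fit(train_counts, train_labels)

# New messages must go through the same fitted vectorizer.
new_texts = ["you won free money", "lunch meeting today"]
predictions = classifier.predict(vectorizer.transform(new_texts))
print(predictions.tolist())
```

Note that new messages are passed through `vectorizer.transform`, not `fit_transform`, so they are mapped onto the vocabulary learned during training.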

That’s Naive Bayes in action, and now that you have that cleared up, you can go and classify some spam or ham messages.

Thank you for reading this post. I hope you enjoyed it and learned something new today. Feel free to contact me through my blog if you have questions; I will be more than happy to help.

Stay safe and Happy learning!