Have you ever asked yourself how Gmail filters the emails you get into categories without any interaction from you?
Well, Google uses Machine Learning (ML) to sort your inbox automatically. Instead of leaving every message in a single list, it groups your emails into useful categories.
Email Categorization Methods:
Categorization, or filtering, can be done with several techniques:
1- Content-based filtering technique
Machine Learning algorithms such as Naïve Bayes can generate filtering rules automatically. These algorithms analyze the content of an email: the words it contains, how those words are distributed across the message, and how many times each word occurs.
The next step is to run the generated rules against incoming emails, for example to filter promotional emails that contain offers or deals.
Other ML algorithms that can be used include Support Vector Machines (SVM), K-Nearest Neighbours (KNN), and Neural Networks (NN).
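To make the word-counting idea concrete, here is a minimal sketch of a Naïve Bayes classifier written in plain Python. The training emails, the label names ("promotions" and "primary"), and the function names are all invented for illustration; this is not how Gmail is implemented, only one simple way the technique could look.

```python
import math
from collections import Counter

# Hypothetical toy training data: (text, label) pairs invented for illustration.
TRAINING = [
    ("huge sale offer deal today", "promotions"),
    ("exclusive deal discount offer", "promotions"),
    ("meeting agenda for monday", "primary"),
    ("project report attached for review", "primary"),
]

def train(examples):
    """Count word occurrences per label and the number of emails per label."""
    word_counts = {}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Return the label with the highest log-probability under Naive Bayes."""
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_emails = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, counts in word_counts.items():
        # Log prior: how common this label is in the training data.
        score = math.log(label_counts[label] / total_emails)
        total_words = sum(counts.values())
        for word in text.lower().split():
            # Add-one (Laplace) smoothing so unseen words do not zero the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

word_counts, label_counts = train(TRAINING)
print(classify("special offer and discount deal", word_counts, label_counts))
print(classify("agenda for the project meeting", word_counts, label_counts))
```

With this toy data, an email full of words like "offer" and "deal" lands in the promotions category, because those words occur more often in the promotional training emails.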
2- Case-based Spam Filtering
In this method, all emails are first extracted from both the spam and non-spam folders in the user’s mailbox. Next, the email data is retrieved through the client interface and separated into two vectors, one for each class. Finally, an ML algorithm is trained and tested on these datasets so that it can decide whether incoming emails are spam or not.
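The steps above can be sketched roughly as follows. The folder contents are made up, and the word-vote classifier is a deliberately simple stand-in for whatever ML algorithm a real filter would train; the point is only to show the flow of labeling the two folders, splitting the data, training, and testing.

```python
import random
from collections import Counter

# Hypothetical mailbox contents, invented for illustration.
SPAM_FOLDER = [
    "win a free prize now",
    "free money claim now",
    "cheap loans free offer",
    "claim your free prize",
]
NON_SPAM_FOLDER = [
    "meeting moved to monday",
    "please review the report",
    "lunch on monday with the team",
    "report for the team meeting",
]

def build_dataset(spam, non_spam):
    """Label emails from the two folders: 1 for spam, 0 for non-spam."""
    return [(text, 1) for text in spam] + [(text, 0) for text in non_spam]

def split_dataset(dataset, test_fraction=0.25, seed=0):
    """Shuffle the labeled emails and hold out a fraction for testing."""
    shuffled = list(dataset)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def train(training_set):
    """Count how often each word appears in spam and in non-spam emails."""
    counts = {0: Counter(), 1: Counter()}
    for text, label in training_set:
        counts[label].update(text.lower().split())
    return counts

def predict(text, counts):
    """Predict spam (1) if the words were seen more often in spam emails."""
    words = text.lower().split()
    spam_votes = sum(counts[1][w] for w in words)
    ham_votes = sum(counts[0][w] for w in words)
    return 1 if spam_votes > ham_votes else 0

def accuracy(test_set, counts):
    """Fraction of held-out emails that are labeled correctly."""
    correct = sum(predict(text, counts) == label for text, label in test_set)
    return correct / len(test_set)

dataset = build_dataset(SPAM_FOLDER, NON_SPAM_FOLDER)
training_set, test_set = split_dataset(dataset)
model = train(training_set)
print(f"accuracy on held-out emails: {accuracy(test_set, model):.2f}")
```

Testing on emails the model never saw during training is what tells us whether it will cope with new incoming mail, not just the mail it memorized.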
3- Rule-based spam filtering technique
This technique tests a message against a large number of patterns, such as regular expressions. A regular expression is a sequence of characters that defines a search pattern.
For example, if we want to collect the phone numbers in the footer of an email, we can use the pattern for phone numbers in Egypt: +20 followed by 10 digits. Other patterns can detect various other types of content.
In a rule-based spam filter, the score of an email is the number of patterns it matches, with a deduction for each pattern it does not match. If the score rises above a specific value, the email is filtered as spam.
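Here is a small sketch of that scoring idea using Python's built-in `re` module. The four spam patterns and the threshold value are assumptions chosen for the example; a real filter would use a far larger rule set. The phone number pattern follows the description above (+20 followed by 10 digits).

```python
import re

# Hypothetical rules for illustration only.
SPAM_PATTERNS = [
    re.compile(r"\bfree\b", re.IGNORECASE),
    re.compile(r"\bwinner\b", re.IGNORECASE),
    re.compile(r"click here", re.IGNORECASE),
    re.compile(r"\$\d+"),
]

# Egyptian phone number as described in the text: +20 followed by 10 digits,
# and not followed by a further digit.
EGYPT_PHONE = re.compile(r"\+20\d{10}(?!\d)")

SPAM_THRESHOLD = 1  # assumed cutoff; scores above this count as spam

def spam_score(message):
    """Add 1 for every pattern that matches and subtract 1 for every miss."""
    score = 0
    for pattern in SPAM_PATTERNS:
        if pattern.search(message):
            score += 1
        else:
            score -= 1
    return score

def is_spam(message):
    """Flag the message when its score rises above the threshold."""
    return spam_score(message) > SPAM_THRESHOLD

print(is_spam("You are a WINNER! Click here to claim your free $500 prize"))
print(is_spam("Lunch at noon tomorrow?"))
print(EGYPT_PHONE.findall("Call us at +201234567890 or +20123"))
```

The first message matches all four patterns and is flagged, the second matches none and is not, and the phone pattern picks out only the number with the full 10 digits after +20.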
4- Previous likeness-based spam filtering technique
This approach uses k-nearest neighbours (KNN), a memory-based ML algorithm that classifies emails according to their resemblance to the training data.
The training data here is the set of previously labeled emails that is used for the learning phase.
The algorithm places the training emails as points in a multi-dimensional vector space, where each point belongs to a specific class. When a new point is tested, the algorithm finds its nearest points and assigns it to the class that most of them belong to.
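A minimal KNN sketch in plain Python might look like this. The labeled emails, the "spam"/"ham" label names, the word-count vectors, and the choice of k = 3 with Euclidean distance are all assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical labeled emails standing in for the training data.
TRAINING = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("cheap pills free offer", "spam"),
    ("team meeting at noon", "ham"),
    ("lunch with the team", "ham"),
    ("notes from the meeting", "ham"),
]

def vectorize(text, vocabulary):
    """Turn an email into a point: one word-count coordinate per known word."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

def knn_classify(text, training, k=3):
    """Assign the class held by the majority of the k nearest training points."""
    vocabulary = sorted({w for t, _ in training for w in t.lower().split()})
    query = vectorize(text, vocabulary)
    distances = []
    for train_text, label in training:
        point = vectorize(train_text, vocabulary)
        # Euclidean distance between the new point and a training point.
        distances.append((math.dist(query, point), label))
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

print(knn_classify("claim your free money", TRAINING))
print(knn_classify("meeting with the team", TRAINING))
```

Notice that there is no separate training step: the algorithm simply keeps the labeled emails in memory and compares each new email against them.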
5- Adaptive spam filtering technique
This technique takes spam email data and groups it into different classes. It divides the email corpus into parts, and each part is represented by its own characteristic text.
The next step is to compare each incoming email with each part. The email is labeled as spam if its similarity to one of the parts reaches a specific percentage.
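As a rough sketch of that comparison step, the example below gives each spam group one representative text and measures similarity as the share of words two texts have in common (Jaccard similarity). The group names, the representative texts, the similarity measure, and the 40% threshold are all assumptions picked for illustration.

```python
# Hypothetical spam groups, each with an invented representative text.
SPAM_GROUPS = {
    "lottery": "you have won the lottery claim your prize now",
    "pharmacy": "cheap pills online pharmacy discount no prescription",
}

SIMILARITY_THRESHOLD = 0.4  # assumed: 40% word overlap counts as spam

def jaccard(a, b):
    """Shared words divided by all distinct words across the two texts."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

def match_spam_group(email):
    """Return the most similar spam group, or None if none is similar enough."""
    best_group, best_similarity = None, 0.0
    for group, representative in SPAM_GROUPS.items():
        similarity = jaccard(email, representative)
        if similarity > best_similarity:
            best_group, best_similarity = group, similarity
    if best_similarity >= SIMILARITY_THRESHOLD:
        return best_group
    return None

print(match_spam_group("claim your lottery prize now"))
print(match_spam_group("see you at the meeting tomorrow"))
```

The first email shares five of nine distinct words with the lottery group and is labeled as spam, while the second shares too few words with any group and passes through.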
Of course, ML doesn’t label emails with 100% accuracy, but with time and more data it can keep improving and give us the best experience possible.
Now, open your Gmail and take a look with a different mindset this time 🙂