Content-Based Spam Email Filtering

pravat kc
Jan 3, 2022

An email can contain text, URLs, images, and videos. Spam has been a serious problem in email communication for a long time, and an attacker can use spam to redirect users to phishing websites. To prevent such malicious attacks, email providers use various types of spam filtering based on the contents of the email. As spam filters evolve, spamming techniques evolve just as quickly. Spam filtering technologies use algorithms that classify emails by their contents: they compare an email against standard inputs, group its features into subsets, and classify the email based on those subsets of features. This paper is based on research into data mining, machine learning, and multilevel approaches to spam email classification and filtering. It also discusses the pre-processing of emails and the separation of spam from homogeneous work emails.

As the world moves rapidly into the digital era, email has become essential. Even as other modes of communication grow at a rapid pace, email remains one of the most widely used forms of official communication. That popularity also puts email at high risk from attackers and spammers, especially because it is mostly used officially to exchange important information. Spam emails are the biggest threat to email communication. They are junk emails or unsolicited bulk messages that contain malicious URLs, image links, or just plain text with no significant information. Because email is so commonly used, more and more protective methods have been created, such as stricter policies, stronger algorithms against server-based attacks, and spam filtering. This paper describes the types of spam filters available and goes into detail about content-based spam filtering, focusing on major filtering theories and algorithms such as Bayesian filters and other heuristics.

Spam Filters

As the name suggests, spam filters are programs or techniques used to detect spam emails and separate them from genuine emails. Various methods have been developed over the years to filter spam, such as blacklists and whitelists, word-based filters, heuristic filters, and content-based filters. Newer research has shown that machine learning can be incorporated into a spam filter to make it more effective, and most spam filters are now moving toward machine learning and artificial intelligence. Researchers around the world have also experimented successfully with filters that detect spam and ham in emails written in other languages with good efficiency. Common methodologies and techniques used with spam filters are explained below.

Blacklist/Whitelist/Greylist
Heuristic Filters
Content-Based Filters

Bayesian Algorithm and its working

Traditional heuristic-based filters did a particularly bad job because they produced many false positives, with a lot of legitimate emails ending up in the spam folder. A Bayesian filter instead looks at the words used in an email and assigns a probability that the email is spam given that those words appear.
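The probability described above can be written with Bayes' theorem. As a sketch, for a single word w (the notation is illustrative and not taken from the report), the first line gives the chance that an email is spam given that w appears. The second line extends this to several words under the usual "naive" assumption that words occur independently within each class.

```latex
P(\text{spam} \mid w)
  = \frac{P(w \mid \text{spam})\, P(\text{spam})}
         {P(w \mid \text{spam})\, P(\text{spam}) + P(w \mid \text{ham})\, P(\text{ham})}

P(\text{spam} \mid w_1, \dots, w_n) \;\propto\; P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam})
```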

Bayesian Algorithm

The Bayesian algorithm is a machine learning algorithm for classification problems. It is primarily used for text classification, which involves high-dimensional training data sets.
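To make the idea concrete, here is a minimal sketch of a word-based naive Bayes classifier using only the Python standard library. It is an illustration under stated assumptions, not the method of any paper cited here: the class name `NaiveBayesSpamFilter`, the tokenizer, and the use of add-one (Laplace) smoothing are all choices made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesSpamFilter:
    """Toy multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record one labeled email; label must be "spam" or "ham"."""
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def spam_probability(self, text):
        """Return the estimated posterior probability that text is spam."""
        if min(self.doc_counts.values()) == 0:
            raise ValueError("train on at least one spam and one ham email first")
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        # Words never seen in training carry no evidence, so skip them.
        known_words = [w for w in tokenize(text) if w in vocab]
        log_scores = {}
        for label in ("spam", "ham"):
            # Start from the log prior, then add one log likelihood per word.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in known_words:
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocab)))
            log_scores[label] = score
        # Normalize the two log scores into a probability.
        top = max(log_scores.values())
        weights = {k: math.exp(v - top) for k, v in log_scores.items()}
        return weights["spam"] / (weights["spam"] + weights["ham"])
```

After training on a few labeled messages with `train(text, "spam")` and `train(text, "ham")`, calling `spam_probability(text)` returns a value between 0 and 1. Working in log space avoids numeric underflow when an email contains many words.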

Hierarchical Framework

The four filtering layers are arranged hierarchically and are managed by the controlling unit. Every layer in the filtering module can make filtering decisions. The OCR layer is the exception in that, in addition to making filtering decisions, it hands its output to the text classification layer for further analysis, as shown in Figure [1]. Optical Character Recognition (OCR) is a technique in which the filter analyzes text embedded in images. Because analyzing text with OCR alone has posed various challenges, the OCR layer is coupled with the text classification layer in this framework [5].
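The layering described above can be sketched roughly as follows. This is a hedged illustration, not the framework from [5]: only the OCR and text classification layers are named in the text, so `sender_layer`, the `Email` class, the toy keyword list, and the `image_texts` field (a stand-in for what a real OCR engine would extract) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Email:
    sender: str
    body: str
    # Stand-in for text an OCR engine would extract from attached images.
    image_texts: List[str] = field(default_factory=list)


SPAM_WORDS = {"viagra", "lottery", "prize"}  # toy keyword list (assumption)


def text_classification_layer(text: str) -> Optional[str]:
    """Return "spam" if a toy keyword appears, otherwise None (no decision)."""
    words = set(text.lower().split())
    return "spam" if words & SPAM_WORDS else None


def sender_layer(email: Email) -> Optional[str]:
    """Hypothetical layer that decides on the sender address alone."""
    return "spam" if email.sender.endswith("@spam.example") else None


def body_layer(email: Email) -> Optional[str]:
    """Apply text classification to the email body."""
    return text_classification_layer(email.body)


def ocr_layer(email: Email) -> Optional[str]:
    """Pass text found in images to the text classification layer."""
    for text in email.image_texts:
        if text_classification_layer(text) == "spam":
            return "spam"
    return None


Layer = Callable[[Email], Optional[str]]
LAYERS: List[Layer] = [sender_layer, body_layer, ocr_layer]


def controlling_unit(email: Email, layers: List[Layer]) -> str:
    """Run the layers in order and stop at the first one that decides."""
    for layer in layers:
        verdict = layer(email)
        if verdict is not None:
            return verdict
    return "ham"  # no layer flagged the email
```

The controlling unit here is just a loop that stops at the first decisive layer. A real system would consult stored rules and reference emails from a data module rather than a hard-coded keyword set.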

Conclusion

As one of the most common means of communication, email is an integral part of day-to-day life. Spam is becoming a very serious issue in email communication, considering how common spam attacks are these days. Various spam filtering methods are used to push back against the increasing number of attacks, and throughout the history of spam filters they have been successful. But as filters keep evolving, attackers and spammers keep modifying their techniques to defeat them.

Spam filtering can also be used to prioritize email in addition to classifying it as spam or ham. That is, a filter can prioritize which emails to filter out based on the level of danger they pose to the victim system. This function exists but is rarely used. One of the most common and powerful spam filtering techniques is the content-based spam filter. As discussed in the paper, the Bayesian algorithm has been the favorite of researchers working on content-based filters, mostly because it computes with reference to prior and posterior probabilities. These let the algorithm move closer to accurate results with each iteration. Even without knowing the actual position or value of the content, Bayes' theorem helps the researcher predict the target from relative knowledge of the surrounding substance.

Machine learning has come a long way alongside spam filters. The spam filters available now are automated and mostly driven by a machine learning environment that constantly passes data from one part of the filter to another. The machine learning component evolves with each filter cycle: after each report it generates, or each decision to declare an email spam or ham, it records the result for future reference. In attempts to defeat even machine learning filters, spammers use deliberate spelling or grammatical errors, or pad the spam email with a chunk of text that looks like a legitimate email. For example, the word “Viagra” is typed as “V!agra” or “V1agra” to fool the filtering algorithm. With machine learning, it is difficult to get past the filter with simple typing tricks like these. Filtering algorithms take note of repeated words and phrases and can even predict which word follows a known word in a certain type of email. All the parameters in the machine learning algorithm, called features, work hand in hand with the filter model to form a strong spam filter.
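As a small illustration of why tricks like “V!agra” are easy to undo, the sketch below maps a few look-alike characters back to letters before matching. It is a hand-written toy and not how a learned filter works internally: the substitution table `LOOKALIKES` and both function names are assumptions made for this example.

```python
# Hypothetical table of look-alike characters and the letters they imitate.
LOOKALIKES = str.maketrans({"!": "i", "1": "i", "0": "o", "3": "e", "@": "a", "$": "s"})


def normalize(text):
    """Lowercase the text and replace look-alike characters with letters."""
    return text.lower().translate(LOOKALIKES)


def contains_blocked_word(text, blocked_words):
    """Check whether any blocked word appears once the text is normalized."""
    cleaned = normalize(text)
    return any(word in cleaned for word in blocked_words)
```

With this table, both `normalize("V!agra")` and `normalize("V1agra")` produce `"viagra"`. The approach is deliberately crude, since it also rewrites legitimate digits and punctuation, which is one reason statistical filters that learn such variants from data scale better than fixed tables.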

We also found various types of architectures used for spam filters, such as a three-tier architecture and a hierarchical framework. In a three-tier architecture, the system mainly focuses on sorting emails into two categories, spam and non-spam [8]. In the hierarchical framework, the architecture consists of two modules: the filter module, with four major layers, and the data module, which acts as data storage for previous records, reference legitimate emails, and the rules and protocols that the filter follows. Paper [5] describes such a model, and it is one of the few models with a dedicated layer for extracting and analyzing text embedded in images.

Content-based spam detection has various advantages over other spam filtering and detection methods. Content-based methods can not only detect spam keywords based on the rules or data sets they are pre-fed with, but can also detect things like spam URLs without harming the server or the victim machine for which the email is intended. The authors of [1] discuss URL detection and report that their experiment was successful, with a Bayesian classifier giving about 94.86% accuracy. Along with URL detection, content-based filters are also capable of detecting spam text embedded in images. We saw a framework that dedicates an OCR layer of the filter to detecting and analyzing text embedded in images. Content-based filters can also learn commonly used phrases, not just the words stored in the database, and compare an email as a whole against legitimate email to find correlations between the two.
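One way to inspect URLs without visiting them is to pull them out of the message text and look only at their structure. The sketch below is an assumption-laden illustration and not the feature set used in [1]: the regular expression, the function names, and the three features are chosen here purely as examples of what a classifier could be fed.

```python
import ipaddress
import re
from urllib.parse import urlparse

# Simple pattern for http(s) links; real emails need more careful parsing.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_urls(text):
    """Return every http(s) URL found in the text, without fetching any."""
    return URL_PATTERN.findall(text)


def url_features(url):
    """Compute a few illustrative structural features of one URL."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    try:
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False
    return {
        "host": host,
        "host_is_ip": host_is_ip,            # raw IP address instead of a domain
        "dot_count": host.count("."),        # many dots can mean stacked subdomains
        "has_at_sign": "@" in parsed.netloc, # user@host form can hide the real host
    }
```

Because nothing is downloaded, the recipient's machine and the mail server never contact the suspicious site. The resulting feature dictionaries could then be passed to a classifier such as a Bayesian one.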

The research paper covers different types of content-based filters, the algorithms tried by various researchers, and the conclusions they reached about the most effective algorithm for a content-based filter. The researchers also found that most spammers use automated bots to create and send spam emails, which in most cases have predictable features. The algorithm used to filter such emails can be fed this information and used to predict those behaviors so that the emails are detected as spam. Email servers and services are investing heavily in spam filtering these days. Email, having become one of the most important forms of communication, is under constant threat. Services like Gmail have aggressive filters that send an email away from the inbox at the slightest suspicion. Such services also allow users to modify the spam filter on their end and help the filter. Users can create whitelists and blacklists, and some filters also allow users to enter phrases that they do not want to see in their inbox, so that the filter tags matching emails and moves them to the spam folder. [1] even discusses Gmail-specific examples, taking data sets of sample Gmail emails and running experiments to see whether the URLs in the emails were caught by the spam filter. They use a kind of decision tree with the Bayesian algorithm to derive performance on the basis of parameters [1].

You can download my complete report here