Spam is THE four-letter word of IT. Nothing annoys users, developers and IT managers more than filtering this frivolous data coming into their systems. Email spam, at least, has some very good utilities that can filter out a large proportion of the unwanted messages while keeping false positive and false negative rates relatively low. Most of these utilities are so mature that you simply install them, configure a few settings and generally forget about them, with the filter taking care of everything.
Comment or form spam, on the other hand, does not have a drop-in solution because of the level of integration a form has within the larger system. The field types and names of each form vary drastically, especially when compared to the standardized MIME headers of an email. Drop-in solutions have been attempted for form spam, but they often have limited success when run independently of more integrated methods.
The various form spam prevention methods can be grouped into one of 10 general categories.
Field Specificity & Validation
Field specificity and strict validation of the submitted values should be the first line of defense. Whenever you create a public form, define the inputs with as much specificity as possible, then validate strictly against that specificity. The HTML5 specification has made this much easier with its expanded set of input types.
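As a minimal sketch of server-side strict validation (the field names, patterns and the idea of rejecting unexpected fields are illustrative assumptions, not part of any particular framework):

```python
import re

# Hypothetical whitelist: every expected field paired with a strict validator.
FIELD_VALIDATORS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "zip_code": re.compile(r"^\d{5}$"),
    "age": re.compile(r"^\d{1,3}$"),
}

def validate_submission(fields):
    """Reject any submission with missing, unexpected, or invalid fields."""
    # Any deviation from the exact expected field set is grounds for rejection.
    if set(fields) != set(FIELD_VALIDATORS):
        return False
    return all(FIELD_VALIDATORS[name].match(value or "")
               for name, value in fields.items())
```

Note that the check is a whitelist in both directions: extra fields are as suspicious as invalid values.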
Request & Response Header Validation
Taking a step back from the HTML side is validation of the HTTP headers: those of the request that fetches the form HTML and those of the request that posts the form values. Running some basic checks on these headers can provide an early warning of the presence of a spambot.
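A few basic header checks might look like the following sketch; the specific headers tested and the `SITE_PREFIX` value are illustrative assumptions, and real browsers vary, so these should be treated as warning signs rather than proof:

```python
SITE_PREFIX = "https://example.com/"   # hypothetical: your own site's origin

def looks_like_spambot(headers):
    """Run basic sanity checks on the POST request headers; True means suspicious."""
    if not headers.get("User-Agent"):
        return True                    # real browsers always send a User-Agent
    if not headers.get("Accept"):
        return True                    # and an Accept header
    referer = headers.get("Referer", "")
    if referer and not referer.startswith(SITE_PREFIX):
        return True                    # the form was posted from another site
    return False
```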
Honeypots
Honeypots are a concept taken straight from email spam prevention and come in two types: honeypot fields and honeypot forms. A honeypot is essentially a very tempting submission target that should never receive real data. Any submission to the honeypot is automatically labeled as spam.
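A honeypot field check reduces to a few lines; the field name `website_url` is a hypothetical example (the field would be hidden from humans via CSS, so only a bot would fill it in):

```python
def honeypot_tripped(fields, honeypot_field="website_url"):
    """The honeypot field is hidden from humans (e.g. via CSS), so a real
    user leaves it empty; any non-empty value marks the submission as spam."""
    return bool(fields.get(honeypot_field, "").strip())
```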
Timestamps & Form Expiration
Humans are inherently slower than computers when it comes to reading and filling out a form. Even a simple login form, where everything is auto-completed and you just have to click the “login” button, takes a second, while a computer can do it in milliseconds. More complex forms require even more time for a human to read, understand and complete. Recording the timestamp of the form request and requiring the response to occur within a set range makes the automatic completion of the form more expensive for a spambot.
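One way to sketch this is to embed a signed timestamp in a hidden field, then accept only submissions whose age falls inside a window; the secret key, field format and window bounds here are illustrative assumptions:

```python
import hashlib
import hmac
import time

SECRET = b"server-side secret"   # hypothetical key, never sent to the client

def issue_token(now=None):
    """Embed a signed timestamp in the form as a hidden field value."""
    ts = str(int(now if now is not None else time.time()))
    sig = hmac.new(SECRET, ts.encode(), hashlib.sha256).hexdigest()
    return f"{ts}:{sig}"

def check_token(token, min_age=5, max_age=3600, now=None):
    """Accept only submissions between min_age and max_age seconds old;
    the signature stops the client from forging the timestamp."""
    try:
        ts, sig = token.split(":")
    except ValueError:
        return False
    expected = hmac.new(SECRET, ts.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False
    age = (now if now is not None else time.time()) - int(ts)
    return min_age <= age <= max_age
```

The lower bound catches bots that answer instantly; the upper bound expires stale forms.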
Request Based Field Names
Merging the underlying concepts behind honeypot fields, form expiration and dynamic fields creates request-based field names. In this scheme, every request has a unique set of field names, and the field names are validated against the source of the request. If the field names have been reused, the submission is deemed spam. This forces every submission of the form to be fetched and parsed individually, which often isn’t how spambots operate. Parsing the HTML for each request requires significant processing power (from a computer or a person) and limits the cost-effectiveness of spam, whose value proposition is often based upon volume.
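One possible sketch derives per-request field names with an HMAC over a one-shot nonce; the key, the in-memory nonce store and the single `comment` field are all illustrative assumptions standing in for a real session store:

```python
import hashlib
import hmac
import os

SECRET = b"server-side secret"   # hypothetical signing key
issued_nonces = set()            # stand-in for a server-side session store

def field_name_for(nonce, logical_name):
    """Derive the per-request name for a logical field like 'comment'."""
    mac = hmac.new(SECRET, nonce + b":" + logical_name.encode(),
                   hashlib.sha256).hexdigest()[:16]
    return f"f_{mac}"

def issue_form():
    """Serve a fresh form: new nonce, new field names."""
    nonce = os.urandom(16)
    issued_nonces.add(nonce)
    return nonce, {"comment": field_name_for(nonce, "comment")}

def accept_submission(nonce, posted_names):
    """Field names are valid once, and only against the nonce that issued them."""
    if nonce not in issued_nonces:
        return False
    issued_nonces.discard(nonce)   # one-shot: any reuse is treated as spam
    return posted_names == {"comment": field_name_for(nonce, "comment")}
```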
CAPTCHA
CAPTCHA is a backronym for “Completely Automated Public Turing test to tell Computers and Humans Apart” and is generally the bane of any user trying to submit a public form. The concept involves displaying an image containing characters and asking the human to retype those characters into a text box. Computers are supposedly unable to read the characters in the image, while humans can understand them easily.
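Leaving the image distortion aside, the challenge/response bookkeeping can be sketched as follows; the salt, answer length and case-insensitive comparison are illustrative assumptions, and a real deployment would render the answer into a distorted image and keep the token in the session:

```python
import hashlib
import random
import string

def make_challenge():
    """Generate the characters that would be drawn into a distorted image.
    Only a salted hash of the answer is kept server-side."""
    answer = "".join(random.choices(string.ascii_uppercase, k=6))
    token = hashlib.sha256(("salt:" + answer).encode()).hexdigest()
    return answer, token   # answer goes into the image, token into the session

def check_answer(user_input, token):
    """Compare the user's answer (case-insensitively) against the stored hash."""
    digest = hashlib.sha256(("salt:" + user_input.upper()).encode()).hexdigest()
    return digest == token
```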
Hashcash
Hashcash is an iterative hashing scheme that requires the client (i.e., the web browser) to repeatedly hash a set of data (the serialized form fields plus a counter) until the output clears a bitmask. The iterative nature of hashcash forces the browser to expend a meaningful amount of computation to find the correct hash output, while the server only needs to perform the hash once to verify it.
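The asymmetry is visible in a small sketch: minting loops until the digest has a run of leading zero bits, verification hashes once. SHA-256 and the 12-bit default difficulty are assumptions for illustration:

```python
import hashlib
import itertools

def has_leading_zero_bits(digest, bits):
    """True if the digest's top `bits` bits are all zero."""
    value = int.from_bytes(digest, "big")
    return value >> (len(digest) * 8 - bits) == 0

def mint(form_payload, bits=12):
    """Client side: iterate a counter until the hash clears the bitmask."""
    for counter in itertools.count():
        candidate = f"{form_payload}:{counter}".encode()
        if has_leading_zero_bits(hashlib.sha256(candidate).digest(), bits):
            return counter

def verify(form_payload, counter, bits=12):
    """Server side: a single hash suffices to check the client's work."""
    candidate = f"{form_payload}:{counter}".encode()
    return has_leading_zero_bits(hashlib.sha256(candidate).digest(), bits)
```

Raising `bits` by one doubles the expected client work while leaving the server's cost unchanged.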
Blacklists & Keyword Filtering
Blacklists and keyword filtering involve running regular expressions against the submitted content to extract HTML tags, URLs, email addresses and specific keywords. The results of the regular expressions are checked against a blacklist of banned values, with any match indicating a spam submission. This method is strictly dependent upon the quality and completeness of the blacklist database.
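A minimal sketch of the extract-and-check approach; the regular expressions and the blacklist entries are illustrative assumptions, and a real deployment would load the lists from a maintained database:

```python
import re

URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)

# Hypothetical blacklist entries standing in for a real blacklist database.
BLACKLISTED_DOMAINS = {"cheap-pills.example", "casino.example"}
BLACKLISTED_KEYWORDS = {"viagra", "free money"}

def is_blacklisted(text):
    """Extract keywords and URL domains, then check them against the blacklists."""
    lowered = text.lower()
    if any(keyword in lowered for keyword in BLACKLISTED_KEYWORDS):
        return True
    for url in URL_RE.findall(text):
        domain = re.sub(r"^https?://", "", url, flags=re.IGNORECASE).split("/")[0]
        if domain.lower() in BLACKLISTED_DOMAINS:
            return True
    return False
```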
Bayesian Analysis
Bayesian analysis is the basis for most of the good email spam filters. The overall principle is to run statistical analysis on the request headers and the posted values against a database of known good content and known spam content. The Bayesian analysis outputs a probability of the content being spam, which is then compared against a set threshold, with the submission discarded if the probability is too high. Bayesian analysis can be the most effective method since it is based upon the actual content of the form submission, but its effectiveness is highly dependent upon the training against good and bad content. It is also by far the most complex to implement and requires the most resources to run.
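The core of such a filter can be sketched as a toy word-level naive Bayes classifier with Laplace smoothing; this is a simplified illustration of the statistical idea, not a production filter, and a real one would persist its counts and tokenize far more carefully:

```python
import math
from collections import Counter

class NaiveBayesFilter:
    """Toy word-level naive Bayes classifier with Laplace smoothing."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.totals = {"spam": 0, "ham": 0}

    def train(self, label, text):
        """Record word counts for a known spam ('spam') or good ('ham') sample."""
        words = text.lower().split()
        self.counts[label].update(words)
        self.totals[label] += len(words)

    def spam_probability(self, text):
        """Combine per-word log-odds into an overall spam probability."""
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"])) or 1
        log_odds = 0.0
        for word in text.lower().split():
            p_spam = (self.counts["spam"][word] + 1) / (self.totals["spam"] + vocab)
            p_ham = (self.counts["ham"][word] + 1) / (self.totals["ham"] + vocab)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))
```

Submissions whose probability exceeds a chosen threshold (say 0.9) would then be discarded.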
The source code required to implement some of these methods can be long and a little complex. So, over the next month, I will be publishing posts with more details on how to implement each of these protections, along with some notes on when each methodology should be used.