
Defending Against Spambots

SPAM is THE four-letter word of IT. Nothing makes users, developers and IT managers more annoyed than filtering this frivolous data coming into their systems. Email spam, at least, has some really good utilities that can block a large portion of the unwanted messages while keeping relatively low false positive and false negative rates. Most of these utilities are so mature that you simply install them, configure a few settings and generally forget about them, with the filter taking care of everything.

Comment or form spam, on the other hand, does not have a drop-in solution because of the level of integration a form has within the larger system. The field types and names of each form vary drastically compared to the MIME headers of an email. Drop-in solutions have been attempted for form spam, but they tend to have limited success when run independently of more integrated methods.

The various form spam prevention methods can be grouped into 10 general categories.

Field Specificity & Validation

Field specificity and absolute validation of the replied values should be the first level of defense. Whenever a public form is created, you create the inputs with as much specificity as possible, then validate strictly against this specificity. The HTML5 specification has made this much easier with the expansion of the types of input fields.

Request & Response Header Validation

Taking a step back from the HTML side is the validation of the HTTP headers of the request for the form HTML and of the associated request that posts the form values. Running some basic checks of the HTTP headers can provide an early warning of the presence of a spambot.

Honeypots

Honeypots are a concept taken straight from email spam prevention and come in 2 types: honey pot fields and honey pot forms. Honeypots are basically a very tempting submission location that should never receive real data. Any submissions to the honeypot are automatically labeled as spam.

Form Expirations

Humans are inherently slower than computers when it comes to reading and filling out a form. Even a simple login form where everything is auto-completed and you just have to click the “login” button takes a second, while a computer can do it in milliseconds. More complex forms require even more time for a human to read, understand and complete. Recording the timestamp of the form request and requiring the response to occur within a set range makes the automatic completion of the form more expensive for a spambot.

Dynamic Fields

One of the things spambots often cannot do is run JavaScript. A simple preventative measure, therefore, is to dynamically create a form field via JavaScript that requires some kind of user interaction to pass the server-side validation. This can be as simple as a check box that the user needs to check to indicate they are human or a slider that needs to be moved to a specific position.

Request Based Field Names

Merging the underlying concepts behind honey pot fields, form expirations and dynamic fields creates request based field names. In this situation, every request has a unique set of field names, and the field names are validated against the source of the request. If the field names have been reused, the submission is deemed spam. This requires the form to be fetched fresh for every submission, which often isn't the case with spambots. Parsing the HTML on every request requires significant processing power (from a computer or person) and limits the cost effectiveness of spam, whose value proposition is often based upon volume.

CAPTCHAs

CAPTCHA is a backronym for “Completely Automated Public Turing test to tell Computers and Humans Apart” and is generally the bane of any user trying to submit a public form. The concept involves displaying an image containing distorted characters and having the human retype the characters into a text box. Computers are not supposed to be able to read the characters in the image, while humans can understand them easily.

Hashcash

Hashcash is an iterative hashing scheme that requires the client (i.e. the web browser) to repeatedly hash a set of data (the serialized form fields plus a changing nonce) until the output clears a given bitmask. The iterative nature of hashcash forces the web browser to expend a meaningful amount of computation to find a valid hash output, while the server only needs to perform the hash once to verify it.
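For illustration, here is a minimal sketch of the server-side half of this scheme, assuming the client submits a nonce it found by iterating until the hash of the serialized fields cleared the bitmask; the field serialization, function name and difficulty are illustrative rather than a standard:

//! Verify a hashcash-style proof of work on a form submission
//! @param array $fields Submitted form fields (excluding the nonce)
//! @param string $nonce Nonce value found by the client
//! @param int $zeroBits Number of leading zero bits required @default[16]
//! @return bool TRUE if the proof of work checks out, else FALSE
function verifyFormHashcash(array $fields, $nonce, $zeroBits = 16) {
    // serialize the fields the same way the client-side script did
    ksort($fields);
    $data = http_build_query($fields) . ':' . $nonce;

    // the server hashes once; the client had to iterate many times to find $nonce
    $hash = hash('sha256', $data, TRUE);

    // check the leading bits of the raw hash against the required bitmask
    $fullBytes = $zeroBits >> 3;
    $remainder = $zeroBits & 7;
    if(substr($hash, 0, $fullBytes) !== str_repeat("\0", $fullBytes)) {
        return FALSE;
    }
    if($remainder && (ord($hash[$fullBytes]) >> (8 - $remainder)) !== 0) {
        return FALSE;
    }
    return TRUE;
}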

Blacklists & Keyword Filtering

Blacklists and keyword filtering involve running regular expressions against the submitted content to extract HTML tags, URLs, email addresses and specific keywords. The results of the regular expressions are checked against a blacklist of banned values, with any match indicating a spammy submission. This method is strictly dependent upon the quality and completeness of the blacklist database.
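As a rough sketch of this approach (the regular expressions and the blacklist entries are illustrative; a real deployment would pull the blacklist from a maintained database):

//! Check submitted content against a blacklist of domains and keywords
//! @param string $content Submitted form content
//! @param array $blacklist List of banned domains, keywords and tag names
//! @return bool TRUE if the content looks spammy, else FALSE
function containsBlacklistedContent($content, array $blacklist) {
    // extract urls, email addresses and html tag names from the content
    preg_match_all('#https?://[^\s"\'<>]+#i', $content, $urls);
    preg_match_all('/[\w.+-]+@[\w-]+\.[\w.]+/', $content, $emails);
    preg_match_all('/<\s*([a-z0-9]+)/i', $content, $tags);

    $extracted = array_merge($urls[0], $emails[0], $tags[1], array($content));

    // any hit against the blacklist marks the submission as spam
    foreach($extracted as $piece) {
        foreach($blacklist as $banned) {
            if(stripos($piece, $banned) !== FALSE) {
                return TRUE;
            }
        }
    }
    return FALSE;
}

// example usage with an illustrative blacklist and field name
$isSpam = containsBlacklistedContent($_POST['comment'], array('casino', 'spammy-domain.example'));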

Bayesian Analytics

Bayesian analysis is the basis for most of the good email spam filters. The overall principle is to run statistical analysis on the response headers and the posted values against a database of known good content and known spam content. The Bayesian analysis outputs a probability of the content being spam, which is then compared against a set threshold, and the submission is discarded if the probability is too high. Bayesian analysis can be the most effective method since it is based upon the actual content of the form submission, but its effectiveness is highly dependent upon the training against good and bad content. It is also by far the most complex method to implement and requires the most resources to run.
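For a sense of what the scoring step looks like, here is a heavily simplified naive Bayes sketch; it assumes you already have per-token counts from a training database of known spam and known good submissions, and the function and parameter names are illustrative:

//! Score a submission against per-token spam/ham counts (naive Bayes)
//! @param string $content Submitted content
//! @param array $spamCounts Token => number of spam messages containing it
//! @param array $hamCounts Token => number of good messages containing it
//! @param int $nSpam Number of spam messages in the training set
//! @param int $nHam Number of good messages in the training set
//! @return float Probability (0..1) that the content is spam
function bayesianSpamProbability($content, array $spamCounts, array $hamCounts, $nSpam, $nHam) {
    // work in log space to avoid floating point underflow
    $logSpam = log($nSpam / ($nSpam + $nHam));
    $logHam  = log($nHam / ($nSpam + $nHam));

    $tokens = preg_split('/[^a-z0-9]+/i', strtolower($content), -1, PREG_SPLIT_NO_EMPTY);
    foreach($tokens as $token) {
        // Laplace smoothing so unseen tokens don't zero out the probability
        $pTokenSpam = ((isset($spamCounts[$token]) ? $spamCounts[$token] : 0) + 1) / ($nSpam + 2);
        $pTokenHam  = ((isset($hamCounts[$token]) ? $hamCounts[$token] : 0) + 1) / ($nHam + 2);
        $logSpam += log($pTokenSpam);
        $logHam  += log($pTokenHam);
    }

    // convert back from log space to a probability
    return 1 / (1 + exp($logHam - $logSpam));
}

// discard or quarantine the submission if the returned probability exceeds a chosen threshold, e.g. 0.9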

The source code required to implement some of these methods can be long and a little complex. So, over the next month, I will be publishing posts with more details on how to implement each of these protections as well as some notes on when each methodology should be implemented.

Defending Against Spambots - Field Specificity & Validation

Field specificity and absolute validation of the replied values should be the first level of defense. Whenever a public form is created, you create the inputs with as much specificity as possible, then validate strictly against this specificity. The HTML5 specification has made this much easier with the expansion of the types of input fields.

For example, if you are asking for a user's age, use an input with type="number" and step="1" min="5" max="120" instead of a simple type="text". This forces the user to input an integer between 5 and 120 (the range of legitimately possible ages of a user); otherwise the form field should indicate it is an illegal value and prevent submission of the form. Then, on the server side, you validate strictly against these criteria, immediately tossing any submission that contains an invalid value. There is an added bonus: the error messages for HTML5-compliant browsers don't need to be as robust, since the user should already have received an error when they first attempted to fill in the field.
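In markup, the input described above might look like the following (the field name and label are illustrative):

<label for="user_age">Your age</label>
<input id="user_age" name="user_age" type="number" step="1" min="5" max="120" required/>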

Example Validation Function

//! Validate input value of a Number Input
//! @param string $input Inputted value
//! @param int $min Minimum Value @default[0]
//! @param int $max Maximum Value @default[100]
//! @param string $step Incremental increase between minimum and maximum value @default[1]
//! @success string Returns inputted value on success (including potentially 0)
//! @failure FALSE Returns FALSE on validation failure
function validateInputNumber($input, $min = 0, $max = 100, $step = 1) {
    // verify all inputs are numbers
    if(!is_numeric($input)
        || !is_numeric($min)
        || !is_numeric($max)
        || !is_numeric($step)
    ) {
        return FALSE;
    }

    // verify $input is within appropriate range
    if($input < $min || $input > $max) return FALSE;

    // check that $input is at a valid step position
    $inc = ($input - $min) / $step;
    if($inc != intval($inc)) return FALSE;

    // all checks passed, return $input
    return $input;
}

// example pass ($input == '32.5')
$input = validateInputNumber('32.5', 0, 100, 2.5);

// example fail ($input === FALSE)
$input = validateInputNumber('32', 0, 100, 2.5);

A complete list of the HTML5 input fields can be found at MDN:

https://developer.mozilla.org/en-US/docs/Web/HTML/Element/input

Defending Against Spambots - Request & Response Header Validation

Taking a step back from the HTML side is the validation of the HTTP headers of the request for the form HTML and of the associated request that posts the form values. Running some basic checks of the HTTP headers can provide an early warning of the presence of a spambot.

Before serving the form HTML, the server validates that the request has the appropriate headers for a real browser. If the “Host”, “User-Agent” or “Accept” headers are not sent, it is likely a cURL request for the page, and therefore a script harvesting and attacking the form. This is basic security by obscurity, and as such should be viewed as an attack-limiting approach rather than actual security of the form. An attacker can just as easily visit the page with a web browser, cut & paste the HTML into their script and attack the form. Limiting the display of the form simply limits how much of this process can be done by scripts, particularly poorly written spambots.

The other side of the coin is the headers sent with the response to the form page, i.e. the request that posts the form values. In addition to checking for the headers required when initially serving the form, you should also check for the correct HTTP method (GET vs POST) and for the “Cookie” (if applicable) and “Referer” headers (yes, the Referer header is misspelled in the standard). A real browser will never mess up the HTTP method and switch between GET and POST, while a spambot may default to the wrong method. Bots are also often mediocre about managing cookies, so the lack of a cookie header can be indicative of a spambot, or of a paranoid user.

The “Referer” header should not be used conclusively to determine if the page was sent from a web browser. Some internet security suites and browser plugins will mess with the “Referer” header, either erasing it or replacing it with the destination domain. Further, pages not served over TLS should not receive the “Referer” header when the request comes from a page served over TLS. (Forms served over TLS should never post to a page not served over TLS anyways.) Lastly, the HTML5 standard includes the ‘referrer’ meta tag, which can be set to ‘no-referrer’, in which case the browser is not supposed to send the Referer header from the source page.
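Given those caveats, the submission-side checks are better scored than used as hard rejections. A minimal sketch, assuming PHP's $_SERVER superglobal and an illustrative function name:

//! Count suspicious signals in the headers of a form submission request
//! @param string $formUrl URL of the page that served the form
//! @return int Number of suspicious signals (0 means the headers look like a browser's)
function countSuspiciousSubmissionHeaders($formUrl) {
    $suspicion = 0;

    // the form should always be submitted via POST
    if($_SERVER['REQUEST_METHOD'] !== 'POST') {
        $suspicion++;
    }

    // bots are often mediocre about managing cookies
    if(empty($_SERVER['HTTP_COOKIE'])) {
        $suspicion++;
    }

    // the referer, when present, should point at the page that served the form;
    // an empty referer is tolerated because plugins and privacy tools strip it
    if(!empty($_SERVER['HTTP_REFERER'])
        && stripos($_SERVER['HTTP_REFERER'], $formUrl) !== 0
    ) {
        $suspicion++;
    }

    return $suspicion;
}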

The last check that should be performed is the geolocation of the source IP address within the context of the form. For example, if the form is for online ordering at a pizzeria in Chicago, a request or submission from an IP address geolocated in Australia has a very low probability of being legitimate.

There is one caveat to filtering based upon IP addresses: VPNs and proxies. Smart phones in particular should be the biggest concern for most implementations, since the IP address of a phone on a cellular network is often geolocated to the carrier's headquarters rather than to the location of the user.
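A minimal sketch of the geolocation check, assuming the PECL geoip extension (which provides geoip_country_code_by_name()) is installed; the allowed-country list is illustrative:

//! Check whether the requesting IP geolocates to an expected country
//! @param array $allowedCountries ISO country codes considered legitimate
//! @return bool TRUE if the country is allowed or cannot be determined
function checkRequestGeolocation(array $allowedCountries = array('US')) {
    // cannot check without the geoip extension, so don't block
    if(!function_exists('geoip_country_code_by_name')) {
        return TRUE;
    }

    $country = @geoip_country_code_by_name($_SERVER['REMOTE_ADDR']);

    // unresolvable addresses are not treated as spam on their own
    if($country === FALSE) {
        return TRUE;
    }

    return in_array($country, $allowedCountries, TRUE);
}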

Example HTTP Header Validation Function

//! Checks for HTTP Request Headers of a submitted form page
//! @param string $formUrl URL of the form page
//! @return bool Returns TRUE if all the checks passed, else FALSE
function checkRequestHeaders($formUrl = '') {
    // make sure $formUrl is a legit url
    if(!empty($formUrl)
        && !filter_var($formUrl, FILTER_VALIDATE_URL, FILTER_FLAG_SCHEME_REQUIRED | FILTER_FLAG_HOST_REQUIRED)
    ) {
        return FALSE;
    }

    // verify presence of basic headers
    if(empty($_SERVER['HTTP_USER_AGENT'])
        || empty($_SERVER['REMOTE_ADDR'])
        || empty($_SERVER['HTTP_ACCEPT'])
        || empty($_SERVER['HTTP_ACCEPT_LANGUAGE'])
    ) {
        return FALSE;
    }

    return TRUE;
}

A complete list of the HTTP header fields can be found on Wikipedia:

https://en.wikipedia.org/wiki/List_of_HTTP_header_fields

W3C Referrer Policy Working Draft

https://www.w3.org/TR/referrer-policy/

Defending Against Spambots - Honeypots

Honeypots are a concept taken straight from email spam prevention and come in 2 types: honey pot fields and honey pot forms. Honeypots are basically a very tempting submission location that should never receive real data. Any submissions to the honeypot are automatically labeled as spam.

Honey pot fields are fields within a form that should always be left blank and are indicated as such to the user via a label. When a form is submitted with that field completed, it can be quickly marked as spam, discarded, and the submitter's fingerprint recorded for tracking. In order to make the field tempting, the field name and field type should be chosen wisely. An input field with a name of “website” and a type of “url” is more tempting to a spambot than an input field with a name of “honeypot” and a type of “text”. Good spambots will detect the field type and name and try to inject appropriate content to bypass automated validation mechanisms.

Example Honey pot field

<style>
    form>div#form_hp {
        position: absolute;
        left: -99999px;
        z-index: -99999;
    }
</style>
<form method="POST" action="">
    <div id="form_hp">
        <label for="another_email">Leave this field blank</label>
        <input id="another_email" name="another_email" type="email" value=""/>
    </div>
    <!-- the real form content -->
</form>

When hiding the honey pot field, the best method is to use embedded CSS to shift the field's wrapper off the screen. A good quality bot will check to see which fields are natively displayed and only submit information to those. Fields with “display:none” or “visibility:hidden” can easily be marked as hidden. Even situations where the field itself is absolutely positioned off screen can be detected without too much difficulty. Moving the wrapper off screen via CSS requires considerably more programming to detect, as all the CSS needs to be parsed and applied before the display state of any field can be evaluated. The CSS should be embedded in the HTML to prevent loading issues where an external CSS file is not loaded and the wrapper, along with the honey pot fields, is displayed to the user.
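The server-side half of the check is trivial; this sketch assumes the “another_email” field from the example markup above:

//! Check the honey pot field from the example form above
//! @return bool TRUE if the submission looks human (field present but blank), else FALSE
function checkHoneypotField() {
    // a real browser always submits the field, just with an empty value
    if(!isset($_POST['another_email'])) {
        return FALSE;
    }
    // any content in the honey pot marks the submission as spam
    if(trim($_POST['another_email']) !== '') {
        return FALSE;
    }
    return TRUE;
}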

Honey pot forms are entire forms that a real user should never find or submit information to, but that are easily discovered by automated scripts. Hidden links to the page containing the form are embedded in the footer or header and marked so they should not be followed by bots. The page then contains a description that clearly states the form should not be used, along with a bunch of tempting fields to submit. Any submission to this form is consequently deemed to come from a bot, and appropriate measures are taken. This type of honey pot can be integrated into a web-server layer filter (via a web application firewall like ModSecurity) where the submissions are tracked before they reach the application layer and attacks are mitigated at the web server.

The biggest concern with honey pot forms is search engines and their bots finding the pages and then displaying them in search results. Appropriate steps should be taken to keep bots off the honeypot links through usage of the rel="nofollow" attribute on the hidden links, a robots meta tag set to ‘noindex, nofollow’ in the html head section of the form page, and clear text on the page saying not to submit the form.

Defending Against Spambots - Form Expirations

Humans are inherently slower than computers when it comes to reading and filling out a form. Even a simple login form where everything is auto-completed and you just have to click the “login” button takes a second, while a computer can do it in milliseconds. More complex forms require even more time for a human to read, understand and complete. Recording the timestamp of the form request and requiring the response to occur within a set range makes the automatic completion of the form more expensive for a spambot.

The timestamp can be sent along with the normal form fields as a hidden input field, so long as the validity of the timestamp is checked when validating the form submission. The easiest method is an HMAC check with a server-specific key. This also allows additional data to be folded into the timestamp field, like the requester's IP address and user agent.

Example Creation and Validation of a Form Timestamp

// globals
$gvServerPublicFormKey = '5f4dcc3b5aa765d61d8327deb882cf99';

//! Create Timestamp Field
//! @return string HTML of timestamp field
function createTimestampField() {
    global $gvServerPublicFormKey;

    // get current unix timestamp
    $t = time();

    // compose raw value as time:IP:user agent
    $data = $t . ':' . $_SERVER['REMOTE_ADDR'] . ':' . $_SERVER['HTTP_USER_AGENT'];

    // generate HMAC hash
    $hash = hash_hmac('sha512', $data, $gvServerPublicFormKey);

    // build hidden input with the timestamp embedded in the field name
    $html = '<input type="hidden" name="ts-' . $t . '" value="' . $hash . '"/>';

    return $html;
}

//! Validate Timestamp Input
//! @param int $min Minimum delay time in seconds @default[5]
//! @param int $max Maximum delay time in seconds @default[1200]
//! @returns bool Returns the validity of the timestamp input field
function validateTimestampInput($min = 5, $max = 1200) {
    global $gvServerPublicFormKey;

    $t = 0;
    $hash = '';

    // find the timestamp field
    foreach($_REQUEST as $key => $val) {
        if(strpos($key, 'ts-') !== 0) continue;

        $t = substr($key, 3);

        // validate potential timestamp value: integer, at least $min
        // seconds old and no more than $max seconds old
        if(!$t || intval($t) != $t
            || $t + $min > time()
            || $t + $max < time()
        ) {
            continue;
        }

        $hash = $val;
        break;
    }

    // potentially valid timestamp not found
    if(!$hash) return FALSE;

    // generate hash based upon timestamp value
    $data = $t . ':' . $_SERVER['REMOTE_ADDR'] . ':' . $_SERVER['HTTP_USER_AGENT'];
    $correctHash = hash_hmac('sha512', $data, $gvServerPublicFormKey);

    // return validity of hmac hash
    return hash_equals($correctHash, $hash);
}

Defending Against Spambots - CAPTCHAs

CAPTCHA is a backronym for “Completely Automated Public Turing test to tell Computers and Humans Apart” and is generally the bane of any user trying to submit a public form. The concept involves displaying an image containing distorted characters and having the human retype the characters into a text box. Computers are not supposed to be able to read the characters in the image, while humans can understand them easily.

This worked well in 1997, when the concept was developed, but advances in image processing have required the images to become more and more obscured from simple plain text. Adding colors and lines, as well as distorting the shapes of the letters, keeps image processing applications from detecting the text. This obscurity also makes it challenging for anyone with visual impairments to read the text and get the CAPTCHA response correct.

These user experience issues make CAPTCHAs an undesirable defense against spambots, but one that can be implemented when the other solutions are inadequate. UX-focused sites often use a CAPTCHA only in situations where other protections have returned multiple failures but the system does not want to lock out a potentially legitimate user: password resets, login screens, account creation and search pages.

Integration of a CAPTCHA solution involves either integrating a third-party source into your form or generating the images yourself. Generating the images locally via an image manipulation library sounds like a good, cheap method for implementing CAPTCHA; however, there has been a significant effort placed on defeating the protection, and everything you can think of doing to the text to prevent analysis while keeping it readable by a human has been reverse-engineered. Good CAPTCHA solutions test their image database against the best analysis tools on a regular basis, eliminating those images defeated by the tools. Consequently, homebrew CAPTCHAs are often little better than having no protection at all while noticeably degrading the user experience.

Integrating a third-party solution generally involves embedding a JavaScript in your form which fetches the image and a unique ID code from the provider’s servers. The user then provides you with the plain text version and you check this, along with the image ID code which was submitted as a hidden form field, with the provider to get a pass or failure response. All of the good CAPTCHA providers have nice clear documentation about this process and attempt to make it as easy as possible to integrate their solution.
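As a rough sketch of that round trip, here is what the server-side check might look like against Google's reCAPTCHA v2 “siteverify” endpoint; the secret key is a placeholder and other providers follow a similar pattern:

//! Verify a reCAPTCHA v2 response with the provider's siteverify endpoint
//! @param string $secret Provider secret key (placeholder value)
//! @return bool TRUE if the provider reports success, else FALSE
function verifyCaptchaResponse($secret) {
    if(empty($_POST['g-recaptcha-response'])) {
        return FALSE;
    }

    // post the user's response token back to the provider for verification
    $context = stream_context_create(array(
        'http' => array(
            'method'  => 'POST',
            'header'  => "Content-Type: application/x-www-form-urlencoded\r\n",
            'content' => http_build_query(array(
                'secret'   => $secret,
                'response' => $_POST['g-recaptcha-response'],
                'remoteip' => $_SERVER['REMOTE_ADDR'],
            )),
        ),
    ));
    $result = file_get_contents('https://www.google.com/recaptcha/api/siteverify', FALSE, $context);
    if($result === FALSE) {
        return FALSE;
    }

    $json = json_decode($result, TRUE);
    return !empty($json['success']);
}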

I have avoided CAPTCHAs primarily due to the poor user experience factor. Different combinations of the other methods, especially the hashcash and the Bayesian analysis have provided good protection so far.

Defending Against Spambots - Dynamic Fields

One of the things spambots often cannot do is run JavaScript. A simple preventative measure, therefore, is to dynamically create a form field via JavaScript that requires some kind of user interaction to pass the server-side validation.

Initially this concept was applied to a simple check box that had the label “Check if you are human.” Spambots would not create nor check the box and the presence of the checkbox field was used to determine if the form was submitted via a human.
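A minimal sketch of that approach, with the checkbox created only at runtime so a bot that never executes JavaScript never sees it (the element id and field name are illustrative, and the server simply verifies that ‘is_human’ arrived with the expected value):

<form method="POST" action="" id="contact_form">
    <!-- the real form fields -->
    <div id="human_check"></div>
    <script>
        // created at runtime, so bots that don't run JavaScript never see the field
        var wrap = document.getElementById('human_check');
        wrap.innerHTML = '<label><input type="checkbox" name="is_human" value="1"/> ' +
            'Check this box if you are human</label>';
    </script>
</form>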

More advanced spambots utilize the V8 JavaScript engine and can mimic the loading of the page, during which the dynamic field is created. The bot then uses this dynamically created DOM as the source for the form element, and the associated field names, to be submitted. This level of sophistication is relatively rare in comment-spam bots, but for spambots focused on user account forms (login, password reset and account setup) it is becoming more common due to the increased value associated with bypassing these forms' validation methodologies.

The big caveat with this defense is that the 10% or so of users who have JavaScript disabled will never see the dynamic field and will submit the form without it, just like a spambot. An alternative to creating the fields with JavaScript is to use the new HTML5 range input and have the user move the slider from left to right, or to the center, depending on the instructions in the associated label. This only works for newer browsers, but it helps win back some of that 10%.

Defending Against Spambots - Request Based Field Names

Merging the underlying concepts behind honey pot fields, form expirations and dynamic fields creates request based field names. In this situation, every request has a unique set of field names, and the field names are validated against the source of the request. If the field names have been reused, the submission is deemed spam. This requires the form to be fetched fresh for every submission, which often isn't the case with spambots. Parsing the HTML on every request requires significant processing power (from a computer or person) and limits the cost effectiveness of spam, whose value proposition is often based upon volume.
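A hedged sketch of one way to implement this, deriving each field name from an HMAC of a per-request nonce (reusing the server key from the form expirations example); the helper names and nonce handling are illustrative rather than a definitive implementation:

//! Generate a request-specific field name
//! @param string $baseName Logical field name (e.g. 'email')
//! @param string $nonce Per-request nonce stored in the user's session
//! @return string Obfuscated field name unique to this request
function requestFieldName($baseName, $nonce) {
    global $gvServerPublicFormKey;
    return 'f_' . substr(hash_hmac('sha256', $baseName . ':' . $nonce, $gvServerPublicFormKey), 0, 16);
}

//! Look up the submitted value for a logical field name
//! @param string $baseName Logical field name
//! @param string $nonce Nonce issued with the request that served the form
//! @return mixed Submitted value, or FALSE if the expected field name is absent
function requestFieldValue($baseName, $nonce) {
    $name = requestFieldName($baseName, $nonce);
    // reused or fabricated field names won't match the nonce issued with this
    // request, so the lookup fails and the submission is deemed spam
    return isset($_POST[$name]) ? $_POST[$name] : FALSE;
}

// when serving the form: $nonce = bin2hex(random_bytes(16)); store it in the session
// and name each input with requestFieldName('email', $nonce), etc.
// when validating: $email = requestFieldValue('email', $_SESSION['form_nonce']);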