Stepping back from the HTML itself, the next layer of defense is validating the HTTP headers of the request for the form HTML, and of the later request that posts the form values. A few basic checks of the HTTP headers can provide an early warning that a spambot is present.
Before serving the form HTML, the server validates that the request carries the headers a real browser would send. If the “Host”, “User-Agent” or “Accept” headers are missing, the request most likely comes from a script (a cURL-based one, for example) rather than a browser, and therefore from something built to harvest and attack the form. This is security by obscurity, so it should be viewed only as an attack-limiting measure and not as actual security for the form. An attacker can just as easily open the page in a web browser, cut and paste the HTML into a script, and attack the form. Refusing to serve the form to such requests limits how much of this process can be done by scripts alone, particularly poorly written spambots.
The other side of the coin is the headers of the request that submits the form. In addition to checking for the headers required when initially serving the form, you should also check for the correct HTTP method (GET vs. POST), the “Cookie” header (if applicable) and the “Referer” header (yes, the header name is misspelled). A real browser will never mix up the HTTP method and send a GET where the form specifies POST, while a spambot may default to the wrong method. Bots are also often poor at managing cookies, so a missing cookie can indicate a spambot, or merely a paranoid user.
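The submission-side checks on the method and cookie can be sketched roughly as follows. The function name checkSubmissionBasics, its parameters, and the 'form_session' cookie name are assumptions made for illustration, and the form is assumed to be submitted with POST. A wrong method is treated as a hard failure, while a missing cookie is only flagged as suspicious, since a cautious user may block cookies.

```php
<?php
// Hypothetical helper, not part of the original example.
// Returns an array with two flags:
//   'reject'     => hard failure (wrong HTTP method)
//   'suspicious' => soft signal (no cookie present)
function checkSubmissionBasics(array $server, array $cookies, $cookieName = 'form_session') {
    $result = array('reject' => false, 'suspicious' => false);

    // Assumption: the form is posted, so any other method is rejected.
    $method = isset($server['REQUEST_METHOD']) ? strtoupper($server['REQUEST_METHOD']) : '';
    if ($method !== 'POST') {
        $result['reject'] = true;
    }

    // Bots often drop cookies, but so do users who block them,
    // so a missing cookie only raises suspicion.
    if (empty($cookies[$cookieName])) {
        $result['suspicious'] = true;
    }

    return $result;
}
```

In a real handler this would typically be called as checkSubmissionBasics($_SERVER, $_COOKIE), with the suspicious flag feeding into whatever other spam signals the site collects.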
The “Referer” header should not be treated as conclusive evidence that the submission came from a web browser. Some internet security suites and browser plugins tamper with the “Referer” header, either erasing it or replacing it with the destination domain. Further, browsers do not send the “Referer” header to a page not served over TLS when the request originates from a page served over TLS. (Forms served over TLS should never post to a page not served over TLS anyway.) Lastly, the HTML5 standard includes the meta element named ‘referrer’ (spelled correctly in this case), which can be set to ‘no-referrer’ to tell the browser not to send the referrer from the source page.
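Because the header is unreliable, one way to use it is as a three-way signal rather than a pass/fail test. The sketch below is a minimal illustration; the function name classifyReferer and its return values are assumptions, not part of the original example. It compares only the host, so a referrer that has been trimmed down to the site's origin still counts as a match.

```php
<?php
// Hypothetical helper: classifies the Referer of a form submission.
// Returns 'match', 'missing', or 'mismatch'.
function classifyReferer(array $server, $formUrl) {
    if (empty($server['HTTP_REFERER'])) {
        // Possibly stripped by a plugin, a referrer policy, or a TLS downgrade.
        return 'missing';
    }

    // parse_url() returns null or false when the host cannot be extracted.
    $refererHost = parse_url($server['HTTP_REFERER'], PHP_URL_HOST);
    $formHost = parse_url($formUrl, PHP_URL_HOST);
    if (!is_string($refererHost) || !is_string($formHost)) {
        return 'mismatch';
    }

    // Host names are case-insensitive.
    return strcasecmp($refererHost, $formHost) === 0 ? 'match' : 'mismatch';
}
```

A 'missing' result is probably best treated as neutral, while a 'mismatch' is a stronger hint that the submission did not originate from the form page.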
The last check that should be performed is the geolocation of the source IP address, judged within the context of the form. For example, if the form is for online ordering at a pizzeria in Chicago, a request or submission from an IP address geolocated in Australia has a very low probability of being legitimate.
There is one caveat to filtering based upon IP addresses: VPNs and proxies. Smartphones in particular should be the biggest concern for most implementations, since the IP address of a phone on a carrier's network is often geolocated to the carrier's corporate headquarters rather than to the location of the user.
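A geolocation check along these lines might look like the sketch below. No particular GeoIP service is assumed: $lookupCountry is a hypothetical callable, supplied by the caller, that maps an IP address to a two-letter country code or null when the location is unknown. The function name isLikelyLocal is likewise an illustration only. Given the VPN and smartphone caveat, the result is better used as one signal among several than as grounds for an outright block.

```php
<?php
// Hypothetical helper: $lookupCountry is any callable that takes an IP address
// and returns an ISO 3166-1 alpha-2 country code, or null when unknown.
function isLikelyLocal($ip, array $allowedCountries, callable $lookupCountry) {
    // Reject values that are not valid IPv4 or IPv6 addresses.
    if (filter_var($ip, FILTER_VALIDATE_IP) === false) {
        return false;
    }

    $country = $lookupCountry($ip);
    if ($country === null) {
        // Unknown location: give the visitor the benefit of the doubt.
        return true;
    }

    return in_array(strtoupper($country), $allowedCountries, true);
}

// Stub lookup for demonstration only; a real one would query a GeoIP database.
$stubLookup = function ($ip) {
    $table = array('192.0.2.10' => 'US', '198.51.100.7' => 'AU');
    return isset($table[$ip]) ? $table[$ip] : null;
};
```

For the Chicago pizzeria example, isLikelyLocal($_SERVER['REMOTE_ADDR'], array('US'), $stubLookup) would flag the address that the stub maps to 'AU' while letting unknown addresses through.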
Example HTTP Header Validation Function
//! Checks the HTTP request headers of a submitted form page
//! @param string $formUrl URL of the form page
//! @return bool Returns TRUE if all the checks passed, else FALSE
function checkRequestHeaders($formUrl = '') {
    // make sure $formUrl, when given, is a legit url
    // (FILTER_VALIDATE_URL already requires a scheme and a host)
    if (!empty($formUrl) && filter_var($formUrl, FILTER_VALIDATE_URL) === false) {
        return FALSE;
    }
    // verify presence of basic headers
    // (REMOTE_ADDR is the client address rather than a header, but must be set too)
    if (empty($_SERVER['HTTP_USER_AGENT'])
        || empty($_SERVER['REMOTE_ADDR'])
        || empty($_SERVER['HTTP_ACCEPT'])
        || empty($_SERVER['HTTP_ACCEPT_LANGUAGE'])
    ) {
        return FALSE;
    }
    return TRUE;
}
A complete list of HTTP header fields can be found on Wikipedia:
https://en.wikipedia.org/wiki/List_of_HTTP_header_fields
W3C Referrer Policy Working Draft