Bots are exceedingly common on the web. In fact, as of 2012, bot traffic exceeded human traffic on the web. That's right: more than 50% of the hits on your website, on average, come from robots rather than humans.
Bots have a wide range of purposes, and not all of them are bad. Some bots, like those wielded by Google and Bing, crawl and index your pages. If you were to block Googlebot, your site would eventually be removed from their index; they could no longer access it, so your content wouldn't show up in search results.
Other bots have more niche uses. Some exist solely to crawl e-commerce websites looking for deals. They cross-reference every e-shop they can find with a given product, so the home site can show the product's price at a wide range of shops. Some sellers use these to make sure they're at the top of the list, which is why a lot of Amazon listings gradually creep down a few cents every day: competing sellers out-list each other by tweaking prices down a penny or two at a time.
Other bots are less benign. Spam bots will search blogs looking for various comment systems they know how to exploit. Comments without authentication or captchas can be filled out by bots, and spam comments can be left to build link juice to spam sites, capture the clicks of ignorant web users, or even bomb an otherwise benign site with negative SEO.
Hacker bots exist to crawl the web looking at site infrastructure. They test domains for common /admin.htm style URLs, looking for websites that use a default CMS and haven't changed things like the username or password. They search for vulnerable sites, the low-hanging fruit, that they can access and exploit. They might harvest admin or user information, or just report URLs back to the owner of the hacker bot. They might be programmed to simply take down a site and replace it with their own content.
Botnet bots stem from malware infections: a virus takes over a user's computer and, either overtly or in the background, uses that computer's internet connection to do whatever the owner of the malware wants. Often that just means hammering a given URL in a DDoS attack, aimed at taking the site down or stressing the server enough for a hacker to get in through a bug in the code.
Scraper bots are malicious as well; they act like search engine bots, scraping content. Rather than adding it to a search index, however, they simply copy the content wholesale. Content, scripts, media; it’s all downloaded and placed on the spammer’s server, so they can use it to spin into – or just paste wholesale – content for their spam sites. It’s all disposable to them, just a resource they harvest and drop when it’s no longer useful.
Obviously, there's a lot wrong with these bots. Beyond their purposes, though, they have another side effect: server strain. Bots might be able to access your site in a stripped-down, lightweight manner, as the search engine bots often do, but even then they're still accessing your site. They still download content, make requests from your server, and generally use up resources.
In many cases, this can even bring down a site. I’ve seen reports of sites that have been hammered just by Google alone and been brought down, though Google is often smart enough to avoid doing so. With the sheer press of bot traffic on the web, though, there’s a lot to contend with.
And that's not to mention the data analysis problems that come later. Filtering bot traffic out of Google Analytics is a chore in itself, and it has to be done if you want your reports to reflect human usage rather than software usage.
Blocking Bots
There are two ways to block bots trying to access your site. One is through the robots.txt file, and the other is through the .htaccess file.
As you might have guessed from the title of this post, I’m going to be focusing on the second one. First, though, let’s talk about robots.txt. What is a robots.txt file?
A robots.txt file is a text file you place in the root directory of your server. Its purpose is to give guidelines to bots that want to access your site. You can use it to block bot access, either for specific bots or for all bots. So why not use it?
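For reference, a robots.txt file is just a handful of plain-text directives. A minimal sketch might look something like this, where "BadBot" and the /admin/ path are placeholders rather than anything from a real site:

User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /admin/

The first pair tells a bot calling itself "BadBot" to stay out entirely; the second tells every other bot to stay out of the /admin/ directory.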
The issue with robots.txt is that it’s giving guidance to the bots. If the bots choose not to respect it – by which I mean, if the creator of the bot programs it to ignore robots.txt – you can’t do anything. It’s like having your front gate open, but with a sign posted that says “robbers stay away.” If the robber chooses to ignore the sign, nothing stops them from walking through the gate.
The .htaccess file is a configuration file used by the Apache web server software. It's a lot more like a security guard at the front gate, actively stopping potential robbers. Except in this case, the security guard has the ability to see whether the person trying to enter is coming from RobberHome, is wearing a shirt that says "I'm a robber", or otherwise identifies themselves as a robber.
What this means is that the .htaccess file can actively block most bots, but not all bots. In particular, the botnet bots – slaved computers from normal users – are generally not blocked by default. This is because those are regular user computers, using regular user software. If you block them, you’re blocking humans. For most other bots, though, the .htaccess file is ideal.
Note that using the .htaccess file only works if your web server is running Apache. If you're using Nginx, Lighttpd, or another server, you'll have to find that software's own way of blocking bots.
Identifying Bots to Block
First of all, a word of warning. Be very careful when you’re blocking bots through the .htaccess file. One typo and you can end up blocking the entire Internet. Obviously you don’t want that.
The first thing you want to do is back up your current .htaccess file. In the case of an error that blocks traffic you don’t want blocked, you can restore the old file to revert the changes until you can figure out what went wrong.
The second thing you want to do is figure out how to find your own access logs. With Apache, you need to use a Linux/Unix command to access the log file. You can read about how to do that in this guide.
Using that guide, you can pull up a log file that records server access in quite a bit of detail. It will show you the IP address used to access the server, the identity of the client machine if available, the user ID of the machine if it used authentication, the time of the request, the request itself (the HTTP method and the resource requested), the status code the server returned, and the size of the object returned. This will likely be a huge file.
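As a rough illustration, a single entry in Apache's combined log format might look something like this (the IP address, timestamp, and user agent here are invented for the example):

203.0.113.45 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326 "-" "ExampleBot/1.0 (+http://www.example.com/bot.html)"

The final quoted field is the user agent string, which is mostly what you'll be scanning for bot names.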
The log file will have data on all of your regular users and all of your bot access. Some bots, like the Google bots, will identify themselves through their user agent information. Bad bots sometimes identify themselves too, but often they just have certain characteristics that flag them as non-human. They might be using an out-of-date version of a browser commonly known to be exploited. They might come from known spam addresses or domains.
This guide is pretty good at helping you identify which log entries are bad bots and which are either good bots or good users.
Generally, if a bot is only accessing your site once a month, you don't necessarily need to worry about it. You can block it if you like, but it isn't necessarily going to save you much time or effort. Your primary goal should be to block the bots that visit constantly and have a negative impact on your server's performance.
Be very careful when blocking by IP address or IP range. It’s easy to see a lot of bots coming from something like 168.*.*.*, with a variety of different numbers in the stars, and think “I can just block all of those! Block the entire /8 range!” The problem is, a /8 range in IPv4 is 16,777,216 different IP addresses, many of which may be used by legitimate users. You could block a massive amount of legitimate traffic with one overly-broad rule.
Most entries in a .htaccess file won’t block via IP address, simply because an IP address is too easy to change via proxies. Most will use user agent names, specific recurring IP addresses from bots that don’t care to change, or domains generally used to host spambots or hacker tools.
Using The .htaccess File
There are three methods we're going to use to block bots through the .htaccess file. The first, and most common, is to block on the bot's user agent. This is generally reliable, as normal users won't accidentally have a bot user agent.
In your .htaccess file, you first want a line that says "RewriteEngine on". This directive turns on Apache's rewrite engine so that the conditions and rules that follow actually take effect.
Next, you can add "RewriteCond %{HTTP_USER_AGENT}" as its own line, followed by the pattern to match. This sets up a rewrite condition based on the user agent. You have two options here: you can either list a ton of different user agents after this one directive, or you can add one user agent and then repeat the line for each bot. For example:
RewriteCond %{HTTP_USER_AGENT} \ 12soso|\ 192\.comagent|\ 1noonbot|\ 1on1searchbot|\ 3de\_search2 [NC,OR]

Or:

RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Acunetix [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^binlar [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [NC,OR]
Both work fine. With the first example, you'll need to start a new RewriteCond line every 500 or so entries, because Apache has trouble parsing extremely long single lines. Breaking the list up into individual entries makes the file more cluttered but arguably easier to read. Regardless, you can use either method.
The NC and OR bits at the end are rewrite flags. NC means "nocase", so the entry is not case-sensitive: "12soso" and "12Soso" are treated the same way. OR means "this or that": the bot is blocked as long as it matches any one of the entries on the list, as opposed to the implicit "AND", which would require it to match all of them.
After your list of bots, you need to specify the rewrite rule. All of this is just the first part of a two-part clause: if the user agent matches one of these conditions, then… The second part is what happens. Add "RewriteRule .* - [F,L]" on its own line.
Rather than redirecting the bot to a blocked page, this simply returns the 403 Forbidden status code for any request that matches the conditions. The [F] flag means Forbidden, and the [L] flag means "last", telling Apache to stop processing any further rewrite rules for that request.
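Putting the pieces together, a minimal user-agent block might look something like this. The bot names are just examples pulled from the list above; you'd substitute whatever you actually see in your logs:

RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} ^Acunetix [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^binlar [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [NC]
RewriteRule .* - [F,L]

Note that the final RewriteCond drops the OR flag; each condition ORs itself with the one that follows, so the last entry in the chain is usually written with just [NC].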
The other two methods are blocking based on HTTP referrer, and blocking based on IP address.
To block by HTTP referrer, start the line with "RewriteCond %{HTTP_REFERER}" (note that Apache spells it with a single R), follow it with the domain of the exploitative referrer, like www1.free-social-buttons\.com, and add the [NC] flag if you want the match to be case-insensitive. Add the same RewriteRule line afterwards. You'll end up with something like this:
RewriteCond %{HTTP_REFERER} www4.free-social-buttons\.com
RewriteRule ^.* - [F,L]

Finally, you can simply block based on IP address. If you notice that one particular IP address is especially detrimental, spamming your site a hundred times an hour or whatever, you can block it. Just write "Deny from *.*.*.*", where the stars are the IP address. It will look like "Deny from 173.192.34.95", possibly with a /28 or something at the end to block a range.
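As a sketch, denying the single address above along with the /28 range that contains it would look like this (the addresses are only illustrative):

# Apache 2.2-style syntax (still works on 2.4 if mod_access_compat is enabled)
Deny from 173.192.34.95
Deny from 173.192.34.80/28

If your server runs Apache 2.4 and rejects the older syntax, the newer mod_authz_core equivalent is a "Require not ip" line inside a <RequireAll> block.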
Shortcuts
If all of this is a little too complicated, you can take a shortcut and use lists other people have put together. I’ve found two to recommend. First is this pastebin entry from HackRepair.com. The second is this list from Tab Studio.
Any time you add blocks to your .htaccess file, make sure to test access to your site using a few different methods afterwards. If you're blocked in a way you shouldn't be, something has gone wrong and you need to fix the entry.