Introduction
The modern Internet is full of malicious robots and crawlers, such as malware bots, spam email bots, and content scrapers, that scan your website in clandestine ways. They do this to discover potential website weaknesses, harvest email addresses, or simply steal content from your website. Many of these bots can be identified by their "User-Agent" string.
They may not necessarily be malicious or spammy, but they also may not add any value to your business. Imagine that most of your hits come from unwelcome user-agents or spam referers: you think your site is getting good traffic when, in fact, most of those hits are useless.
As a first line of defense, you can try to keep malicious bots away by blacklisting their user-agents in a robots.txt file. Unfortunately, this works only for well-behaved bots that are designed to obey robots.txt; many malicious bots simply ignore it and scan your website at will.
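For reference, a minimal robots.txt rule looks like the following, where BadSpider is just a placeholder bot name; compliant crawlers identifying themselves with that name will skip the whole site, while non-compliant ones will not care.
User-agent: BadSpider
Disallow: /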
The best way to manage this is to stop such traffic at the edge, using network devices like a load balancer, firewall, or CDN. However, that may not be feasible for a personal blogger or a small website, in which case you may want to block at a lower level, such as the web server or WordPress.
An alternative way to block robots is to configure your web server so that it refuses to serve content for requests with certain user-agent strings.
This post assumes that you already have an NGINX or Apache web server up and running.
If you already have a list of user-agents and referers that you want to block, let's get started.
NGINX
To configure the user-agent block list, open the nginx configuration file for your website. Depending on your nginx setup or Linux distribution, this file can live in different places (e.g., /etc/nginx/nginx.conf, /etc/nginx/sites-enabled/<your-site>, /usr/local/nginx/conf/nginx.conf, /etc/nginx/conf.d/<your-site>) and contains a server section similar to the following.
server {
    listen 80 default_server;
    server_name techievor.com;
    root /usr/share/nginx/html;
    ....
}
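If you are not sure which file defines the server section for your site, one way to find out is to dump the full merged configuration and search it. This is just a suggestion, assuming nginx 1.9.2 or later (which introduced the -T flag):
sudo nginx -T | grep -E 'configuration file|server_name'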
Let us say you are getting a lot of automated requests with the following user-agent strings and you decide to block them.
- Virus
- BadSpider
- _
- netcrawl
- npbot
- malicious
Once the configuration file is open, locate the server section and add the following if statement(s) somewhere within the section.
server {
    listen 80 default_server;
    server_name techievor.com;
    root /usr/share/nginx/html;

    # case-sensitive matching
    if ($http_user_agent ~ (Virus|BadSpider)) {
        return 403;
    }

    # exact match
    if ($http_user_agent = "_") {
        return 403;
    }

    # case-insensitive matching
    if ($http_user_agent ~* (netcrawl|npbot|malicious)) {
        return 403;
    }

    ....
}
As you can guess, these if statements compare the user-agent string against the given patterns and return a 403 (Forbidden) HTTP status code when a match is found.
$http_user_agent is a variable containing the user-agent string of the HTTP request. The ~ operator performs a case-sensitive regular-expression match, the ~* operator performs a case-insensitive match, and = performs an exact string comparison. The | operator is a logical OR, so you can put several user-agent keywords in a single if statement and block them all.
If you prefer to redirect such requests somewhere else, use a return 301 instead:
# case-insensitive matching
if ($http_user_agent ~* (netcrawl|npbot|malicious)) {
    return 301 https://yoursite.com;
}
The following example, which should go under a location block, lets you block requests coming from unwanted referers.
if ($http_referer ~ "semalt\.com|badsite\.net|example\.com") {
    return 403;
}
After modifying the configuration file, restart (or reload) nginx to activate the ban:
sudo service nginx restart
You can test the user-agent blocking by using wget or curl with the --user-agent option.
wget --user-agent "_" http://<nginx-ip-address>
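You can do the same with curl; the -A option sets the User-Agent header and -e sets the Referer, so the second request below exercises the referer block shown earlier. <nginx-ip-address> is a placeholder for your server's address. Both requests should come back with 403 Forbidden.
curl -A "netcrawl" -I http://<nginx-ip-address>
curl -e "http://semalt.com/" -I http://<nginx-ip-address>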
Managing the User-Agent Blacklist
So far, we have seen how to block HTTP requests matching a few user-agents in nginx. What if you have many different types of crawlers that you want to block?
Since the user-agent blacklist can grow quite large, it is not a good idea to put it all inside nginx's server section. Instead, you can create a separate file that lists all the blocked user-agents. For example, let us create /etc/nginx/useragent.rules and define a map of all blocked user-agents in the following format.
sudo vi /etc/nginx/useragent.rules
map $http_user_agent $badagent {
    default      0;
    ~*malicious  1;
    ~*backdoor   1;
    ~*netcrawler 1;
    ~virus       1;
    ~BadSpider   1;
    ~webbandit   1;
}
Like the previous setup, the ~* operator matches a keyword case-insensitively, while the ~ operator matches it with a case-sensitive regular expression. The default 0 line means that any user-agent not listed in the file is allowed.
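As a side note, if a keyword contains spaces, quote the whole pattern, as in this example from the nginx map documentation:
"~Opera Mini" 1;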
Next, open your website’s nginx configuration file, which contains the http section, and add the following line somewhere inside this section.
http {
    .....
    include /etc/nginx/useragent.rules;
}
Note that the map must be defined in the http context, outside of any server section, which is why we add the include statement here.
Now open the nginx configuration where your server section is defined, and add the following if statement:
server {
    ....
    if ($badagent) {
        return 403;
    }
    ....
}
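Before reloading, you can check the configuration for syntax errors:
sudo nginx -t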
Finally, restart (or reload) nginx.
sudo service nginx restart
Now any user-agent with a keyword listed in /etc/nginx/useragent.rules will be automatically blocked by nginx.
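A quick test with one of the blacklisted keywords (using curl's -A option to set the User-Agent header) should now return 403 Forbidden:
curl -A "webbandit" -I http://<nginx-ip-address>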
Apache
To block user-agents in Apache, you can use the mod_rewrite module. Make sure the module is enabled:
sudo a2enmod rewrite
Assuming .htaccess is already enabled on your server (it is on most servers running Apache), add the following near the top of either the .htaccess file or the respective .conf file.
If you have multiple sites configured and only want to block requests to a specific one, put the rules in that site's VirtualHost section.
The following example blocks any request whose user-agent string contains badcrawler, badbot, or badagent.
<IfModule mod_rewrite.c>
    RewriteEngine On
    # [NC] makes the match case-insensitive; [OR] chains the conditions together
    RewriteCond %{HTTP_USER_AGENT} badcrawler [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} badbot [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} badagent [NC]
    # return 403 Forbidden for any matching request
    RewriteRule .* - [R=403,L]
</IfModule>
If you want to block multiple user-agents in .htaccess, you can also combine them into a single condition like this.
<IfModule mod_rewrite.c>
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} ^(badcrawler|badbot|badagent) [NC]
    RewriteRule .* - [F,L]
</IfModule>
Alternatively, you can use a SetEnvIfNoCase block, which sets an environment variable if the described condition is met. This can be useful if for some reason mod_rewrite is not available.
<IfModule mod_setenvif.c>
    SetEnvIfNoCase User-Agent (badcrawler|badbot|badagent) bad_user_agents
    Order Allow,Deny
    Allow from all
    Deny from env=bad_user_agents
</IfModule>
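Note that Order/Allow/Deny is the older Apache 2.2 access-control syntax; on Apache 2.4 it only works if mod_access_compat is loaded. A rough 2.4-style equivalent using Require directives would look like the sketch below (untested on your setup, so adapt as needed):
<IfModule mod_setenvif.c>
    SetEnvIfNoCase User-Agent (badcrawler|badbot|badagent) bad_user_agents
    <RequireAll>
        Require all granted
        Require not env bad_user_agents
    </RequireAll>
</IfModule>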
To block specific referers, use %{HTTP_REFERER} instead of %{HTTP_USER_AGENT}. The example below blocks requests whose referer contains foobar, MaliciousBot, or SpiderBot.
<IfModule mod_rewrite.c>
    RewriteEngine On
    RewriteCond %{HTTP_REFERER} foobar|MaliciousBot|SpiderBot [NC]
    RewriteRule .* - [R=403,L]
</IfModule>
As usual, restart the Apache server and test the results.
sudo service apache2 restart
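For example, the following requests (using the sample names from above; <apache-ip-address> is just a placeholder for your server) should both be rejected with 403 Forbidden:
curl -A "badbot" -I http://<apache-ip-address>
curl -e "http://spiderbot.example/" -I http://<apache-ip-address>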
Conclusion
I hope the above tips help you stop bad requests so that legitimate ones are not affected. This can be very useful for fending off pingback-based DDoS attacks or blocking other unwanted requests.
If you have any questions or thoughts on this tutorial, feel free to reach out in the comments below.