Block Bots With Cloudflare

Bad Robot! You can block bots with Cloudflare
Image by ergoneon from Pixabay

Like all websites, mine attract the attention of Internet miscreants poking around for vulnerabilities. Thanks to Wordfence and my 7G Firewall, I can block this traffic at my server, but I don't even want it to get that far. Not only does Cloudflare help with performance, there are also ways to block bots with Cloudflare, which I covered in a post a while ago. I've made some advances since then.

What To Look For

When looking through server logs and the data from Wordfence (Live Traffic) and 7G (gfw_log), the entries will show some, or all, of the following: the requesting IP address, the User Agent string, and the requested URL.

Once I have the IP address of the requester, I look it up at Ultratools to see where it's coming from. This gives me the ASN (Autonomous System Number) of the network owner.
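To do this lookup in bulk, the same IP-to-ASN mapping can be scripted. The sketch below parses a pipe-separated response in the style of Team Cymru's whois.cymru.com service; the sample text, ASNs, and IPs are documentation placeholders, not real lookups.

```python
# Sketch of the lookup step: map an IP address to its ASN by parsing a
# bulk IP-to-ASN whois response. The layout mimics the Team Cymru
# service; the ASNs and IPs below are documentation values.

def parse_asn_response(text: str) -> dict:
    """Parse 'ASN | IP | AS name' lines into {ip: (asn, as_name)}."""
    results = {}
    for line in text.strip().splitlines():
        if line.startswith("AS "):          # skip the header row
            continue
        asn, ip, name = (field.strip() for field in line.split("|"))
        results[ip] = (int(asn), name)
    return results

sample = """AS      | IP              | AS name
64500   | 203.0.113.9     | EXAMPLE-HOSTING-1
64501   | 198.51.100.4    | EXAMPLE-HOSTING-2"""

print(parse_asn_response(sample)["203.0.113.9"])  # (64500, 'EXAMPLE-HOSTING-1')
```

From here, the ASN goes straight into the challenge list used in Rule #2 below.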

Clear A Path For The Good Bots

Because Cloudflare prioritizes “Allow” rules, I want to make sure good bots are let through. Cloudflare maintains a list of verified bots, and it’s pretty thorough. For example, if you want Googlebot to crawl your site but don’t want fake Googlebots getting through, the cf.client.bot field knows the good bots from the bad and fake ones. So my first rule is GoodBots: it checks that the request comes from a verified bot, AND that it’s a bot I want crawling my site.

Rule #1

(cf.client.bot and (http.user_agent contains "UptimeRobot" or http.user_agent contains "DuckDuckBot" or http.user_agent contains "Googlebot" or http.user_agent contains "bingbot"))
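In plain terms, Rule #1 allows a request only when Cloudflare has verified it as a known bot and its User Agent matches my shortlist. A rough Python equivalent (the function and parameter names are illustrative; verified_bot stands in for Cloudflare's cf.client.bot signal):

```python
# Hypothetical mirror of Rule #1: allow only bots Cloudflare has
# verified AND that are on my shortlist.
WANTED_BOTS = ("UptimeRobot", "DuckDuckBot", "Googlebot", "bingbot")

def rule1_allow(verified_bot: bool, user_agent: str) -> bool:
    """verified_bot stands in for Cloudflare's cf.client.bot field."""
    return verified_bot and any(bot in user_agent for bot in WANTED_BOTS)

# A verified Googlebot is allowed; an unverified impostor sending the
# same User Agent string is not.
print(rule1_allow(True, "Mozilla/5.0 (compatible; Googlebot/2.1)"))   # True
print(rule1_allow(False, "Mozilla/5.0 (compatible; Googlebot/2.1)"))  # False
```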

Block The Bad Bots

This is where I begin the process to block bots with Cloudflare. As most of my traffic comes from US visitors, I use the JS-Challenge feature on non-US requests, just in case one is a legitimate visitor from outside the US. And if traffic is coming from an unwanted network, I Challenge that as well, using an ASN list I built with the Ultratools lookup from earlier. Most of that list contains ASNs of hosting services, such as Google User Content, Microsoft Azure, Amazon Web Services, and other VPS (Virtual Private Server) hosts. Those are pretty safe to Challenge because they aren’t human browsers. I also Challenge verified bots I don’t specifically Allow in my first rule (above): there are some country-specific search engines I don’t need crawling my site, and other bots I just don’t like.

Rule #2

(ip.geoip.country ne "US") or (ip.geoip.asnum in {11111 22222 33333}) or (cf.client.bot)
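The logic of Rule #2 can be sketched the same way. This is a hypothetical mirror, not Cloudflare's API, and the ASNs are the same placeholders used in the rule above:

```python
# Hypothetical mirror of Rule #2: JS-Challenge non-US traffic, traffic
# from the ASN list, and any verified bot not already Allowed by Rule #1.
CHALLENGED_ASNS = {11111, 22222, 33333}  # placeholder ASNs

def rule2_challenge(country: str, asn: int, verified_bot: bool) -> bool:
    return country != "US" or asn in CHALLENGED_ASNS or verified_bot

print(rule2_challenge("US", 44444, False))  # False: ordinary US visitor
print(rule2_challenge("US", 22222, False))  # True: listed hosting ASN
print(rule2_challenge("DE", 44444, False))  # True: non-US request
```

Note that this rule would also challenge the good bots from Rule #1; it only works because the Allow rule takes priority.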

Edge Cases

Since I haven’t tracked down all the bad bots and ASNs, there are some common requests I don’t want to allow. They’re usually executed by bots that I eventually track down and add their ASNs to the list:

Rule #3

(http.request.uri contains "xmlrpc.php") or (http.request.uri contains "SOME_OTHER_URL") or (http.user_agent contains "SCRAPER_BOT")
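Putting the three rules together, and assuming the Allow rule wins as described above, the evaluation behaves roughly like this ordered check (the field names, action labels, and placeholder values are all illustrative):

```python
# Sketch of the three rules evaluated in priority order: Allow first,
# then the two Challenge rules. Not Cloudflare's API.
WANTED_BOTS = ("UptimeRobot", "DuckDuckBot", "Googlebot", "bingbot")
CHALLENGED_ASNS = {11111, 22222, 33333}  # placeholder ASNs from Rule #2

def firewall_action(req: dict) -> str:
    ua = req["user_agent"]
    # Rule #1 (Allow): verified bots on the shortlist pass untouched.
    if req["verified_bot"] and any(bot in ua for bot in WANTED_BOTS):
        return "allow"
    # Rule #2 (JS-Challenge): non-US, listed ASNs, other verified bots.
    if req["country"] != "US" or req["asn"] in CHALLENGED_ASNS or req["verified_bot"]:
        return "js_challenge"
    # Rule #3: known-bad request paths and scraper User Agents.
    if "xmlrpc.php" in req["uri"] or "SCRAPER_BOT" in ua:
        return "js_challenge"
    return "pass"

req = {"verified_bot": False, "country": "US", "asn": 44444,
       "user_agent": "BadClient/1.0", "uri": "/xmlrpc.php"}
print(firewall_action(req))  # js_challenge
```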

A Few Loose Ends

If your quest to block bad bots with Cloudflare continues, there are a couple of last tips:

  1. I sometimes add one extra rule: Threat Score – (cf.threat_score ge 1). It rarely catches anything at this point, but if any suspicious traffic gets through the previous rules, it may JS-Challenge it.
  2. If you have a specific external connection that is blocked by this set of rules, you can add that exception to the first rule so it is Allowed. My example adds an extra User Agent String check for GOODBOT in Rule #1:
(cf.client.bot and (http.user_agent contains "UptimeRobot" or http.user_agent contains "DuckDuckBot" or http.user_agent contains "Googlebot" or http.user_agent contains "bingbot")) or (http.user_agent contains "GOODBOT")

Final Testing at Cloudflare

After adding these rules, I spend a day or two looking at the Firewall Event Log for false positives and false negatives. In my case, if I see a United States request that was Challenged, I check whether it was unfairly challenged as a bot I want to allow. If I see a non-US request that was Allowed, I double-check to make sure it’s traffic I don’t mind. It’s usually a bot I like, but coming from one of its non-US servers.