# Bot Detect

## Description

The `bot-detect` plugin can be used to identify web crawlers and prevent them from crawling websites.

## Configuration Fields

| Name | Type | Requirement | Default Value | Description |
| --- | --- | --- | --- | --- |
| `allow` | array of string | Optional | - | Regular expressions matched against the `User-Agent` request header. A request is allowed when one of them matches. |
| `deny` | array of string | Optional | - | Regular expressions matched against the `User-Agent` request header. A request is blocked when one of them matches. |
| `blocked_code` | number | Optional | 403 | The HTTP status code returned when a request is blocked. |
| `blocked_message` | string | Optional | - | The HTTP response body returned when a request is blocked. |

If neither the `allow` field nor the `deny` field is configured, the default crawler-identification logic is executed. Configuring the `allow` field releases requests that would otherwise hit the default logic; configuring the `deny` field extends the judgement with additional rules.
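For instance, a minimal sketch of a configuration that combines these fields (the `spd-tools` pattern, the status code, and the message are illustrative values, not defaults):

```yaml
# Block clients whose User-Agent matches "spd-tools.*"
# and customize the blocked response
deny:
- "spd-tools.*"
blocked_code: 429
blocked_message: "crawler traffic is not allowed"
```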

The default set of crawler judgment regular expressions is as follows:

```
# Bots General matcher 'name/0.0'
(?:\/[A-Za-z0-9\.]+|) {0,5}([A-Za-z0-9 \-_!\[\]:]{0,50}(?:[Aa]rchiver|[Ii]ndexer|[Ss]craper|[Bb]ot|[Ss]pider|[Cc]rawl[a-z]{0,50}))\/(\d+)(?:\.(\d+)(?:\.(\d+)|)|)
# Bots General matcher 'name 0.0'
(?:\/[A-Za-z0-9\.]+|) {0,5}([A-Za-z0-9 \-_!\[\]:]{0,50}(?:[Aa]rchiver|[Ii]ndexer|[Ss]craper|[Bb]ot|[Ss]pider|[Cc]rawl[a-z]{0,50})) (\d+)(?:\.(\d+)(?:\.(\d+)|)|)
# Bots containing spider|scrape|bot(but not CUBOT)|Crawl
((?:[A-z0-9]{1,50}|[A-z\-]{1,50} ?|)(?: the |)(?:[Ss][Pp][Ii][Dd][Ee][Rr]|[Ss]crape|[Cc][Rr][Aa][Ww][Ll])[A-z0-9]{0,50})(?:(?:[ /]| v)(\d+)(?:\.(\d+)|)(?:\.(\d+)|)|)
# Bots Pattern '/name-0.0'
/((?:Ant-)?Nutch|[A-z]+[Bb]ot|[A-z]+[Ss]pider|Axtaris|fetchurl|Isara|ShopSalad|Tailsweep)[ \-](\d+)(?:\.(\d+)(?:\.(\d+))?)?
# Bots Pattern 'name/0.0'
\b(008|Altresium|Argus|BaiduMobaider|BoardReader|DNSGroup|DataparkSearch|EDI|Goodzer|Grub|INGRID|Infohelfer|LinkedInBot|LOOQ|Nutch|OgScrper|PathDefender|Peew|PostPost|Steeler|Twitterbot|VSE|WebCrunch|WebZIP|Y!J-BR[A-Z]|YahooSeeker|envolk|sproose|wminer)/(\d+)(?:\.(\d+)|)(?:\.(\d+)|)
# More bots
(CSimpleSpider|Cityreview Robot|CrawlDaddy|CrawlFire|Finderbots|Index crawler|Job Roboter|KiwiStatus Spider|Lijit Crawler|QuerySeekerSpider|ScollSpider|Trends Crawler|USyd-NLP-Spider|SiteCat Webbot|BotName\/\$BotVersion|123metaspider-Bot|1470\.net crawler|50\.nu|8bo Crawler Bot|Aboundex|Accoona-[A-z]{1,30}-Agent|AdsBot-Google(?:-[a-z]{1,30}|)|altavista|AppEngine-Google|archive.{0,30}\.org_bot|archiver|Ask Jeeves|[Bb]ai[Dd]u[Ss]pider(?:-[A-Za-z]{1,30})(?:-[A-Za-z]{1,30}|)|bingbot|BingPreview|blitzbot|BlogBridge|Bloglovin|BoardReader Blog Indexer|BoardReader Favicon Fetcher|boitho\.com-dc|BotSeer|BUbiNG|\b\w{0,30}favicon\w{0,30}\b|\bYeti(?:-[a-z]{1,30}|)|Catchpoint(?: bot|)|[Cc]harlotte|Checklinks|clumboot|Comodo HTTP\(S\) Crawler|Comodo-Webinspector-Crawler|ConveraCrawler|CRAWL-E|CrawlConvera|Daumoa(?:-feedfetcher|)|Feed Seeker Bot|Feedbin|findlinks|Flamingo_SearchEngine|FollowSite Bot|furlbot|Genieo|gigabot|GomezAgent|gonzo1|(?:[a-zA-Z]{1,30}-|)Googlebot(?:-[a-zA-Z]{1,30}|)|Google SketchUp|grub-client|gsa-crawler|heritrix|HiddenMarket|holmes|HooWWWer|htdig|ia_archiver|ICC-Crawler|Icarus6j|ichiro(?:/mobile|)|IconSurf|IlTrovatore(?:-Setaccio|)|InfuzApp|Innovazion Crawler|InternetArchive|IP2[a-z]{1,30}Bot|jbot\b|KaloogaBot|Kraken|Kurzor|larbin|LEIA|LesnikBot|Linguee Bot|LinkAider|LinkedInBot|Lite Bot|Llaut|lycos|Mail\.RU_Bot|masscan|masidani_bot|Mediapartners-Google|Microsoft .{0,30} Bot|mogimogi|mozDex|MJ12bot|msnbot(?:-media {0,2}|)|msrbot|Mtps Feed Aggregation System|netresearch|Netvibes|NewsGator[^/]{0,30}|^NING|Nutch[^/]{0,30}|Nymesis|ObjectsSearch|OgScrper|Orbiter|OOZBOT|PagePeeker|PagesInventory|PaxleFramework|Peeplo Screenshot Bot|PlantyNet_WebRobot|Pompos|Qwantify|Read%20Later|Reaper|RedCarpet|Retreiver|Riddler|Rival IQ|scooter|Scrapy|Scrubby|searchsight|seekbot|semanticdiscovery|SemrushBot|Simpy|SimplePie|SEOstats|SimpleRSS|SiteCon|Slackbot-LinkExpanding|Slack-ImgProxy|Slurp|snappy|Speedy Spider|Squrl Java|Stringer|TheUsefulbot|ThumbShotsBot|Thumbshots\.ru|Tiny Tiny RSS|Twitterbot|WhatsApp|URL2PNG|Vagabondo|VoilaBot|^vortex|Votay bot|^voyager|WASALive.Bot|Web-sniffer|WebThumb|WeSEE:[A-z]{1,30}|WhatWeb|WIRE|WordPress|Wotbox|www\.almaden\.ibm\.com|Xenu(?:.s|) Link Sleuth|Xerka [A-z]{1,30}Bot|yacy(?:bot|)|YahooSeeker|Yahoo! Slurp|Yandex\w{1,30}|YodaoBot(?:-[A-z]{1,30}|)|YottaaMonitor|Yowedo|^Zao|^Zao-Crawler|ZeBot_www\.ze\.bz|ZooShot|ZyBorg)(?:[ /]v?(\d+)(?:\.(\d+)(?:\.(\d+)|)|)|)
```
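To illustrate, assuming the plugin is enabled with neither `allow` nor `deny` configured (and `example.com` standing in for a real domain), requests like the following hit the default rules and receive the configured `blocked_code`:

```bash
# "Scrapy" appears in the default bot list
curl http://example.com -H 'User-Agent: Scrapy/2.11.2'
# "MyCrawler/1.0" matches the generic 'name/0.0' crawler pattern
curl http://example.com -H 'User-Agent: MyCrawler/1.0'
```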

## Configuration Samples

### Release Requests That Would Otherwise Hit the Crawler Rules

```yaml
allow:
- ".*Go-http-client.*"
```

Without this configuration, requests sent by the default Golang web library would be treated as crawler traffic and denied access.
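Go's `net/http` client sends `User-Agent: Go-http-client/1.1` when no User-Agent is set explicitly, so with the `allow` configuration above, a request such as the following (with `example.com` as a placeholder) is released:

```bash
curl http://example.com -H 'User-Agent: Go-http-client/1.1'
```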

### Add Crawler Judgement

```yaml
deny:
- "spd-tools.*"
```

According to this configuration, the following requests will be denied:

```bash
curl http://example.com -H 'User-Agent: spd-tools/1.1'
curl http://example.com -H 'User-Agent: spd-tools'
```

### Only Enabled for Specific Routes or Domains

```yaml
# Use the rules field for fine-grained rule configuration
rules:
# Rule 1: Match by route name
- match_route:
  - route-a
  - route-b
  allow:
  - ".*Go-http-client.*"
# Rule 2: Match by domain
- match_domain:
  - "*.example.com"
  - test.com
  allow:
  - ".*Go-http-client.*"
```

In the sample rule of _match_route_, `route-a` and `route-b` are the route names filled in when creating gateway routes; the rule applies when one of these route names is matched. In the sample rule of _match_domain_, `*.example.com` and `test.com` are used to match the domain of the request; the rule applies when one of these domains is matched. Rules are checked in the order of the items in the _rules_ field: the first matched rule is applied, and all remaining rules are ignored.
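Assuming each entry under _rules_ carries a complete plugin configuration, as the samples above suggest, a rule can also combine _deny_ with a customized blocked response. A minimal sketch (the route name, pattern, and message are placeholders):

```yaml
rules:
# Block spd-tools clients on route-a with a custom response
- match_route:
  - route-a
  deny:
  - "spd-tools.*"
  blocked_code: 403
  blocked_message: "crawler traffic is not allowed"
```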