## Function Description
The `bot-detect` plugin can be used to identify and block internet crawlers from accessing site resources.
## Running Properties

Plugin execution phase: Authorization Phase

Plugin execution priority: 310
## Configuration Fields

| Name | Data Type | Required | Default Value | Description |
| ---- | --------- | -------- | ------------- | ----------- |
| allow | array of string | Optional | - | Regular expressions matched against the User-Agent request header; matching requests are allowed access. |
| deny | array of string | Optional | - | Regular expressions matched against the User-Agent request header; matching requests are blocked. |
| blocked_code | number | Optional | 403 | HTTP status code returned when a request is blocked. |
| blocked_message | string | Optional | - | HTTP response body returned when a request is blocked. |
The `allow` and `deny` fields can both be left unconfigured, in which case the default crawler identification logic is executed. Configuring the `allow` field lets through requests that would otherwise hit the default crawler identification logic, while configuring the `deny` field adds extra crawler identification rules on top of the default set.
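
For instance, a configuration combining all four fields might look like the sketch below; every value here is illustrative, not a shipped default:

```yaml
# A minimal sketch combining the fields above; the regexes and message are illustrative.
allow:
- ".*Go-http-client.*"        # release a UA the default rules would classify as a bot
deny:
- "spd-tools.*"               # add an extra crawler rule on top of the default set
blocked_code: 404             # override the default 403 response code
blocked_message: "Not Found"  # custom response body for blocked requests
```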
The default crawler identification regular expression set is as follows:
```
# Bots General matcher 'name/0.0'
(?:\/[A-Za-z0-9\.]+|) {0,5}([A-Za-z0-9 \-_\!\[\]:]{0,50}(?:[Aa]rchiver|[Ii]ndexer|[Ss]craper|[Bb]ot|[Ss]pider|[Cc]rawl[a-z]{0,50}))\/(\d+)(?:\.(\d+)(?:\.(\d+)|)|)
# Bots General matcher 'name 0.0'
(?:\/[A-Za-z0-9\.]+|) {0,5}([A-Za-z0-9 \-_\!\[\]:]{0,50}(?:[Aa]rchiver|[Ii]ndexer|[Ss]craper|[Bb]ot|[Ss]pider|[Cc]rawl[a-z]{0,50})) (\d+)(?:\.(\d+)(?:\.(\d+)|)|)
# Bots containing spider|scrape|bot(but not CUBOT)|Crawl
((?:[A-z0-9]{1,50}|[A-z\-]{1,50} ?|)(?: the |)(?:[Ss][Pp][Ii][Dd][Ee][Rr]|[Ss]crape|[Cc][Rr][Aa][Ww][Ll])[A-z0-9]{0,50})(?:(?:[ /]| v)(\d+)(?:\.(\d+)|)(?:\.(\d+)|)|)
# Bots Pattern '/name-0.0'
/((?:Ant-)?Nutch|[A-z]+[Bb]ot|[A-z]+[Ss]pider|Axtaris|fetchurl|Isara|ShopSalad|Tailsweep)-(\d+)(?:\.(\d+)(?:\.(\d+))?)?
# Bots Pattern 'name/0.0'
\b(008|Altresium|Argus|BaiduMobaider|BoardReader|DNSGroup|DataparkSearch|EDI|Goodzer|Grub|INGRID|Infohelfer|LinkedInBot|LOOQ|Nutch|OgScrper|PathDefender|Peew|PostPost|Steeler|Twitterbot|VSE|WebCrunch|WebZIP|Y!J-BR[A-Z]|YahooSeeker|envolk|sproose|wminer)\/(\d+)(?:\.(\d+)|)(?:\.(\d+)|)
(CSimpleSpider|Cityreview Robot|CrawlDaddy|CrawlFire|Finderbots|Index crawler|Job Roboter|KiwiStatus Spider|Lijit Crawler|QuerySeekerSpider|ScollSpider|Trends Crawler|USyd-NLP-Spider|SiteCat Webbot|BotName\/\$BotVersion|123metaspider-Bot|1470\.net crawler|50\.nu|8bo Crawler Bot|Aboundex|Accoona-[A-z]{1,30}-Agent|AdsBot-Google(?:-[a-z]{1,30}|)|altavista|AppEngine-Google|archive.{0,30}\.org_bot|archiver|Ask Jeeves|[Bb]ai[Dd]u[Ss]pider(?:-[A-Za-z]{1,30})(?:-[A-Za-z]{1,30}|)|bingbot|BingPreview|blitzbot|BlogBridge|Bloglovin|BoardReader Blog Indexer|BoardReader Favicon Fetcher|boitho\.com-dc|BotSeer|BUbiNG|\b\w{0,30}favicon\w{0,30}\b|\bYeti(?:-[a-z]{1,30}|)|Catchpoint(?: bot|)|[Cc]harlotte|Checklinks|clumboot|Comodo HTTP\(S\) Crawler|Comodo-Webinspector-Crawler|ConveraCrawler|CRAWL-E|CrawlConvera|Daumoa(?:-feedfetcher|)|Feed Seeker Bot|Feedbin|findlinks|Flamingo_SearchEngine|FollowSite Bot|furlbot|Genieo|gigabot|GomezAgent|gonzo1|(?:[a-zA-Z]{1,30}-|)Googlebot(?:-[a-zA-Z]{1,30}|)|Google SketchUp|grub-client|gsa-crawler|heritrix|HiddenMarket|holmes|HooWWWer|htdig|ia_archiver|ICC-Crawler|Icarus6j|ichiro(?:/mobile|)|IconSurf|IlTrovatore(?:-Setaccio|)|InfuzApp|Innovazion Crawler|InternetArchive|IP2[a-z]{1,30}Bot|jbot\b|KaloogaBot|Kraken|Kurzor|larbin|LEIA|LesnikBot|Linguee Bot|LinkAider|LinkedInBot|Lite Bot|Llaut|lycos|Mail\.RU_Bot|masscan|masidani_bot|Mediapartners-Google|Microsoft .{0,30} Bot|mogimogi|mozDex|MJ12bot|msnbot(?:-media {0,2}|)|msrbot|Mtps Feed Aggregation System|netresearch|Netvibes|NewsGator[^/]{0,30}|^NING|Nutch[^/]{0,30}|Nymesis|ObjectsSearch|OgScrper|Orbiter|OOZBOT|PagePeeker|PagesInventory|PaxleFramework|Peeplo Screenshot Bot|PlantyNet_WebRobot|Pompos|Qwantify|Read%20Later|Reaper|RedCarpet|Retreiver|Riddler|Rival IQ|scooter|Scrapy|Scrubby|searchsight|seekbot|semanticdiscovery|SemrushBot|Simpy|SimplePie|SEOstats|SimpleRSS|SiteCon|Slackbot-LinkExpanding|Slack-ImgProxy|Slurp|snappy|Speedy Spider|Squrl Java|Stringer|TheUsefulbot|ThumbShotsBot|Thumbshots\.ru|Tiny Tiny RSS|Twitterbot|WhatsApp|URL2PNG|Vagabondo|VoilaBot|^vortex|Votay bot|^voyager|WASALive.Bot|Web-sniffer|WebThumb|WeSEE:[A-z]{1,30}|WhatWeb|WIRE|WordPress|Wotbox|www\.almaden\.ibm\.com|Xenu(?:.s|) Link Sleuth|Xerka [A-z]{1,30}Bot|yacy(?:bot|)|YahooSeeker|Yahoo! Slurp|Yandex\w{1,30}|YodaoBot(?:-[A-z]{1,30}|)|YottaaMonitor|Yowedo|^Zao|^Zao-Crawler|ZeBot_www\.ze\.bz|ZooShot|ZyBorg)(?:[ /]v?(\d+)(?:\.(\d+)(?:\.(\d+)|)|)|)
```
## Configuration Example

### Allowing Requests That Hit the Crawler Rules
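
The snippet below is a sketch of such an allow rule; Go's standard HTTP client sends `Go-http-client/1.1` (or `Go-http-client/2.0`) as its default User-Agent, and the regex here is one illustrative way to match it:

```yaml
allow:
# Matches the default Go net/http User-Agent, e.g. "Go-http-client/1.1"
- ".*Go-http-client.*"
```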
Without this configuration, requests from Go's default network library are treated as crawlers and blocked.
### Adding Crawler Identification
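
A sketch of an additional deny rule follows; the `spd-tools` User-Agent used here is purely illustrative:

```yaml
deny:
# Blocks any User-Agent beginning with "spd-tools"
- "spd-tools.*"
```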
With this configuration, requests like the following will be blocked:
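
Both examples below assume the illustrative `spd-tools` rule sketched above; any request whose User-Agent matches the regex is rejected with the configured `blocked_code`:

```bash
# Illustrative requests; both match "spd-tools.*" and are blocked
curl http://example.com -H 'User-Agent: spd-tools/1.1'
curl http://example.com -H 'User-Agent: spd-tools'
```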