Description

The bot-detect plugin identifies web crawlers and blocks them from scraping site resources.

Runtime Properties

Plugin execution phase: authorization phase
Plugin execution priority: 310

Configuration Fields

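| Name | Data type | Required | Default | Description |
|---|---|---|---|---|
| allow | array of string | Optional | - | Regular expressions matched against the User-Agent request header; a request that matches is allowed through |
| deny | array of string | Optional | - | Regular expressions matched against the User-Agent request header; a request that matches is blocked |
| blocked_code | number | Optional | 403 | HTTP status code returned when a request is blocked |
| blocked_message | string | Optional | - | HTTP response body returned when a request is blocked |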

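The allow and deny fields may both be left unset, in which case only the default crawler detection logic runs. Configure allow to let through requests that would otherwise match the default crawler detection logic; configure deny to add extra crawler detection rules.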
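The interaction between allow, deny, and the default rules can be sketched as follows. This is a minimal illustration, not the plugin's implementation: the `BotDetectConfig` class, the `check` function, and the one-pattern `DEFAULT_RULES` list are hypothetical, and the evaluation order (allow, then deny, then defaults) and the use of unanchored matching are assumptions.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical, heavily simplified stand-in for the plugin's default rule set.
DEFAULT_RULES = [re.compile(r"(?:[Bb]ot|[Ss]pider|[Cc]rawl|[Ss]crape)")]


@dataclass
class BotDetectConfig:
    """Mirrors the four configuration fields and their documented defaults."""
    allow: List[str] = field(default_factory=list)
    deny: List[str] = field(default_factory=list)
    blocked_code: int = 403
    blocked_message: str = ""


def check(user_agent: str, cfg: BotDetectConfig) -> Optional[Tuple[int, str]]:
    """Return None to let the request through, or (status, body) to block it."""
    # Assumption: allow rules are consulted first, so a match here overrides
    # both the deny rules and the default rules.
    if any(re.search(p, user_agent) for p in cfg.allow):
        return None
    # Extra, user-supplied crawler rules.
    if any(re.search(p, user_agent) for p in cfg.deny):
        return cfg.blocked_code, cfg.blocked_message
    # Default crawler rules apply even when allow and deny are both empty.
    if any(rule.search(user_agent) for rule in DEFAULT_RULES):
        return cfg.blocked_code, cfg.blocked_message
    return None
```

With an empty configuration, a User-Agent such as `ExampleBot/1.0` is blocked by the stand-in default rule with status 403; adding `ExampleBot` to allow lets it through again.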

The default set of crawler detection regular expressions is as follows:

    # Bots General matcher 'name/0.0'
    (?:\/[A-Za-z0-9.]+|) {0,5}([A-Za-z0-9 -![]:]{0,50}(?:[Aa]rchiver|[Ii]ndexer|[Ss]craper|[Bb]ot|[Ss]pider|[Cc]rawl[a-z]{0,50}))/ (?:.(\d+)(?:.(\d+)|)|)
    # Bots General matcher 'name 0.0'
    (?:\/[A-Za-z0-9.]+|) {0,5}([A-Za-z0-9 -![]:]{0,50}(?:[Aa]rchiver|[Ii]ndexer|[Ss]craper|[Bb]ot|[Ss]pider|[Cc]rawl[a-z]{0,50})) (\d+)(?:.(\d+)(?:.(\d+)|)|)
    # Bots containing spider|scrape|bot(but not CUBOT)|Crawl
    ((?:[A-z0-9]{1,50}|[A-z-]{1,50} ?|)(?: the |)(?:[Ss][Pp][Ii][Dd][Ee][Rr]|[Ss]crape|[Cc][Rr][Aa][Ww][Ll])[A-z0-9]{0,50})(?:(?:[ /]| v)(\d+)(?:.(\d+)|)(?:.(\d+)|)|)
    # Bots Pattern '/name-0.0'
    /((?:Ant-)?Nutch|[A-z]+[Bb]ot|[A-z]+[Ss]pider|Axtaris|fetchurl|Isara|ShopSalad|Tailsweep) -(?:.(\d+)(?:.(\d+))?)?
    # Bots Pattern 'name/0.0'
    \b(008|Altresium|Argus|BaiduMobaider|BoardReader|DNSGroup|DataparkSearch|EDI|Goodzer|Grub|INGRID|Infohelfer|LinkedInBot|LOOQ|Nutch|OgScrper|PathDefender|Peew|PostPost|Steeler|Twitterbot|VSE|WebCrunch|WebZIP|Y!J-BR[A-Z]|YahooSeeker|envolk|sproose|wminer)/(\d+)(?:.(\d+)|)(?:.(\d+)|)
    # More bots
    (CSimpleSpider|Cityreview Robot|CrawlDaddy|CrawlFire|Finderbots|Index crawler|Job Roboter|KiwiStatus Spider|Lijit Crawler|QuerySeekerSpider|ScollSpider|Trends Crawler|USyd-NLP-Spider|SiteCat Webbot|BotName\/\$BotVersion|123metaspider-Bot|1470.net crawler|50.nu|8bo Crawler Bot|Aboundex|Accoona-[A-z]{1,30}-Agent|AdsBot-Google(?:-[a-z]{1,30}|)|altavista|AppEngine-Google|archive.{0,30}.org_bot|archiver|Ask Jeeves|[Bb]ai[Dd]u[Ss]pider(?:-[A-Za-z]{1,30})(?:-[A-Za-z]{1,30}|)|bingbot|BingPreview|blitzbot|BlogBridge|Bloglovin|BoardReader Blog Indexer|BoardReader Favicon Fetcher|boitho.com-dc|BotSeer|BUbiNG|\b\w{0,30}favicon\w{0,30}\b|\bYeti(?:-[a-z]{1,30}|)|Catchpoint(?: bot|)|[Cc]harlotte|Checklinks|clumboot|Comodo HTTP(S) Crawler|Comodo-Webinspector-Crawler|ConveraCrawler|CRAWL-E|CrawlConvera|Daumoa(?:-feedfetcher|)|Feed Seeker Bot|Feedbin|findlinks|Flamingo_SearchEngine|FollowSite Bot|furlbot|Genieo|gigabot|GomezAgent|gonzo1|(?:[a-zA-Z]{1,30}-|)Googlebot(?:-[a-zA-Z]{1,30}|)|Google SketchUp|grub-client|gsa-crawler|heritrix|HiddenMarket|holmes|HooWWWer|htdig|ia_archiver|ICC-Crawler|Icarus6j|ichiro(?:/mobile|)|IconSurf|IlTrovatore(?:-Setaccio|)|InfuzApp|Innovazion Crawler|InternetArchive|IP2[a-z]{1,30}Bot|jbot\b|KaloogaBot|Kraken|Kurzor|larbin|LEIA|LesnikBot|Linguee Bot|LinkAider|LinkedInBot|Lite Bot|Llaut|lycos|Mail.RU_Bot|masscan|masidani_bot|Mediapartners-Google|Microsoft .{0,30} Bot|mogimogi|mozDex|MJ12bot|msnbot(?:-media {0,2}|)|msrbot|Mtps Feed Aggregation System|netresearch|Netvibes|NewsGator[^/]{0,30}|^NING|Nutch[^/]{0,30}|Nymesis|ObjectsSearch|OgScrper|Orbiter|OOZBOT|PagePeeker|PagesInventory|PaxleFramework|Peeplo Screenshot Bot|PlantyNet_WebRobot|Pompos|Qwantify|Read%20Later|Reaper|RedCarpet|Retreiver|Riddler|Rival IQ|scooter|Scrapy|Scrubby|searchsight|seekbot|semanticdiscovery|SemrushBot|Simpy|SimplePie|SEOstats|SimpleRSS|SiteCon|Slackbot-LinkExpanding|Slack-ImgProxy|Slurp|snappy|Speedy Spider|Squrl Java|Stringer|TheUsefulbot|ThumbShotsBot|Thumbshots.ru|Tiny Tiny RSS|Twitterbot|WhatsApp|URL2PNG|Vagabondo|VoilaBot|^vortex|Votay bot|^voyager|WASALive.Bot|Web-sniffer|WebThumb|WeSEE:[A-z]{1,30}|WhatWeb|WIRE|WordPress|Wotbox|www.almaden.ibm.com|Xenu(?:.s|) Link Sleuth|Xerka [A-z]{1,30}Bot|yacy(?:bot|)|YahooSeeker|Yahoo! Slurp|Yandex\w{1,30}|YodaoBot(?:-[A-z]{1,30}|)|YottaaMonitor|Yowedo|^Zao|^Zao-Crawler|ZeBot_www.ze.bz|ZooShot|ZyBorg)(?:[ /]v?(\d+)(?:.(\d+)(?:.(\d+)|)|)|)

Configuration Examples

Allow requests that would otherwise match the crawler rules

    allow:
    - ".*Go-http-client.*"

Without this configuration, requests sent by the default Golang HTTP library are treated as crawler traffic and denied access.

Add crawler detection rules

    deny:
    - "spd-tools.*"

With this configuration, the following requests are denied:

    curl http://example.com -H 'User-Agent: spd-tools/1.1'
    curl http://example.com -H 'User-Agent: spd-tools'
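Before deploying patterns like these, it can help to check them locally against sample User-Agent strings. The sketch below is an illustration only: the `classify` helper is hypothetical, the allow pattern is written as `.*Go-http-client.*` (an assumed reading of the allow example), and unanchored matching with Python's `re` module stands in for whatever regex engine the plugin actually uses.

```python
import re

# Patterns taken from the configuration examples. The ".*" wrapped around
# Go-http-client is an assumed reading of the allow example.
ALLOW = [r".*Go-http-client.*"]
DENY = [r"spd-tools.*"]


def classify(user_agent: str) -> str:
    """Report which example list, if any, a User-Agent string matches."""
    # Assumption: patterns are applied unanchored, allow before deny.
    if any(re.search(p, user_agent) for p in ALLOW):
        return "allow"
    if any(re.search(p, user_agent) for p in DENY):
        return "deny"
    return "no match"


if __name__ == "__main__":
    for ua in ["spd-tools/1.1", "spd-tools", "Go-http-client/1.1", "Mozilla/5.0"]:
        print(f"{ua!r}: {classify(ua)}")
```

Both `spd-tools/1.1` and `spd-tools` match the deny pattern because the trailing `.*` also matches the empty string. For these sample strings the result is the same whether the patterns are anchored to the whole header or not.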