...
Combined robot/crawler user-agent patterns drawn from the PLoS, COUNTER, NEEO, and AWstats exclusion lists (descriptions given where the source table supplied one):

| Pattern | Description |
|---|---|
| [^a]fish | |
| [+:,\.\;\/\\-]bot | |
| acme\.spider | |
| alexa | |
| Alexandria(\s\|\+)prototype(\s\|\+)project | Alexandria prototype project |
| AllenTrack | |
| almaden | |
| appie | |
| Arachmo | Arachmo |
| archive\.org_bot | |
| arks | |
| asterias | |
| atomz | |
| autoemailspider | |
| awbot | |
| baiduspider | |
| bbot | |
| biadu | |
| biglotron | |
| bloglines | |
| blogpulse | |
| boitho\.com-dc | |
| bookmark-manager | |
| bot[+:,\.\;\/\\-] | |
| Brutus\/AET | Brutus/AET |
| bspider | |
| bwh3_user_agent | |
| cfnetwork | |
| checkbot | |
| China\sLocal\sBrowse\s2\.6 | |
| Code Sample Web Client | |
| combine | |
| commons-httpclient | |
| ContentSmartz | |
| core | |
| crawl | |
| cursor | |
| custo | |
| DataCha0s\/2\.0 | |
| Demo\sBot | |
| docomo | |
| DSurf | |
| dtSearchSpider | dtSearchSpider |
| dumbot | |
| easydl | |
| EmailSiphon | |
| EmailWolf | |
| exabot | |
| fast-webcrawler | |
| favorg | |
| FDM(\s\|\+)1 | FDM 1 |
| feedburner | |
| feedfetcher-google | |
| Fetch(\s\|\+)API(\s\|\+)Request | Fetch API Request |
| findlinks | |
| gaisbot | |
| GetRight | GetRight |
| geturl | |
| gigabot | |
| girafabot | |
| gnodspider | |
| Goldfire(\s\|\+)Server | Goldfire Server |
| Googlebot | Googlebot |
| grub | |
| heritrix | |
| hl_ftien_spider | |
| holmes | |
| htdig | |
| htmlparser | |
| httpget-5\.2\.2 | httpget-5.2.2 |
| httrack | |
| HTTrack | HTTrack |
| ia_archiver | |
| ichiro | |
| iktomi | |
| ilse | |
| internetseer | |
| iSiloX | iSiloX |
| java | |
| jeeves | |
| jobo | |
| larbin | |
| libwww-perl | libwww-perl |
| linkbot | |
| linkchecker | |
| linkscan | |
| linkwalker | |
| livejournal\.com | |
| lmspider | |
| LOCKSS | |
| LWP\:\:Simple | LWP::Simple |
| lwp-request | |
| lwp-tivial | |
| lwp-trivial | lwp-trivial |
| lycos | |
| mediapartners-google | |
| megite | |
| Microsoft(\s\|\+)URL(\s\|\+)Control | Microsoft URL Control |
| milbot | Milbot |
| mj12bot | |
| mnogosearch | |
| mojeekbot | |
| momspider | |
| motor | |
| msiecrawler | |
| msnbot | |
| MSNBot | |
| MuscatFerre | |
| myweb | |
| NABOT | |
| nagios | |
| NaverBot | NaverBot |
| netcraft | |
| netluchs | |
| ng\/2\. | |
| no_user_agent | |
| nutch | |
| ocelli | |
| Offline(\s\|\+)Navigator | Offline Navigator |
| OurBrowser | |
| perman | |
| pioneer | |
| playmusic\.com | |
| playstarmusic\.com | |
| powermarks | |
| psbot | |
| python | |
| Python-urllib | |
| qihoobot | |
| rambler | |
| Readpaper | Readpaper |
| redalert | |
| robozilla | |
| robot | |
| scan4mail | |
| scooter | |
| seekbot | |
| seznambot | |
| shoutcast | |
| slurp | |
| sogou | |
| speedy | |
| spider | |
| spider | |
| spiderman | |
| spiderview | |
| Strider | Strider |
| sunrise | |
| superbot | |
| surveybot | |
| T-H-U-N-D-E-R-S-T-O-N-E | T-H-U-N-D-E-R-S-T-O-N-E |
| tailrank | |
| technoratibot | |
| Teleport(\s\|\+)Pro | Teleport Pro |
| Teoma | Teoma |
| titan | |
| turnitinbot | |
| twiceler | |
| ucsd | |
| ultraseek | |
| urlaliasbuilder | |
| voila | |
| w3c-checklink | |
| Wanadoo | |
| Web(\s\|\+)Downloader | Web Downloader |
| WebCloner | WebCloner |
| webcollage | |
| WebCopier | WebCopier |
| Webinator | |
| Webmetrics | |
| webmirror | |
| WebReaper | WebReaper |
| WebStripper | WebStripper |
| WebZIP | WebZIP |
| Wget | Wget |
| wordpress | |
| worm | |
| Xenu(\s\|\+)Link(\s\|\+)Sleuth | Xenu Link Sleuth |
| y!j | |
| yacy | |
| yahoo-mmcrawler | |
| yahoofeedseeker | |
| yahooseeker | |
| yandex | |
| yodaobot | |
| zealbot | |
| zeus | |
| zyborg | |
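Each entry is a regular expression intended to be matched case-insensitively against the User-Agent field of a log line; alternations such as (\s|\+) cover logs in which spaces are URL-encoded as +. A minimal sketch of such matching, assuming Python and using a few representative patterns (the helper name is illustrative):

```python
import re

# A few representative patterns from the table above.
patterns = [
    r"Alexandria(\s|\+)prototype(\s|\+)project",
    r"LWP\:\:Simple",
    r"archive\.org_bot",
]

# Compile once into a single case-insensitive alternation.
robot_re = re.compile("|".join(patterns), re.IGNORECASE)

def is_robot(user_agent: str) -> bool:
    """Return True if the User-Agent matches any exclusion pattern."""
    return robot_re.search(user_agent) is not None

print(is_robot("Alexandria+prototype+project/1.0"))  # True
print(is_robot("Mozilla/5.0 (Windows NT 10.0)"))     # False
```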
The robotlist.txt might look like this:
```
2010-05-06 [^a]fish [+:,\.\;\/\\-]bot acme\.spider alexa Alexandria(\s|\+)prototype(\s|\+)project AllenTrack almaden appie Arachmo archive\.org_bot arks asterias atomz autoemailspider awbot baiduspider bbot biadu biglotron bloglines blogpulse boitho\.com\-dc bookmark\-manager bot[+:,\.\;\/\\-] Brutus\/AET bspider bwh3_user_agent cfnetwork checkbot China\sLocal\sBrowse\s2\.6 combine commons\-httpclient ContentSmartz core crawl cursor custo DataCha0s\/2\.0 Demo\sBot docomo DSurf dtSearchSpider dumbot easydl EmailSiphon EmailWolf exabot fast-webcrawler favorg FDM(\s|\+)1 feedburner feedfetcher\-google Fetch(\s|\+)API(\s|\+)Request findlinks gaisbot GetRight geturl gigabot girafabot gnodspider Goldfire(\s|\+)Server Googlebot grub heritrix hl_ftien_spider holmes htdig htmlparser httpget\-5\.2\.2 httrack HTTrack ia_archiver ichiro iktomi ilse internetseer iSiloX java jeeves jobo larbin libwww\-perl linkbot linkchecker linkscan linkwalker livejournal\.com lmspider LOCKSS LWP\:\:Simple lwp\-request lwp\-tivial lwp\-trivial lycos mediapartners\-google megite Microsoft(\s|\+)URL(\s|\+)Control milbot mj12bot mnogosearch mojeekbot momspider motor msiecrawler msnbot MuscatFerre myweb NABOT nagios NaverBot netcraft netluchs ng\/2\. no_user_agent nutch ocelli Offline(\s|\+)Navigator OurBrowser perman pioneer playmusic\.com powermarks psbot python qihoobot rambler Readpaper redalert robozilla robot scan4mail scooter seekbot seznambot shoutcast slurp sogou speedy spider spider spiderman spiderview Strider sunrise superbot surveybot T\-H\-U\-N\-D\-E\-R\-S\-T\-O\-N\-E tailrank technoratibot Teleport(\s|\+)Pro Teoma titan turnitinbot twiceler ucsd ultraseek urlaliasbuilder voila w3c\-checklink Wanadoo Web(\s|\+)Downloader WebCloner webcollage WebCopier Webinator Webmetrics webmirror WebReaper WebStripper WebZIP Wget wordpress worm Xenu(\s|\+)Link(\s|\+)Sleuth y!j yacy yahoo\-mmcrawler yahoofeedseeker yahooseeker yandex yodaobot zealbot zeus zyborg
```
Or it might look like the proposed XML version of the robot exclusion list:
```xml
<?xml version="1.0" encoding="UTF-8" ?>
<exclusions version="1.0" datestamp="2010-04-10">
  <robot-list source="COUNTER" version="R3" datestamp="2010-04-01">
    <description>Human-friendly description/notes about the COUNTER exclusion list</description>
    <useragent>String to match for COUNTER</useragent>
    <useragent>Another string to match for COUNTER</useragent>
    <useragent>Etc.</useragent>
  </robot-list>
  <robot-list source="AWStats" version="x" datestamp="2009-10-02">
    <description>Human-friendly description/notes about the AWStats exclusion list</description>
    <useragent>String to match for AWStats</useragent>
    <useragent>Another string to match for AWStats</useragent>
    <useragent>Etc.</useragent>
  </robot-list>
  <robot-list source="PLoS" version="y" datestamp="2010-03-11">
    <description>Human-friendly description/notes about the PLoS exclusion list</description>
    <useragent>String to match for PLoS</useragent>
    <useragent>Another string to match for PLoS</useragent>
    <useragent>Etc.</useragent>
  </robot-list>
</exclusions>
```
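The XML variant can be read with standard tooling. A sketch using Python's ElementTree, assuming the element and attribute names shown above (the abbreviated sample content here is illustrative):

```python
import xml.etree.ElementTree as ET

# Inline sample in the proposed format (content abbreviated for the sketch).
XML = """<exclusions version="1.0" datestamp="2010-04-10">
  <robot-list source="COUNTER" version="R3" datestamp="2010-04-01">
    <description>COUNTER exclusion list</description>
    <useragent>Googlebot</useragent>
    <useragent>Wget</useragent>
  </robot-list>
</exclusions>"""

def parse_exclusions(xml_text: str) -> dict:
    """Map each source name to its list of user-agent match strings."""
    root = ET.fromstring(xml_text)
    return {
        rl.get("source"): [ua.text for ua in rl.findall("useragent")]
        for rl in root.findall("robot-list")
    }

print(parse_exclusions(XML))  # {'COUNTER': ['Googlebot', 'Wget']}
```

Keeping each source's patterns in its own robot-list element preserves the provenance and datestamp that the flat robotlist.txt format discards.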