
Comparison of user-agent exclusion patterns across the PLoS, COUNTER, NEEO, and AWStats robot lists. Each entry is a regular expression matched against the HTTP User-Agent field; the Description column gives a human-readable agent name where one was recorded. The per-source membership cells (PLoS, COUNTER, NEEO, AWStats) are empty in this version of the page and are omitted below; rows whose pattern cell is missing are marked "(pattern missing)". Patterns written with (\s|\+) match the word separator either as literal whitespace or as the '+' that some log formats substitute for spaces.

User-agent pattern                            Description
[^a]fish
[+:,\.\;\/\\-]bot
acme\.spider
alexa
Alexandria(\s|\+)prototype(\s|\+)project      Alexandria prototype project
AllenTrack
almaden
appie
Arachmo                                       Arachmo
archive\.org_bot
arks
asterias
atomz
autoemailspider
awbot
baiduspider
bbot
biadu
biglotron
bloglines
blogpulse
boitho\.com\-dc
bookmark\-manager
bot[+:,\.\;\/\\-]
Brutus\/AET                                   Brutus/AET
bspider
bwh3_user_agent
cfnetwork|checkbot
China\sLocal\sBrowse\s2\.6
(pattern missing)                             Code Sample Web Client
combine
commons\-httpclient
ContentSmartz
core
crawl
cursor
custo
DataCha0s\/2\.0
Demo\sBot
docomo
DSurf
dtSearchSpider                                dtSearchSpider
dumbot
easydl
EmailSiphon
EmailWolf
exabot
fast-webcrawler
favorg
FDM(\s|\+)1                                   FDM 1
feedburner
feedfetcher\-google
Fetch(\s|\+)API(\s|\+)Request                 Fetch API Request
findlinks
gaisbot
GetRight                                      GetRight
geturl
gigabot
girafabot
gnodspider
Goldfire(\s|\+)Server                         Goldfire Server
Googlebot                                     Googlebot
grub
heritrix
hl_ftien_spider
holmes
htdig
htmlparser
httpget\-5\.2\.2                              httpget-5.2.2
httrack
HTTrack                                       HTTrack
ia_archiver
ichiro
iktomi
ilse
internetseer
iSiloX                                        iSiloX
java
jeeves
jobo
larbin
libwww\-perl                                  libwww-perl
linkbot
linkchecker
linkscan
linkwalker
livejournal\.com
lmspider
LOCKSS
LWP\:\:Simple                                 LWP::Simple
lwp\-request
lwp\-tivial
lwp\-trivial                                  lwp-trivial
lycos
mediapartners\-google
megite
Microsoft(\s|\+)URL(\s|\+)Control             Microsoft URL Control
milbot                                        Milbot
mj12bot
mnogosearch
mojeekbot
momspider
motor
msiecrawler
msnbot
(pattern missing)                             MSNBot
MuscatFerre
myweb
NABOT
nagios
NaverBot                                      NaverBot
netcraft
netluchs
ng\/2\.
no_user_agent
nutch
ocelli
Offline(\s|\+)Navigator                       Offline Navigator
OurBrowser
perman
pioneer
playmusic\.com
(pattern missing)                             playstarmusic.com
powermarks
psbot
python
(pattern missing)                             Python-urllib
qihoobot
rambler
Readpaper                                     Readpaper
redalert|robozilla
robot
scan4mail
scooter
seekbot
seznambot
shoutcast
slurp
sogou
speedy
spider
spider
spiderman
spiderview
Strider                                       Strider
sunrise
superbot
surveybot
T\-H\-U\-N\-D\-E\-R\-S\-T\-O\-N\-E            T-H-U-N-D-E-R-S-T-O-N-E
tailrank
technoratibot
Teleport(\s|\+)Pro                            Teleport Pro
Teoma                                         Teoma
titan
turnitinbot
twiceler
ucsd
ultraseek
urlaliasbuilder
voila
w3c\-checklink
Wanadoo
Web(\s|\+)Downloader                          Web Downloader
WebCloner                                     WebCloner
webcollage
WebCopier                                     WebCopier
Webinator
Webmetrics
webmirror
WebReaper                                     WebReaper
WebStripper                                   WebStripper
WebZIP                                        WebZIP
Wget                                          Wget
wordpress
worm
Xenu(\s|\+)Link(\s|\+)Sleuth                  Xenu Link Sleuth
y!j
yacy
yahoo\-mmcrawler
yahoofeedseeker
yahooseeker
yandex
yodaobot
zealbot
zeus
zyborg
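
Most of these entries are plain substrings, but several rely on regular-expression syntax. The short sketch below is a non-authoritative illustration using Python's re module as a stand-in for whatever engine the log processor uses; the sample User-Agent strings are invented. It shows how a few of the less obvious patterns behave:

import re

# Pattern -> sample User-Agent strings to test it against (samples are made up).
samples = {
    r"[^a]fish": ["starfish agent", "afish"],            # any character except 'a' before "fish"
    r"bot[+:,\.\;\/\\-]": ["Examplebot/1.0", "botany"],  # "bot" must be followed by a delimiter
    r"Demo\sBot": ["Demo Bot v2", "DemoBot"],            # \s requires literal whitespace
    r"Offline(\s|\+)Navigator": ["Offline Navigator", "Offline+Navigator"],  # space or '+'
}

for pattern, agents in samples.items():
    for agent in agents:
        hit = re.search(pattern, agent) is not None
        print(f"{pattern!r:30} vs {agent!r:22} -> {hit}")
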
The robot list file (robotsexclusionlist.txt) might look like this:

Listing 6: robotsexclusionlist.txt
2010-05-06
[^a]fish
[+:,\.\;\/\\-]bot
acme\.spider
alexa
Alexandria(\s|\+)prototype(\s|\+)project
AllenTrack
almaden
appie
Arachmo
archive\.org_bot
arks
asterias
atomz
autoemailspider
awbot
baiduspider
bbot
biadu
biglotron
bloglines
blogpulse
boitho\.com\-dc
bookmark\-manager
bot[+:,\.\;\/\\-]
Brutus\/AET
bspider
bwh3_user_agent
cfnetwork|checkbot
China\sLocal\sBrowse\s2\.6
combine
commons\-httpclient
ContentSmartz
core
crawl
cursor
custo
DataCha0s\/2\.0
Demo\sBot
docomo
DSurf
dtSearchSpider
dumbot
easydl
EmailSiphon
EmailWolf
exabot
fast-webcrawler
favorg
FDM(\s|\+)1
feedburner
feedfetcher\-google
Fetch(\s|\+)API(\s|\+)Request
findlinks
gaisbot
GetRight
geturl
gigabot
girafabot
gnodspider
Goldfire(\s|\+)Server
Googlebot
grub
heritrix
hl_ftien_spider
holmes
htdig
htmlparser
httpget\-5\.2\.2
httrack
HTTrack
ia_archiver
ichiro
iktomi
ilse
internetseer
iSiloX
java
jeeves
jobo
larbin
libwww\-perl
linkbot
linkchecker
linkscan
linkwalker
livejournal\.com
lmspider
LOCKSS
LWP\:\:Simple
lwp\-request
lwp\-tivial
lwp\-trivial
lycos
mediapartners\-google
megite
Microsoft(\s|\+)URL(\s|\+)Control
milbot
mj12bot
mnogosearch
mojeekbot
momspider
motor
msiecrawler
msnbot
MuscatFerre
myweb
NABOT
nagios
NaverBot
netcraft
netluchs
ng\/2\.
no_user_agent
nutch
ocelli
Offline(\s|\+)Navigator
OurBrowser
perman
pioneer
playmusic\.com
powermarks
psbot
python
qihoobot
rambler
Readpaper
redalert|robozilla
robot
scan4mail
scooter
seekbot
seznambot
shoutcast
slurp
sogou
speedy
spider
spider
spiderman
spiderview
Strider
sunrise
superbot
surveybot
T\-H\-U\-N\-D\-E\-R\-S\-T\-O\-N\-E
tailrank
technoratibot
Teleport(\s|\+)Pro
Teoma
titan
turnitinbot
twiceler
ucsd
ultraseek
urlaliasbuilder
voila
w3c\-checklink
Wanadoo
Web(\s|\+)Downloader
WebCloner
webcollage
WebCopier
Webinator
Webmetrics
webmirror
WebReaper
WebStripper
WebZIP
Wget
wordpress
worm
Xenu(\s|\+)Link(\s|\+)Sleuth
y!j
yacy
yahoo\-mmcrawler
yahoofeedseeker
yahooseeker
yandex
yodaobot
zealbot
zeus
zyborg
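
Neither listing states how the file is to be consumed, but the natural reading is: the first line is a date stamp, each following line is one regular expression, and a log record is excluded when its User-Agent field matches any of them. Since the list distinguishes httrack from HTTrack (and msnbot from MSNBot), matching is presumably case-sensitive. The following Python sketch is a minimal illustration under those assumptions; the file name comes from the listing title, everything else is guesswork:

import re

def load_robot_patterns(path="robotsexclusionlist.txt"):
    """Read the flat exclusion list; the first non-blank line is a date stamp."""
    with open(path, encoding="utf-8") as fh:
        lines = [line.strip() for line in fh if line.strip()]
    datestamp, patterns = lines[0], lines[1:]
    # Fold all patterns into one alternation so each record is scanned once.
    # No re.IGNORECASE: the list itself distinguishes httrack from HTTrack.
    combined = re.compile("|".join(f"(?:{p})" for p in patterns))
    return datestamp, combined

def is_robot(user_agent, combined):
    """True if the User-Agent string matches any exclusion pattern."""
    return combined.search(user_agent) is not None

if __name__ == "__main__":
    stamp, robots = load_robot_patterns()
    print(is_robot("Googlebot/2.1 (+http://www.google.com/bot.html)", robots))  # True
    print(is_robot("Mozilla/5.0 (Windows NT 10.0) Firefox/115.0", robots))      # False
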

Alternatively, the exclusion list might take the proposed XML form:

Listing 6: robotsexclusionlist.xml
<?xml version="1.0" encoding="UTF-8" ?>
<exclusions version="1.0" datestamp="2010-04-10">
  <robot-list source="COUNTER" version="R3" datestamp="2010-04-01">
    <description>Human-friendly description/notes about the COUNTER exclusion list</description>
    <useragent>String to match for COUNTER</useragent>
    <useragent>Another string to match for COUNTER</useragent>
    <useragent>Etc.</useragent>
  </robot-list>
  <robot-list source="AWStats" version="x" datestamp="2009-10-02">
    <description>Human-friendly description/notes about the AWStats exclusion list</description>
    <useragent>String to match for AWStats</useragent>
    <useragent>Another string to match for AWStats</useragent>
    <useragent>Etc.</useragent>
  </robot-list>
  <robot-list source="PLoS" version="y" datestamp="2010-03-11">
    <description>Human-friendly description/notes about the PLoS exclusion list</description>
    <useragent>String to match for PLoS</useragent>
    <useragent>Another string to match for PLoS</useragent>
    <useragent>Etc.</useragent>
  </robot-list>
</exclusions>
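
The XML form carries the same pattern strings but adds provenance: each <robot-list> element records its source, version, and datestamp, so a consumer can select one source's list or merge several. A sketch of how such a file might be read (element and attribute names are taken from the proposal above; the selection logic is an assumption):

import xml.etree.ElementTree as ET

def load_exclusions(path="robotsexclusionlist.xml", sources=None):
    """Return {source: [useragent patterns]} from the proposed XML format.

    If sources is given (e.g. {"COUNTER", "PLoS"}), only those <robot-list>
    elements are read; otherwise every list in the file is returned.
    """
    root = ET.parse(path).getroot()
    exclusions = {}
    for robot_list in root.findall("robot-list"):
        source = robot_list.get("source")
        if sources is not None and source not in sources:
            continue
        exclusions[source] = [ua.text for ua in robot_list.findall("useragent")]
    return exclusions

if __name__ == "__main__":
    for source, patterns in load_exclusions().items():
        print(source, "->", len(patterns), "patterns")
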