...

OAI-PMH is a relatively lightweight protocol that does not allow for bidirectional traffic. If more reliable error handling is required, the Standardised Usage Statistics Harvesting Initiative (SUSHI, http://www.niso.org/schemas/sushi/) must be used. SUSHI was developed by NISO (National Information Standards Organization) in cooperation with COUNTER. This document assumes that the communication between the aggregator and the usage data provider takes place as explained in figure 4.

Figure 4.

The interaction commences when the log aggregator sends a request for a report on the daily usage of a certain repository. Two parameters must be sent as part of this request: (1) the date of the report and (2) the file name of the most recent robot filter. The filename mentioned in this request will be compared to the filename of the robot filter used locally by the repository. Four possible responses can be returned by the repository.
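As a rough illustration, the exchange could look like the minimal sketch below. The endpoint, parameter names and HTTP transport are assumptions, not part of the specification; the actual request format and the four possible responses are defined by the protocol bindings described elsewhere in this document.

Code

import datetime
import urllib.parse
import urllib.request

REPOSITORY_ENDPOINT = "https://repository.example.org/usage"  # hypothetical endpoint

def request_daily_report(report_date: datetime.date, robot_filter_filename: str) -> bytes:
    """Request the usage report of one day from the repository.

    The repository is expected to compare robot_filter_filename with the
    name of the robot filter it uses locally and to answer with one of the
    four possible responses described in this section.
    """
    params = urllib.parse.urlencode({
        "date": report_date.isoformat(),        # (1) the date of the report
        "robot_filter": robot_filter_filename,  # (2) file name of the most recent robot filter
    })
    with urllib.request.urlopen(f"{REPOSITORY_ENDPOINT}?{params}") as response:
        return response.read()

# Example call (illustrative only):
# request_daily_report(datetime.date(2009, 6, 30), "robotlist-v12.txt")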

...

Info
Definition of a "robot" according to robotstxt.org

A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced.
- http://www.robotstxt.org/faq/what.html, also used as a definition by Geens (2006) and Heinonen (1996)

6.2.2. Strategy

It was decided to make a distinction between two 'layers' of robot filtering (see also figure 5):

  1. Local repositories should make use of a "core" list of robots. It was agreed that such a list can probably be created quite easily by combining entries from the lists used by COUNTER, AWStats, Universidade do Minho and PLoS (see the sketch after this list). This basic list will filter about 80% of all automated visits.
  2. Dedicated service providers can carry out more advanced filtering, based on sophisticated algorithms. The specification of these more advanced heuristics will be a separate research activity. Centralised heuristics should improve confidence in the reliability of the statistics.
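As an illustration of how such a core list could be assembled, the sketch below merges several source lists into one deduplicated file. The file names and the one-pattern-per-line format are assumptions; the actual schema of the list is discussed in section 6.2.3.

Code

# Illustrative only: merge several existing robot lists (e.g. those used by
# COUNTER, AWStats, Universidade do Minho and PLoS) into one deduplicated
# "core" list. File names and the one-pattern-per-line format are assumptions.
def build_core_robot_list(source_files, output_file="core_robot_list.txt"):
    patterns = set()
    for path in source_files:
        with open(path, encoding="utf-8") as source:
            for line in source:
                entry = line.strip()
                if entry and not entry.startswith("#"):  # skip blank lines and comments
                    patterns.add(entry)
    with open(output_file, "w", encoding="utf-8") as out:
        out.write("\n".join(sorted(patterns)) + "\n")
    return sorted(patterns)

# build_core_robot_list(["counter.txt", "awstats.txt", "minho.txt", "plos.txt"])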


Figure 5.

Internet robots will be identified by comparing the value of the User-Agent HTTP header to the regular expressions in a list of known robots that is managed by a central authority. All entries are expected to conform to the definition of a robot as provided in section 5.2.1. All institutions that send usage data must first check each event against this list of internet robots. If the user agent matches a robot in the list, the event should not be sent. It has been decided not to filter robots on their IP addresses, because IP addresses change very regularly and this would make the list very difficult to maintain.
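A minimal sketch of this first filtering layer is given below, assuming the central list is available as a file with one regular expression per line (the actual schema is specified in section 6.2.3):

Code

import re

# Load the centrally managed robot list (assumed here to contain one regular
# expression per line) and suppress events whose User-Agent matches any entry.
def load_robot_patterns(path):
    with open(path, encoding="utf-8") as robot_list:
        return [re.compile(line.strip(), re.IGNORECASE)
                for line in robot_list if line.strip()]

def is_robot(user_agent, patterns):
    """Return True if the event should not be sent to the aggregator."""
    return any(pattern.search(user_agent) for pattern in patterns)

# patterns = load_robot_patterns("core_robot_list.txt")
# is_robot("Googlebot/2.1 (+http://www.google.com/bot.html)", patterns)  # -> True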

Note

In the study by Geens (2006), using the user agent field in the log file resulted in a recall of merely 26.56%, with a precision of 100%.

As an alternative, identifying robots by analyzing the following 4 components resulted in a recall of 73%, with a precision of 100%:

  • identifying clients that access robots.txt, which are usually bots rather than normal users
  • matching against an IP-address list of known bots
  • matching on the user agent field
  • identifying clients that request pages with the HEAD method, whereas normal users use the GET method

More alternatives are given in the report; perhaps an interesting read.

- Max Kemman
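Purely as an illustration of the four signals mentioned in the note above (and not the method used in the Geens study itself), a combined check on a single parsed log record might look like this; the record fields, the IP list and the keyword list are assumptions:

Code

# Illustration of the four signals above, applied to one parsed log record.
KNOWN_BOT_IPS = {"192.0.2.10", "198.51.100.23"}    # hypothetical list of known bot addresses
BOT_AGENT_KEYWORDS = ("bot", "crawler", "spider")  # hypothetical user agent keywords

def looks_like_robot(record):
    user_agent = record.get("user_agent", "").lower()
    return (
        record.get("path") == "/robots.txt"                   # 1. accessed robots.txt
        or record.get("ip") in KNOWN_BOT_IPS                  # 2. IP address of a known bot
        or any(k in user_agent for k in BOT_AGENT_KEYWORDS)   # 3. user agent field
        or record.get("method") == "HEAD"                     # 4. HEAD request instead of GET
    )

# looks_like_robot({"ip": "203.0.113.5", "path": "/article/1", "method": "GET",
#                   "user_agent": "Mozilla/5.0"})  # -> False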

6.2.3. Robot list schema

The robot list must meet the following requirements:

...