
Document information

Date published:
Excerpt: Write an excerpt here

(Optional information)


This page is maintained by: KE Usage Statistics Work Group

Document History


Version history



First draft, based on technical specifications from the OA-Statistics project (written by Daniel Metje and Hans-Werner Hilse), the NEEO project (written by Benoit Pauwels) and the SURE project (written by Peter Verhaar and Lucas van Schaik)

Peter Verhaar


The abstract describes what the application profile is about. It should contain a problem definition, the standards described by the application profile and the goal of the application profile.

From here, content can be added. Remember to start chapters with {anchor:Chaptername} and include [#Chaptername] in the Table of contents.
XML may be added between {code:xml|collapse=true|linenumbers=true} and {code} tags.


The impact or the quality of academic publications is traditionally measured by considering the number of times the text is cited. Nevertheless, the existing system for citation-based metrics has frequently been the target of serious criticism. Citation data provided by ISI focus on published journal articles only, and other forms of academic output, such as dissertations or monographs, are mostly neglected. In addition, it normally takes a long time before citation data become available, because of publication lags. As a result of this growing dissatisfaction with citation-based metrics, a number of research projects have begun to explore alternative methods for the measurement of academic impact. Many of these initiatives have based their findings on usage data. An important advantage of download statistics is that they can readily be applied to all electronic resources, regardless of their contents. Whereas citation analyses only reveal usage by authors of journal articles, usage data can in theory be produced by any user. An additional benefit of measuring impact via the number of downloads is that usage data become available directly after the document has been placed on-line.

Virtually all web servers that provide access to electronic resources record usage events as part of their log files. Such files usually provide detailed information on the documents that have been requested, on the users that have initiated these requests, and on the moments at which these requests took place. One important difficulty is that these logs are usually structured according to a proprietary format. Before usage data from different institutions can be compared in a meaningful and consistent way, the log entries need to be standardised and normalised. Various projects have investigated how such data harmonisation can take place. In the MESUR project, usage data have been standardised by serialising the information from log files as XML files structured according to the OpenURL Context Objects schema (Bollen and Van de Sompel, 2006). The same standard is recommended in the JISC Usage Statistics Final Report. Using this metadata standard, it becomes possible to set up an infrastructure in which usage data are aggregated within a network of distributed repositories. The PIRUS-I project (Publishers and Institutional Repository Usage Statistics), which was funded by JISC, has investigated how such exchange of usage data can take place. An important outcome of this project was a range of scenarios for the "creation, recording and consolidation of individual article usage statistics that will cover the majority of current repository installations".
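As a concrete illustration of the normalisation step described above, the sketch below parses one web server log entry into a flat record that could later be serialised to a shared format such as OpenURL Context Objects. This is an assumption-laden example, not part of the guidelines: it presumes the Apache combined log format, and the field names are purely illustrative.

```python
import re
from datetime import datetime

# Illustrative pattern for the Apache combined log format (an assumption;
# real repository logs may use a different, proprietary layout).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def normalise_entry(line: str):
    """Return a normalised usage record as a dict, or None if the line does not parse."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    return {
        "requester_ip": m.group("ip"),
        "timestamp": datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z"),
        "document": m.group("path"),
        "status": int(m.group("status")),
        "referrer": m.group("referrer"),
        "user_agent": m.group("agent"),
    }
```

In a full implementation, a record like this would then be mapped onto the agreed XML serialisation before being exposed for harvesting.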

"Developing a global standard to enable the recording, reporting and consolidation of online usage statistics for individual journal articles hosted by institutional repositories, publishers and other entities (Final Report)",  p.3. < >

In Europe, at least three projects have experimented with these recommendations and have actually implemented infrastructures for the central accumulation of usage data. Firstly, the German OA-Statistics project, which is funded by DINI (Deutsche Initiative für Netzwerkinformation), has set up an infrastructure in which various certified repositories across Germany can exchange their usage data. In the Netherlands, the project Statistics on the Usage of Repositories (SURE) has a very similar objective. The project, which was funded by SURFfoundation, aimed to find a method for the creation of reliable and mutually comparable usage statistics and has implemented a national infrastructure for the accumulation of usage data. Thirdly, the Network of European Economists Online (NEEO) is an international consortium of 18 universities which maintains a subject repository that provides access to the results of economic research. As part of this project, extensive guidelines have been developed for the creation of usage statistics.

Although these three projects all make use of the OpenURL Context Object standard, subtle differences have emerged in the way in which the standard is actually used. It is nevertheless important to ensure that statistics are produced in exactly the same manner, since otherwise it would be impossible to compare metrics produced by different projects. With the support of Knowledge Exchange, a collaborative initiative of leading national organisations in Europe, an initiative was begun to align the technical specifications of these various projects. This document is a first proposal for international guidelines for the accumulation and the exchange of usage data. The proposal is based on a careful comparison of the technical specifications that have been developed by these three projects.

Terminology and definitions

A usage event takes place when a user downloads a document which is managed in a repository, or when a user views the metadata that is associated with this document. The user may have arrived at this document through the mediation of a referrer. This is typically a search engine. Alternatively, the request may have been mediated by a link resolver. The usage event in turn generates usage data.
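The terminology above can be made concrete with a minimal data model. The sketch below is illustrative only: the class and field names are assumptions, not part of the guidelines, and a real profile would express the same information in the agreed XML serialisation.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Hypothetical model of a usage event: which document was requested,
# whether it was a download or a metadata view, when it happened, and
# which referrer (e.g. a search engine or link resolver) mediated it.
@dataclass
class UsageEvent:
    document_id: str           # identifier of the requested document
    event_type: str            # "download" or "metadata_view"
    timestamp: datetime        # moment at which the request took place
    referrer: Optional[str]    # mediating service, if any

event = UsageEvent(
    document_id="oai:repository.example.org:1234",   # placeholder identifier
    event_type="download",
    timestamp=datetime(2010, 2, 1, 12, 0, tzinfo=timezone.utc),
    referrer="http://www.google.com",
)
```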

The institution that is responsible for the repository that contains the requested document is referred to as a usage data provider. Data can be stored locally in a variety of formats. Nevertheless, to allow for a meaningful central collection of data, usage data providers must be able to expose the data in a standardised data format, so that they can be harvested and transferred to a central database. The institution that manages the central database is referred to as the usage data aggregator. The data must be transferred using a well-defined transfer protocol. Ultimately, certain services can be built on the basis of the data that have been accumulated.

Strategy for Usage Statistics Exchange

The PIRUS project identified three strategies for exchanging usage statistics.

The Knowledge Exchange Work Group has agreed on the following:

Usage Statistics Exchange Strategy B will be used.

In this strategy, usage events are stored in the local repository and are harvested on request by a central server.

In particular, this means that normalisation does not have to be done by the repositories, which will make implementation and acceptance easier.

Strategy for Normalisation on Usage Statistics

Normalisation is done in two locations. At the repository site, usage events are filtered for robot activity in order to reduce data traffic; the remaining normalisation takes place at the central server.


Robot Filtering

The filter is implemented at the repository and is meant to reduce traffic by 80%. The list is basic and simple, and is not meant to filter out false positives. Heuristic analysis according to a transparent algorithm is NOT done by the repositories, but by the central server in a later phase. Centralised heuristics will reduce the debate on reliability.

Definition of a "robot"

This section answers the question: when do we classify a web agent as a robot?

Robot filter list expression agreements

  • Robots are NOT expressed by IP address, but by name.
    • The reason is that the IP list might get too big, and IP addresses might change more often than names.
    • Also this list is NOT
  • data format: plain text file with UTF-8 character set
  • naming: robotlist.txt
  • each line in the file represents a robot
  • regular expressions can be used in each line according to ISO....
  • versioning
  • Web reference: a "Cool URI" needs to be used to refer to the robot list. Yet to be determined!
    • The web location (URL) of the robot list is also yet to be determined!
    • The maintenance of the web reference is done at the moment when a new version of the list is minted
    • The maintenance of the robot list is done by: yet to be determined!
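A repository-side filter following these agreements could look like the sketch below. It assumes, as stated above, that robotlist.txt contains one regular expression per line (UTF-8) and that each pattern is matched against the User-Agent string of a usage event; the function names and the sample patterns are illustrative, not drawn from any agreed list.

```python
import re

def load_robot_patterns(lines):
    """Compile one case-insensitive regular expression per non-empty line
    of robotlist.txt (each line represents one robot)."""
    return [re.compile(line.strip(), re.IGNORECASE)
            for line in lines if line.strip()]

def is_robot(user_agent: str, patterns) -> bool:
    """True if any robot pattern matches somewhere in the User-Agent string."""
    return any(p.search(user_agent) for p in patterns)

# Sample patterns for illustration only; the real list is maintained centrally.
patterns = load_robot_patterns(["googlebot", "msnbot", r".*crawler.*"])
```

Usage events whose User-Agent matches the list would simply be dropped before the data are exposed for harvesting; the finer heuristic analysis remains the task of the central server.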

Data format for Usage Events



Transfer Protocol

PIRUS Scenario B: with OAI-PMH
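Under Scenario B, the aggregator pulls usage events from each repository with standard OAI-PMH requests. The sketch below builds a ListRecords request URL; the base URL and the metadataPrefix value are placeholders (the actual metadata format for usage events is specified elsewhere in these guidelines), while the rule that a resumptionToken must be sent without other arguments comes from the OAI-PMH protocol itself.

```python
from urllib.parse import urlencode

def build_listrecords_url(base_url: str, metadata_prefix: str,
                          resumption_token=None) -> str:
    """Build an OAI-PMH ListRecords request URL for harvesting usage events."""
    if resumption_token:
        # Per OAI-PMH, a resumptionToken is an exclusive argument:
        # it may not be combined with metadataPrefix or other selectors.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)

# Placeholder endpoint and prefix, for illustration only.
url = build_listrecords_url("http://repository.example.org/oai", "ctxo")
```

The aggregator would issue this request periodically, follow resumption tokens until the response is exhausted, and store the harvested events in the central database.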
