Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

KE is a co-operative effort between JISC, SURF, DEFF and DFG. International interoperability guidelines for the comparable exchange of usage data is one of these co-operative efforts.

Warning

This wiki has seen a lot of discussion lately with inline comments. For guidelines of how to place such comments, please read the ?guidelines for comments.

-- Max Kemman, 26-05-2010


Section
Column
width50%

Table of contents

Table of Contents
maxLevel3
Column
width50%
Wiki Markup

h1. Document information

| *Title:* KE Usage Statistics Guidelines \\
*Subject:* Usage Statistics,  Guidelines, Repositories, Publications, Research Intelligence \\
*Moderator:* Peter Verhaar ([KE Usage Statistics Work Group|standards:KE Usage Statistics Guidelines Work group]) \\
*Version:* 0.9.5 \\
*Date published:* 2010-05-18 \\
*Excerpt*: {excerpt}Guidelines for the exchange of usage statistics from a repository to a central server using OpenURL Context Objects via OAI-PMH and OpenURL Context Objectsor SUSHI.{excerpt}
\\  
\\
(Optional information) \\
*Type:* Guidelines, [info:eu-repo/semantics/technicalDocumentation |http://purl.org/info:eu-repo/semantics/technicalDocumentation]\\
*Format:* html/text \\
*Identifier:* [http://purl.org/REP/standards/KE Usage Statistics Guidelines] \\
*Language:* EN \\
*Rights:* CC-BY \\
*Tags:*{page-info:labels}\\ |

Date

version

Author

Description

PDF

2010-03-25

0.1

Peter Verhaar

First draft, based on technical specifications from the OA-Statistics project (written by Daniel Metje and Hans-Werner Hilse), the NEEO project (witten by Benoit Pauwels) and the SURE project (written by Peter Verhaar and Lucas van Schaik)

 

2010-04-13

0.2

Maurice Vanderfeesten

Added the sections based on the Knowledge Exchange meeting in Berlin. And filled in some additional information to these sections.

 

2010-04-24

0.9

Peter Verhaar

Revised version, in which comments made by Benoit Pauwels, Hans-Werner Hilse, Thobias Schäfer, Daniel Metje and Paul Needham have been incorporated.

 

2010-05-18

0.9.5

Peter Verhaar

DRAFT version 1.0 ; comments and layout improvements made by Jochen Schirrwagen, Max Kemman, Peter Verhaar and Maurice Vanderfeesten

 

Abstract

Guidelines for the exchange of usage statistics from a repository to a central server using OAI-PMH and OpenURL Context Objects.

2010-05-26

1.0

Peter Verhaar

Definitive version 1.0.; Comments made during the phone conference which took place on 25-05-2010 and which was attended by Thobias Schäfer, Hans-Werber Hilse, Jochen Schirrwagen, Marek Imialek, Paul Needham, Peter Verhaar, Maurice Vanderfeesten, Natalia Manola and Lefteris

 

Abstract

Guidelines for the exchange of usage statistics from a repository to a central server using OAI-PMH and OpenURL Context Objects.

Info

This page is maintained by: KE

Info

This page is maintained by: KE Usage Statistics Work Group

...

Project       

Status       

SURF Sure

 

PIRUS2

 

OA-Statistics  

 

NEEO

 

COUNTER

 

PLoS

 


Anchor
introduction
introduction

...

Virtually all web servers that provide access to electronic resources record usage events as part of their log files. Such files usually provide detailed information on the documents that have been requested, on the users that have initiated these requests, and on the moments at which these requests took place. One important difficulty is the fact that these log files are usually structured according to a proprietary format. Before usage data from different institutions can be compared in a meaningful and consistent way, the log entries need to be standardised and normalised. Various projects have investigated how such data harmonisation can take place. In the MESUR project, usage data have been standardised by serialising the information from log files into XML files structured according to the OpenURL Context Objects schema (Bollen and Van de Sompel, 2006). This same standard is recommended in the JISC Usage Statistics Final Report. Using this metadata standard, it becomes possible to set up an infrastructure in which usage data are aggregated within a network of distributed repositories. The PIRUS-I project (Publishers and Institutional Repository Usage Statistics), which was funded by JISC, has investigated how such exchange of usage data can take place. An important outcome of this project was a range of scenarios for the "creation, recording and consolidation of individual article usage statistics that will cover the majority of current repository installations" "Developing a global standard to enable the recording, reporting and consolidation of online usage statistics for individual journal articles hosted by institutional repositories, publishers and other entities (Final Report)", p.3. <http://www.jisc.ac.uk/media/documents/programmes/pals3/pirus_finalreport.pdf>.

In Europe, at least three four projects have experimented with these recommendations and have actually implemented an infrastructure for the central accumulation of usage data: The German OA-Statistics <http://www.dini.de/projekte/oa-statistik

  1. PIRUS is a project which aims to develop COUNTER-compliant usage reports at the individual article level that can be implemented by any entity (publisher, aggregator, IR, etc.,) that hosts online journal articles. The project will enable the usage of research outputs to be recorded, reported and consolidated at a global level in a standard way. 
  2. Te German OA-Statistics <http://www.dini.de/projekte/oa-statistik/> project, which is funded DINI (Deutsche Initiative für Netzwerk Information), has set up an infrastructure in which various certified repositories across Germany can exchange their usage data.
  3. In the Netherlands, the project Statistics on the Usage of Repositories <http://www.surffoundation.nl/nl/projecten/Pages/SURE.aspx> (SURE) has a very similar objective. The project, which is funded by SURFfoundation, aimed to find a method for the creation of reliable and mutually comparable usage statistics and has implemented a national infrastructure for the accumulation of usage data.

...

  1. The Network of European Economists Online <http://www.neeoproject.eu/> (NEEO) is an international consortium of 18 universities which maintains a subject repository that provides access to the results of economic research. As part of this project, extensive guidelines have been developed for the creation of usage statistics. NEEO has also developed an aggregator for usage statistics. The central database is exposed via a web service which can provide information on the number of downloads for each publication.

Whereas these three four projects all make use of the OpenURL Context Object standard, some subtle differences have emerged in the way in which this standard is actually used. Nevertheless, it is important to ensure that statistics are produced in exactly the same manner, since, otherwise, it would be impossible to compare metrics produced by different projects. With the support of Knowledge Exchange, a collaborative initiative for leading national science organisations in Europe, an initiative was started to align the technical specifications of these various projects. This document is a first proposal for international guidelines for the accumulation and the exchange of usage data. The proposal is based on a careful comparison of the technical specifications that have been developed by these three projects.

...

This results in the schema for SURE. See figure 2. The context of this document is limited to the usage data, as sent from the institutional repository to the log harvester.

Image AddedImage Removed
Figure 2.

4. Data format

...

A distinction will be made between the core set and extensions. Data in the core set can be recorded using standard elements or attributes that are defined in the OpenURL Context Object schema. The extensions are created to record aspects of usage events which cannot be captured using the official schema. They have usually been defined in the context of individual projects to meet very specific demands. Nevertheless, some of the extensions may be relevant for other projects as well. They are included here to inform the usage statistics community what additional information could be made available. Naturally, the implementation of all the extension elements are is optional.

Note

There are also other profiles we coud incorporate the best practices from, see http://alcme.oclc.org/openurl/docs/pdf/SanAntonioProfile.pdf

...

Description

An identification of a specific usage event.

XPath

ctx:context-object/@identifier

Usage

Optional

Format

No requirements are given for the format of the identifier. If this optional identifier is used, it must be (1) opaque and (2) unique for a specific usage event.
In the Netherlands a MD5 Hash is made from the concatenated string combining the code of the institution, identifier of the publication and the timestamp.
Other projects use different ways to create an identifier, as long as it is globally unique.

Example

b06c0444f37249a0a8f748d3b823ef2a

Note

This must be mandatory, at least to identify the repository. This can be done for example to use the repository name as a prefix prior to the opaque identifier. This could be categorised as provenance information. - Jochen

Note

Usage events are good to use as a control mechanism for the aggregation database to check if duplicate events are harvested. The provenance information of the repository can be found in the <resolver> element, see section 4.1.6. - Peter

Occurences of child elements in Occurences of child elements in <context-object>

Within a <context-object> element, the following subelements can be used:

Element name

minOccurs

maxOccurs

Referent

1

1

ReferringEntity

0

1

Requester

1

1

ServiceType

1

1

Resolver

1

1

Referrer

0

1

Note

Just a note: If we make this schema more restrictive, we diverge from the original schema. - Jochen

Note

This is explained in the beginning of section 4. Dataformat

4.1.2. <referent>

 

4.1.2. <referent>

The <referent> element must provide information on the document that is requested. More specifically, it must record the following data elements.

...

Description

A globally unique identification of the resource that is requested must be provided if there is one that is applicable to the document. Identifiers should be independent of the be 'communication protocol'-independent as much as possible. In the case of a request for an object file, the identifier should enable the aggregator to obtain the object's associated metadata file.

XPath

ctx:context-object/ctx:referent/ctx:identifier

Usage

Mandatory if applicable

Format

URI

Example

http://hdl.handle.net/1887/12100

...

What worries me that I don't see such a list there. Should this not state: "A referrer name must be uniquely identifier using a Source Identifier (info:sid)

.

Concatenating the info:sid namespace with the internet domain name, subdomain name or host name. "Question: When such list exists, must the application check this string with this list? - Maurice

Description

The referrer may be categorised on the basis of a limited list of known referrers. All permitted values will be registered in the OpenURL registry.

Note

XPath

ctx:referring-entity/ctx:identifier

Usage

Optional

Format

A URI that is registered in http://info-uri.info/registry/OAIHandler?verb=GetRecord&metadataPrefix=reg&identifier=info:sid/

Example

info:sid/google

Note

Peter, the info:sid documentation tells us that this name may refer to an Internet domain name, subdomain name, or host name. The examples at indo-uri.info show

  • info:sid/amazon.co.uk
  • info:sid/oclc.org
  • info:sid/amazon.co.uk:books
  • info:sid/oclc.org:inspec
    Should the example used above not also be info:sid/google.com ?
    I am unfamiliar on this terrain, maybe I am wrong and info:sid/google is allowed.
    I don't know what is in a apache logfile that can be easily used to transform in an info:sid. Should a real example not be reflecting something like info:sid/www.google.nl/search?q="Who+controls+the+internet" ? (in other words, how intelligent do implementers need to make their software for creating a info:sid ?)
  • or should this be stripped down to info:sid/www.google.nl
  • or even without the subdomain www info:sid/google.nl
  • or must the top domain be changed to .com info:sid/google.com
  • or indeed only the domain name, without the top-domain info:sid/google
  • or only the host info:sid/www
    I know I am just out of my league here, but maybe other people who implement these guidelines have similar questions in the future.
    Maybe Benoit knows more about this.
    All the best
    -- Maurice

 

 

4.1.4. <requester>

The user who has sent the request for the file is identified in the <requester> element.

...

Description

The user can be identified by providing the IP-address. Including the full IP-address in the description of a usage event is not permitted by international privacy laws. For this reason, the IP-address needs to be obfuscated. The IP-address must be hashed using MD5 encryption. MD5 encryption of IP addresses can easily be hacked. The question if such MD5 encryption secures the privacy sufficiently warrant further research by legal advisors.

XPath

ctx:context-object/ctx:requester/ctx:identifier

Usage

Mandatory

Format

A data-URI, consisting of the prefix "data:,", followed by a 32-digit hexadecimal number.

Example

data:,c06f0464f37249a0a9f848d4b823ef2a

...

the privacy sufficiently warrant further research by legal advisors. The IP address of the requester is pseudonymised using encryptions, before it is exchanged and taken outside the web-server to another location. Therefore individual users can be recognised when aggregated from distributed repositories, but cannot be referred back to a 'natural person'. This method may seem consisted with the European Act for Protection of Personal data. The summary can be found here: ?

http://europa.eu/legislation_summaries/information_society/l14012_en.htmImage Added

. Further legal research needs to be done if this method is sufficient to protect the personal data of a 'natural person', in order to operate within the boundaries of the law.

...

XPath

ctx:context-object/ctx:requester/ctx:identifier

Usage

Mandatory

Format

A data-URI, consisting of the prefix "data:,", followed by a 32-digit hexadecimal number.

Example

data:,c06f0464f37249a0a9f848d4b823ef2a

 

 

<requested/identifier> | C-class Subnet

...