...

The institution that is responsible for the repository that contains the requested document is referred to as a usage data provider. Data can be stored locally in a variety of formats, but to allow for a meaningful central collection of data, usage data providers must be able to expose the data in a standardised data format, so that they can be harvested and transferred to a central database. The institution that manages the central database is referred to as the usage data aggregator. The data must be transferred using a well-defined transfer protocol. The data aggregator harvests each usage data provider at least once a day, and bears the primary responsibility for synchronising the local and the central data. Ultimately, certain services can be built on the basis of the data that have been accumulated.
The approach that is proposed here coincides largely with scenario B that is described in the final report of PIRUS1. In this scenario, "the generated OpenURL entries are sent to a server hosted locally at the institution, which then exposes those entries via the OAI-PMH for harvesting by an external third party".
The main advantage of this strategy is that normalisation does not have to be carried out by individual repositories. Once the data have been received by the log aggregator, the normalisation rules can be applied consistently to all data. Since local repositories only need to make sure that their data can be exposed for harvesting, the implementation should be much easier.

Note

Please add a reference to the PIRUS1 outcome, and give a little more detail about the three scenarios envisioned by PIRUS1.

...

A distinction will be made between the core set and extensions. Data in the core set can be recorded using standard elements or attributes that are defined in the OpenURL Context Object schema. The extensions are created to record aspects of usage events which cannot be captured using the official schema. They have usually been defined in the context of individual projects to meet very specific demands. Nevertheless, some of the extensions may be relevant for other projects as well. They are included here to inform the usage statistics community what additional information could be made available. Naturally, the implementation of all the extension elements is optional.

Warning

There are also other profiles from which we could incorporate best practices.

see http://alcme.oclc.org/openurl/docs/pdf/SanAntonioProfile.pdf

4.1. Core set

4.1.1. <context-object>

...

Description

An identification of a specific usage event.

XPath

ctx:context-object/@identifier

Usage

Optional

Format

No requirements are given for the format of the identifier. If this optional identifier is used, it must be (1) opaque and (2) unique for a specific usage event.

Example

b06c0444f37249a0a8f748d3b823ef2a
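
As a minimal sketch of how such an identifier could be produced: the guidelines do not prescribe a generation method, so the use of a random UUID and the function name `new_event_identifier` are assumptions for illustration only. A random value is opaque because it carries no information about the event, and it is unique for practical purposes.

```python
import uuid


def new_event_identifier() -> str:
    """Return an opaque identifier for a single usage event.

    Assumption: a random (version 4) UUID is used. The guidelines only
    require that the identifier is opaque and unique per usage event.
    """
    # .hex gives 32 lowercase hexadecimal characters without dashes.
    return uuid.uuid4().hex


if __name__ == "__main__":
    print(new_event_identifier())
```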

Warning

This must be mandatory, at least to identify the repository. This can be done, for example, by using the repository name as a prefix to the opaque identifier.

This could be categorised as provenance information.

-jochen

Occurrences of child elements in <context-object>

...

Element name       minOccurs   maxOccurs
Referent           1           1
ReferringEntity    0           1
Requester          1           1
ServiceType        1           1
Resolver           1           1
Referrer           0           1
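
The occurrence constraints listed above could be checked with a short script along the following lines. This is only a sketch: the function name `check_occurrences` is hypothetical, and the lowercase, hyphenated element names (for example `referring-entity`) are assumed to be the XML element names that correspond to the entity names in the table. The namespace is the `metadataNamespace` value given in section 5.1.

```python
import xml.etree.ElementTree as ET

# Namespace of the OpenURL Context Object XML format.
CTX_NS = "info:ofi/fmt:xml:xsd:ctx"

# (minOccurs, maxOccurs) per child element of <context-object>.
# Assumption: these XML element names correspond to the entity names
# listed in the table.
OCCURRENCES = {
    "referent": (1, 1),
    "referring-entity": (0, 1),
    "requester": (1, 1),
    "service-type": (1, 1),
    "resolver": (1, 1),
    "referrer": (0, 1),
}


def check_occurrences(context_object_xml: str) -> list:
    """Return a list of human-readable problems; empty if all is well."""
    root = ET.fromstring(context_object_xml)
    problems = []
    for name, (minimum, maximum) in OCCURRENCES.items():
        # Count only direct children of the root element.
        count = len(root.findall(f"{{{CTX_NS}}}{name}"))
        if not minimum <= count <= maximum:
            problems.append(
                f"{name}: found {count}, expected {minimum}..{maximum}"
            )
    return problems
```

A check of this kind does not replace validation against the official schema; it only covers the stricter occurrence rules described here.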

Note

Just a note: If we make this schema more restrictive, we diverge from the original schema. -Jochen

4.1.2. <referent>

The <referent> element must provide information on the document that is requested. More specifically, it must record the following data elements.

...

Description

The request type specifies if the request is for an object file or a metadata record.

XPath

ctx:context-object/ctx:service-type/ctx:metadata-by-val/ctx:metadata/dcterms:type

If this element is used, the <metadata> element must be preceded by

ctx:requester/ctx:metadata-by-val/ctx:format
with value
"http://dublincore.org/documents/2008/01/14/dcmi-terms/"

Inclusion

Mandatory

Format

Two values are allowed:

Example

info:eu-repo/semantics/objectFile

Note

In the core set this is a mandatory field that allows these two values. In the future, however, extensions may add other optional values from a controlled vocabulary, for example for datasets.

4.1.6. <resolver> and <referrer>

...

User Agent

Description

The full HTTP user agent string

XPath

ctx:context-object/ctx:requester/ctx:metadata/dini:requesterinfo/dini:classification/dini:user-agent

If this element is used, the <metadata> element must be preceded by

ctx:requester/ctx:metadata-by-val/ctx:format
with value
"http://dini.de/namespace/oas-requesterinfo"

Usage

Optional

Format

String

Example

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.6) Gecko/2009011913 Firefox/3.0.6 (.NET CLR 3.5.30729) Firefox/3.0.6 (.NET CLR 3.5.30729)

4.3 Legal issues

Usage of IP addresses and the protection of a 'natural person'

The IP address of the requester is pseudonymised using encryption before it is exchanged and taken outside the web server to another location. Individual users can therefore still be recognised when data from distributed repositories are aggregated, but they cannot be traced back to a 'natural person'. This method seems consistent with the European Act for Protection of Personal Data. The summary can be found here: http://europa.eu/legislation_summaries/information_society/l14012_en.htm. Further legal research is needed to determine whether this method is sufficient to protect the personal data of a 'natural person', in order to operate within the boundaries of the law.
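
The following sketch illustrates one way the pseudonymisation step could work. The text above does not name an algorithm, so the choice of a keyed hash (HMAC-SHA256, which is strictly speaking a one-way hash rather than encryption), the function name `pseudonymise_ip` and the idea of a key shared between repositories are all assumptions. The same IP address always maps to the same pseudonym, so requests can still be grouped after aggregation, while the address itself is not exposed.

```python
import hashlib
import hmac


def pseudonymise_ip(ip_address: str, secret_key: bytes) -> str:
    """Return a stable pseudonym for an IP address.

    Assumption: HMAC-SHA256 with a secret key. Repositories would need
    to use the same key if the same requester is to be recognised
    across repositories.
    """
    digest = hmac.new(secret_key, ip_address.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


if __name__ == "__main__":
    # Hypothetical key, for demonstration only.
    key = b"example-key-not-for-production"
    print(pseudonymise_ip("192.0.2.17", key))
```

Whether a scheme of this kind satisfies the legal requirements is exactly the open question raised above; the sketch makes no claim in that respect.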

5. Transfer Protocols

5.1. OAI-PMH

...

  • Usage of Sets (see OAI-PMH, 2.7.2)
    OAI-PMH optionally allows for structuring the offered data in "sets" to support selective harvesting of the data. Currently, this possibility is not further specified in these guidelines. Future refinements may use this feature, e.g. for selecting usage data for certain services. Provenance information is already included in the Context Objects.
  • Datestamps, Granularity (see OAI-PMH, 2.7.1; also compare the notes about datestamps in the OAI-PMH record header versus datestamps within the Context Objects)
    The OAI-PMH specification allows for either exact-to-the-second or exact-to-the-day granularity for record header datestamps. The data providers may choose one of these possibilities. The service provider will most certainly rely on overlapping harvesting, i.e. the most recent datestamp of the harvested data is used as the "from" parameter for the next OAI-PMH query. Thus, the data provider will provide some records that have been harvested before. Duplicate records are matched by their identifiers (those in the OAI-PMH record header) and are silently discarded if their datestamp is not renewed (see notes below on deletion tracking).
    It is strongly recommended to implement exact-to-the-second datestamps to keep redundancy of the transferred data as low as possible.
  • Deletion tracking (see OAI-PMH, 2.5.1)
    The OAI-PMH provides functionality for tracking the deletion of records. Compared to the classic use case of OAI-PMH (metadata of documents), the use case presented here concerns data which is not subject to long-term storage. Tracking deletion events therefore does not seem critical, especially since the deletion records themselves would add up to a significant amount of data.
    However, the service provider will accept information about deleted records and will eventually delete the referenced information in its own data store. This way it is possible for data providers to correct wrongly issued data (e.g. in case of technical problems).
    It is important to note that old data which rotates out of the data offered by the data provider due to its age will not be marked as deleted, for storage reasons. This kind of data is still valid usage data, but is no longer visible.
    Whether a data provider uses deletion tracking has to be stated in the response to the "Identify" OAI-PMH query, within the <deletedRecord> field. Currently, the only options are "transient" (when a data provider applies or reserves the possibility for marking deleted records) or "no".
    The possible cases are:
    • Incorrect data which has already been offered by the data provider shall be corrected. There are two possibilities:
      • Re-issuing of a corrected set of data carrying the same identifier in the OAI-PMH record header as the set of data to be corrected, with an updated OAI-PMH record header datestamp.
      • When the correction is a full deletion of the incorrectly issued data, the OAI-PMH record has to be re-issued without a Context Object payload, flagged as deleted (status="deleted" in the OAI-PMH record header) and with an updated datestamp in the record header.
    • Records that fall out of the time frame for which the data provider offers data: These records are silently neglected, i. e. not offered via the OAI-PMH interface anymore, without using the deletion tracking features of OAI-PMH.
  • Metadata formats (see OAI-PMH, 3.4)
    All data providers have to provide support for <context-object> documents or <context-objects> aggregations, respectively.
    This choice also has to be announced by the data provider in the response to the "ListMetadataFormats" query (see OAI-PMH, 4.4). While a specific "metadataPrefix" is not required, the information about "metadataNamespace" and "schema" is fixed for implementations:

<metadataFormat>
  <metadataPrefix>ctxo</metadataPrefix>
  <schema>http://www.openurl.info/registry/docs/xsd/info:ofi/fmt:xml:xsd:ctx</schema>
  <metadataNamespace>info:ofi/fmt:xml:xsd:ctx</metadataNamespace>
</metadataFormat>

Info

Using OAI-PMH, the mandatory metadataPrefix for OpenURL Context Objects will be: "ctxo"
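
The rules on overlapping harvesting, duplicate records and deletion tracking described above can be summarised in a short sketch of the aggregator-side logic. The class `HarvestedRecord`, the function `merge_harvest` and the in-memory dictionary used as a data store are hypothetical names chosen for illustration; a real aggregator would parse these values from the OAI-PMH responses and use persistent storage. The sketch also assumes that all datestamps are UTC strings of the same granularity, so that they can be compared as plain strings.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class HarvestedRecord:
    identifier: str                # identifier from the OAI-PMH record header
    datestamp: str                 # header datestamp, e.g. "2010-03-01T12:00:00Z"
    deleted: bool = False          # True if the header is flagged as deleted
    payload: Optional[str] = None  # Context Object XML; absent when deleted


def merge_harvest(
    store: Dict[str, HarvestedRecord],
    records: List[HarvestedRecord],
    previous_from: Optional[str] = None,
) -> Optional[str]:
    """Apply harvested records to the store.

    Returns the most recent datestamp seen so far, which can serve as
    the "from" parameter of the next harvest.
    """
    latest = previous_from
    for record in records:
        if latest is None or record.datestamp > latest:
            latest = record.datestamp
        known = store.get(record.identifier)
        if known is not None and record.datestamp <= known.datestamp:
            # Same identifier, datestamp not renewed: a duplicate caused
            # by overlapping harvesting, so it is silently discarded.
            continue
        if record.deleted:
            # Correction by full deletion of wrongly issued data.
            store.pop(record.identifier, None)
        else:
            # New record, or a corrected record re-issued with a newer
            # datestamp under the same identifier.
            store[record.identifier] = record
    return latest
```

Records that simply rotate out of the data provider's time frame never appear in a harvest again, so they stay in the aggregator's store untouched, which matches the last case in the list above.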

  • Inclusion of Context Objects in OAI-PMH records
    In line with the definition of XML-encoded Context Objects as the data format for data exchanged via the OAI-PMH, the embedding is to be done in conformance with the OAI-PMH:

...


Individual usage data providers should not filter double clicks. This form of normalisation should be carried out on a central level by the aggregator.

Note

By default, the KE guidelines follow the COUNTER rules, so that the resulting statistics are comparable with those delivered by publishers.
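
A minimal sketch of how the aggregator might filter double clicks centrally is given below. The function name `filter_double_clicks`, the tuple layout of an event, the default window of 10 seconds and the choice to keep the first request of a burst are assumptions made for illustration; the authoritative time windows and retention rules should be taken from the COUNTER Code of Practice.

```python
from datetime import datetime, timedelta


def filter_double_clicks(events, window_seconds=10):
    """Collapse repeated requests by one requester for one document.

    events: iterable of (requester_id, document_id, timestamp) tuples,
    where timestamp is a datetime. Returns the retained events in
    chronological order.
    """
    window = timedelta(seconds=window_seconds)
    last_seen = {}  # (requester_id, document_id) -> time of latest request
    kept = []
    for requester, document, when in sorted(events, key=lambda e: e[2]):
        key = (requester, document)
        previous = last_seen.get(key)
        # Every request moves the window forward, so a rapid burst of
        # clicks is counted only once.
        last_seen[key] = when
        if previous is not None and when - previous <= window:
            continue
        kept.append((requester, document, when))
    return kept


if __name__ == "__main__":
    start = datetime(2010, 3, 1, 12, 0, 0)
    sample = [
        ("user-1", "doc-1", start),
        ("user-1", "doc-1", start + timedelta(seconds=5)),
        ("user-1", "doc-1", start + timedelta(seconds=30)),
    ]
    print(len(filter_double_clicks(sample)))
```

Because the filter needs the requests of all providers in one place and in time order, it fits the aggregator better than the individual repositories.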

6.2. Robot filtering

6.2.1. Definition of a robot

The "user" as defined in section 2 of this report is assumed to be a human user. Consequently, the focus of this document is on requests which have consciously been initiated by human beings. Automated visits by internet robots must be filtered from the data as much as possible.

Note

It would be nice to have some reference. - jochen

6.2.2. Strategy

A distinction is made between two 'layers' of robot filtering:

...