Goals of OA-Statistics
We aim to produce valid and reliable document usage statistics based solely on information gathered from the HTTP layer.
There are two main issues addressed by all existing standards which generate the bulk of the necessary corrections:
- Identification of non-human access
- Multi-Click correction
Besides this, we investigate the amount of data and effort necessary to produce complex statistics, for example, click-streams, without violating privacy laws. At the bottom of this page there is a comparison table including links to all standards mentioned. A detailed description of OA-S can be found at http://www.dini.de/projekte/oa-statistik/#c1203
Usage statistics - and even more important raw usage data - have to be described on an abstract level. It is not sufficient to define a derivative of the Apache Access Log as there is a multitude of different software solutions in use to operate a full text repository. Many do not even produce a log file let alone utilise an Apache Server.
Information needed to generate COUNTER, LogEc and IFABC
Note: The field names might still be subject to change as the project goes on.
OA-S-Fieldname |
Description |
COUNTER |
LogEc |
IFABC* |
---|---|---|---|---|
Document-Identifier |
non-ambigious label identifying the full text |
needed |
needed |
needed |
File Format |
File format of server reply (e.g. HTML orPDF) |
needed |
needed |
needed |
Service Type |
nature of server reply (e.g. full text,ab-stract) |
needed |
needed |
- |
Time of Request |
Time of request processing to the second |
needed |
needed |
needed |
IP |
IP-Adress of user (Client) |
needed |
needed |
IF Session-Identifier is not available: needed |
Session-Identifier |
server generated non-ambiguous session/visit label |
optional |
- |
IF IP is not available: needed |
User Agent |
User-Agent-String of the requesting client |
needed |
needed |
IF Session-ID is not available: needed |
HTTP Status Code |
Server-Status-Code of the HTTP-Requests |
needed |
needed |
needed |
Bytes sent |
server reply size |
- |
- |
IF File Format is not HTML: needed |
Additional pieces of information which comply with OpenURL Context Objects
The following fields are important to our advanced research interests and thus implemented from the beginning.
Referrer |
non-ambigious identifier of the server which created the ContextObject |
---|---|
Referring Entitiy |
non-ambigious label of the object of origin (e.g. the Abstract Page which links to the full text file) |
Additional suggestions
States and properties of the repository software have to be delivered from the available data.
Examples:
- Focus Page in Search Result Paging View
- ID of the current document
- Search arguments and result presentation
- Abstract Page vs. Fulltext Page
- Administrative actions
- Document upload
- Metadata allocation
There should be reliable information about the origin of the client (i.e. the referrer). For example, it should be possible to tell whether a client accessed the file via the frontpage or via a link in the repository's RSS-Feed.
In case of multiple server logs it is mandatory to synchronize the system time on all associated repository servers.
Table of Web Usage Standards
Provider URL |
Counting Clause |
Multi-Click Time Span |
User Identification |
Crawler Clause |
Crawler Identification |
Crawler Count Report |
---|---|---|---|---|---|---|
HTTP Status Code is 200 or 304. |
for HTML 10s; for PDF 30s |
at least IP, preferably Session |
robots, prefetches, caching, federated searches(n.a.) |
Black-List, client HTTP header |
separate report |
|
HTTP Status Code is 200, 206, 301, 302 or 304. |
one calendar month |
IP |
robots, automated downloads (wget) |
Access of robots.txt; # of requests 10,000 items/month; C-Class access 10% of stock; known robot-Domain/IP |
separate column in report |
|
HTTP Status code is 200 on abstract or full-text page |
24 hours |
IP |
search engine crawlers + automated |
AWStats' black list |
discarded |
|
Default: HTTP Status codes (200;304) |
Default: 1 hour |
IP |
search engine crawlers |
Black-List |
separate column in report |
|
HTML: Tracking Pixel; Other: bytes transferred 95% of file size |
Each Pageview is counted only once per visit. Visit means series of clicks coming from one IP-Number/Session-ID less than 30 minutes apart. |
IP+User-Agent; Cookie-Session, Login-Session |
search engine crawlers; automated downloads (optional) |
proprietary Blacklist |
discarded |