Pilot Engineering Repository Xsearch

Case Study: PerX Experience of Harvesting & Utilising Metadata from Oxford Journals

M.Moffat (M.Moffat@hw.ac.uk) - Ver 1.0 (28/03/07)


This brief case study illustrates some of the issues encountered by OAI-PMH service providers attempting to use third-party metadata obtained via OAI-PMH. Note that it describes only the experience of using OAI-PMH metadata from one data provider; a number of other OAI-PMH setup and maintenance challenges exist.

Initial Analysis & Setup of Target

November 2006 - Details of the new Oxford Journals OAI-PMH interface are made available at

Multiple OAI sets are available via the Oxford Journals OAI-PMH interface. For example, there are sets allowing service providers to harvest metadata records for entire journal titles, particular journal volumes or individual journal issues.

Twenty-three Engineering and Mathematics related Oxford Journals titles are identified as relevant by the PerX team.

OAI-PMH set parameters for the relevant journals are extracted from details provided at the Oxford Journals site.
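As an illustration of how set parameters are used, the sketch below builds OAI-PMH ListRecords requests restricted to a single set. The verb, metadataPrefix and set arguments are standard OAI-PMH; the base URL and the setSpec values are hypothetical placeholders, since the actual Oxford Journals endpoint and set names are not given here.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the real Oxford Journals base URL is not given here.
BASE_URL = "https://example.org/oai"


def list_records_url(set_spec, metadata_prefix="oai_dc", base_url=BASE_URL):
    """Build an OAI-PMH ListRecords request limited to one set."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "set": set_spec,
    }
    return base_url + "?" + urlencode(params)


# Hypothetical setSpec values for a journal title and one of its volumes.
for spec in ["journal-a", "journal-a:vol12"]:
    print(list_records_url(spec))
```

One request per set keeps each harvest small and makes it easy to see which set a failure belongs to.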

The initial harvest attempt via the PerX Administrative Interface (PAIN) reveals that the interface is unable to deal adequately with the large number of sets provided by Oxford Journals [this is a limitation of the PerX software]. Technical intervention is required.

The initial test harvest reveals that three of the twenty-three OAI-PMH sets are empty.
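A check for empty sets can be automated. The sketch below is a minimal example, assuming the data provider signals an empty set with the standard OAI-PMH noRecordsMatch error, or returns a response containing no record elements. How the Oxford Journals interface actually reported its empty sets is not stated in this case study, and the sample responses are invented for illustration.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def is_empty_set(response_xml):
    """Return True if a ListRecords response contains no records."""
    root = ET.fromstring(response_xml)
    # The protocol's standard way of reporting "nothing matched".
    for err in root.iter(OAI_NS + "error"):
        if err.get("code") == "noRecordsMatch":
            return True
    # Fall back to checking whether any <record> element is present.
    return next(root.iter(OAI_NS + "record"), None) is None


# Invented sample responses, for illustration only.
EMPTY = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
    '<error code="noRecordsMatch">No records</error></OAI-PMH>'
)
NON_EMPTY = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>'
    "<record><header><identifier>oai:example:1</identifier></header></record>"
    "</ListRecords></OAI-PMH>"
)

print(is_empty_set(EMPTY), is_empty_set(NON_EMPTY))
```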

Oxford Journals is contacted via email regarding the empty sets, and the problem is rectified after three weeks.

The twenty-three relevant sets are successfully harvested by PerX, resulting in approximately 36,000 records.

General safe transforms ('normalisations') are performed on the harvested metadata. These convert UTF-8 encodings, remove spaces between < tags >, remove unnecessary HTML markup, and remove empty metadata elements and double XML encodings.
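Two of these normalisations, undoing double XML encoding and removing empty metadata elements, might look roughly like the sketch below. It is not the PerX implementation: the function names and the sample record are hypothetical, and the empty-element pass is a single pass that treats any element without text or children as empty.

```python
import re
import xml.etree.ElementTree as ET

# Matches entities that were escaped twice, e.g. "&amp;amp;" or "&amp;lt;".
DOUBLE_ENCODED = re.compile(r"&amp;(amp|lt|gt|quot|apos|#\d+|#x[0-9a-fA-F]+);")


def undo_double_encoding(raw_xml):
    """Collapse double-encoded entities in raw XML back to single ones."""
    return DOUBLE_ENCODED.sub(r"&\1;", raw_xml)


def remove_empty_elements(root):
    """Drop child elements that have no text and no children (one pass)."""
    for parent in list(root.iter()):
        for child in list(parent):
            if len(child) == 0 and not (child.text or "").strip():
                parent.remove(child)
    return root


# Hypothetical record, simplified (no namespaces) for illustration.
raw = (
    "<record><title>A &amp;amp; B</title>"
    "<description>  </description><creator/></record>"
)
root = remove_empty_elements(ET.fromstring(undo_double_encoding(raw)))
print(ET.tostring(root, encoding="unicode"))
```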

The Oxford Journals metadata is indexed to enable searching via the PerX cross-search interface.


Metadata Augmentation

Analysis of the Oxford Journals metadata via the PerX cross-search interface reveals the following issues:
  • Approximately 1500 records are found to be 'Book Reviews', which are deemed inappropriate for inclusion in PerX. The metadata associated with these book reviews is not entirely consistent, which makes them harder to remove.

  • There are a number of duplicate metadata records (around 600 metadata records share a URL that should be unique).

  • Unnecessary HTML markup that was not removed by the normalisation process remains in a number of records. This includes &lt;sec&gt;, &lt;it&gt;, &amp; and &apos;

  • Some DOIs (links) in metadata records cannot be resolved.

  • Some DOIs (links) in metadata records link to restricted access items and result in the following message for end users: "This item is Restricted to Maintenance Users Only. Please sign in with your Maintenance user name and password."

  • There appear to be issues with the <dc:date> field for some records. A number of metadata records have the date 2006-12-19 in this field, which does not correspond to the actual date of the article.

  • Around half of the metadata records have no abstract available.
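An anomaly such as the repeated 2006-12-19 date can be spotted by counting how often each <dc:date> value occurs. The sketch below is a hypothetical check rather than the analysis PerX carried out: the threshold and the sample dates are invented for illustration.

```python
from collections import Counter


def suspicious_dates(dates, threshold=0.2):
    """Return dates that account for at least `threshold` of all records."""
    counts = Counter(dates)
    total = len(dates)
    return [d for d, n in counts.most_common() if total and n / total >= threshold]


# Invented sample: one date dominates, the rest occur once each.
sample = ["2006-12-19"] * 5 + ["2001-03-01", "1999-07-15", "2004-11-30"]
print(suspicious_dates(sample))
```

A date shared by an implausibly large share of articles is worth checking by hand against the articles themselves.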

Basic collection-specific transforms are manually conducted on the Oxford Journals collection to address some of these issues. These transforms include:
  • Removal of the majority of 'Book Review' records via pattern matching.

  • Removal of duplicate records (i.e. those with duplicate DOIs).

  • Removal of additional unnecessary HTML markup not removed by the normalisation process.
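The three transforms above can be sketched as a single pass over the harvested records. This is an assumption-laden illustration, not the transforms PerX ran: the record layout (dictionaries with "identifier" and "title" keys), the book review pattern and the sample data are all hypothetical, and a simple pattern like this one would catch only the consistently labelled reviews.

```python
import re

# Hypothetical pattern; real book review metadata was not this consistent.
BOOK_REVIEW = re.compile(r"\bbook\s+reviews?\b", re.IGNORECASE)
# Escaped tags left in text, such as &lt;sec&gt; or &lt;/it&gt;.
ESCAPED_TAG = re.compile(r"&lt;/?\w+&gt;")


def clean_collection(records):
    """Drop book reviews and duplicates, then strip leftover escaped tags."""
    seen = set()
    kept = []
    for rec in records:
        if BOOK_REVIEW.search(rec["title"]):
            continue  # book review: not wanted in the collection
        if rec["identifier"] in seen:
            continue  # duplicate identifier: keep only the first record
        seen.add(rec["identifier"])
        kept.append({**rec, "title": ESCAPED_TAG.sub("", rec["title"])})
    return kept


# Invented sample records, for illustration only.
records = [
    {"identifier": "doi:10.0000/a", "title": "Book Review: Something"},
    {"identifier": "doi:10.0000/b", "title": "On &lt;it&gt;x&lt;/it&gt; flows"},
    {"identifier": "doi:10.0000/b", "title": "On &lt;it&gt;x&lt;/it&gt; flows"},
]
print(clean_collection(records))
```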

The Oxford Journals target is added to the live PerX Pilot service.

Feedback regarding the issues encountered is passed to Oxford Journals via email. At the time of writing, Oxford Journals are looking into the issues and aim to communicate with their OAI online hosts [HighWire Press] in order to resolve them.


Maintenance of Target

Work is ongoing to enable effective automatic harvesting of the Oxford Journals sets required by PerX.