OAI-PMH Metadata Augmentation Report
M. Moffat, S. Chumbe and R. MacLeod, S.Chumbe@hw.ac.uk - Ver 1.0 (28/03/07)
- Possible Metadata Enhancements
- Implemented Metadata Enhancements
In an ideal world, OAI Service Providers would simply harvest metadata via OAI-PMH, index the collections locally and make them available for searching. However, experience gained from PerX and from a range of other projects (e.g. NSDL (1), Stargate (2)) shows that this ideal is rarely achieved in practice. All too often, much of the metadata produced by data providers contains errors and omissions which cause problems for service providers or, in the worst case, make the metadata unusable. Dushay and Hillmann (1) identified four categories of problems with OAI metadata:
- Missing Data. Often this arises because the entire collection consists of a particular type of information, so the metadata is deemed unnecessary for the collection's local purposes (e.g. journal title, where all items in the collection are from the same journal, or data type, where all of the items in the collection are images).
- Incorrect Data, e.g. metadata is presented in wrong or inappropriate fields.
- Confusing Data, e.g. HTML tagging within metadata elements, illegal XML characters.
- Insufficient Data e.g. limitations of simple Dublin Core for adequately describing resources. In many cases qualified Dublin Core or other richer metadata formats would be more useful, but often these are not available from Data Providers.
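Two of these categories, missing data and confusing data, can be flagged mechanically before indexing. The following is a minimal sketch only, not a description of any PerX component: the function name audit_record, the list of required elements and the sample record are all assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
# Hypothetical choice of elements a service provider might insist on.
REQUIRED = ("title", "identifier", "type")
# Crude pattern for HTML tags left inside element text.
HTML_TAG = re.compile(r"</?[a-zA-Z][^>]*>")

def audit_record(xml_text):
    """Return a list of (category, element) problems found in one record."""
    root = ET.fromstring(xml_text)
    problems = []
    # Missing data: a required element is absent or contains only whitespace.
    for name in REQUIRED:
        values = [e.text for e in root.iter(f"{{{DC_NS}}}{name}")]
        if not any(v and v.strip() for v in values):
            problems.append(("missing", name))
    # Confusing data: HTML tagging embedded in an element's text.
    for elem in root.iter():
        if elem.text and HTML_TAG.search(elem.text):
            problems.append(("confusing", elem.tag.split("}")[-1]))
    return problems

# Illustrative record (invented for this example).
SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Bridge fatigue &lt;i&gt;in situ&lt;/i&gt;</dc:title>
  <dc:identifier>12345</dc:identifier>
</oai_dc:dc>"""

print(audit_record(SAMPLE))
# [('missing', 'type'), ('confusing', 'title')]
```

Incorrect and insufficient data are harder to detect automatically, since judging whether a value sits in the right field, or whether simple Dublin Core is adequate, generally needs knowledge of the collection.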
2. Possible Metadata Enhancements
The PerX team identified a four-tiered approach to possible metadata enhancements for the PerX pilot cross search service:
- General Safe Transforms (‘normalisations’) which would ultimately produce valid XML files for consumption by PerX. These normalisations would:
- remove non-standard characters in the metadata (e.g. smart quotes, long dashes, TM).
- remove unnecessary spaces between < tags >.
- remove html markup that invalidates XML.
- remove empty metadata elements and double XML encodings.
- Basic Collection Specific Transforms which would aim to ensure that the harvested data from particular collections was actually usable by PerX. For example, addition of specific fields to records (e.g. metadata records may need a journal title, or URL added). Other possibilities include the addition of rights information, publisher information, format details or the deletion of duplicate records.
- Work with individual metadata providers to help them enhance their own OAI-PMH metadata. It was envisaged that enhanced dialogue and information flow between data and service providers would often be helpful in resolving many of the issues with problem metadata. Such work might involve discussions around the preferred means of presenting information (e.g. format of author names), use of more detailed schemes than simple Dublin Core, or simply ensuring that all of the required data elements useful to Service Providers were presented in the metadata.
- Advanced metadata enhancements
Clearly, many more possibilities exist which could actually enhance the metadata exposed by data providers. Some possibilities considered by PerX included:
- Addition of appropriate Subject Terms. The eprintsUK Project proposed a Web Service (3) to enhance metadata in such a way, although the service itself never came to fruition.
- Checking Author names against a name authority file. Again the eprintsUK Project proposed such a Web Service (3) but this was not delivered.
- Harvested data link checker.
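The general safe transforms in the first tier lend themselves to simple pattern matching over the harvested XML text. The sketch below illustrates the idea and is not the PerX implementation: the function name normalise, the regular expressions and the list of HTML tag names are assumptions, and typographic characters are replaced with plain equivalents rather than deleted.

```python
import re

# Typographic characters mapped to plain equivalents. Substituting rather than
# deleting is an assumption made here to avoid losing meaning.
CHAR_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single smart quotes
    "\u201c": '"', "\u201d": '"',   # double smart quotes
    "\u2013": "-", "\u2014": "-",   # long dashes
    "\u2122": "(TM)",               # trade mark sign
})

# "< dc:title >" or "< /dc:title >" (tags without attributes only).
TAG_SPACES = re.compile(r"<\s*(/?)\s*([A-Za-z_][\w:.-]*)\s*>")
# "&amp;lt;" and similar entities that were escaped twice.
DOUBLE_ENCODED = re.compile(r"&amp;(amp|lt|gt|quot|apos|#\d+|#x[0-9A-Fa-f]+);")
# Escaped HTML inside element text, e.g. "&lt;b&gt;". Deliberately crude.
ESCAPED_HTML = re.compile(r"&lt;/?[A-Za-z][^&<>]*&gt;")
# A few raw HTML tags that commonly break well-formedness (illustrative list).
RAW_HTML = re.compile(
    r"</?(?:br|p|b|i|em|strong|span|div|sub|sup)(?=[\s/>])[^>]*>",
    re.IGNORECASE,
)
# Empty Dublin Core elements, e.g. "<dc:subject> </dc:subject>".
EMPTY_DC = re.compile(r"<(dc:\w+)>\s*</\1>|<dc:\w+\s*/>")

def normalise(xml_text):
    """Apply the safe transforms in a fixed order and return the new text."""
    text = xml_text.translate(CHAR_MAP)
    text = TAG_SPACES.sub(r"<\1\2>", text)
    text = DOUBLE_ENCODED.sub(r"&\1;", text)
    text = ESCAPED_HTML.sub("", text)
    text = RAW_HTML.sub("", text)
    # Run last, because stripping markup can leave an element empty.
    text = EMPTY_DC.sub("", text)
    return text

raw = ('< dc:title >Smart \u201cquotes\u201d &amp;amp; '
       '&amp;lt;b&amp;gt;bold&amp;lt;/b&amp;gt;< /dc:title >'
       '<dc:subject> </dc:subject>')
print(normalise(raw))
# <dc:title>Smart "quotes" &amp; bold</dc:title>
```

The order of the substitutions matters: double encodings are undone before escaped HTML is stripped, and empty elements are removed last. Pattern matching of this kind is heuristic, so legitimate text that happens to resemble markup could be altered.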
3. Implemented Metadata Enhancements
The metadata enhancements actually undertaken for the development of the PerX Pilot Cross Search Demonstrator are described below:
- General Safe Transforms (‘normalisations’) – While harvesting data via the PerX Administrative Interface (PAIN), all OAI-PMH resources were required to undergo an automated ‘normalisation’ process prior to their indexing and inclusion within the cross search service. The normalisation process utilises pattern matching techniques to:
- Convert UTF-8 encodings.
- Remove/correct invalid elements - e.g. remove spaces between < tags >, html markup, empty metadata elements, double XML encodings and vcards.
- Check for valid URLs in <dc:identifier> elements.
- Merge common elements.
- Basic Collection Specific Transforms – Across the collections included in the PerX demonstrator (~40) and the additional collections investigated for possible inclusion (~50), the most common collection specific transform was simply to obtain a functional identifier (URI) with which to link to the resource. In some instances the <dc:identifier> field contained only an ID or ‘abstract number’ and not the full URI (e.g. ICE Virtual Library, Public STINET – DTIC Technical Reports). In these cases effort was required to establish appropriate URIs, which was not always straightforward (for example, different base URIs applied to different types of ID within a collection, or a base URI worked for some records but not others). In other cases the identifier was misplaced, appearing in the record header rather than in the record metadata itself. In some instances the collection specific transforms resulted in viable collections for cross-search. In others, despite considerable effort and multiple attempts to contact the appropriate system administrators, it proved impossible to resolve the outstanding issues, and the collections could not be used (e.g. Public STINET).
Other basic collection specific transforms included the merging of metadata elements (e.g. merging of <dc:subject> elements or merging of <dc:description> with other descriptive elements) and the splitting of metadata elements (e.g. splitting of a <dc:title> element which contained the title along with the creator metadata).
It is worth noting that at least half of all OAI targets used for the PerX cross search required some manually performed basic collection specific transforms in addition to the normalisation process.
- Work with individual metadata providers. During the lifetime of the project, contact was attempted with a number of OAI data providers. In some instances this proved useful and the relationship was of benefit to both parties, for example:
- JORUM – PerX assisted with initial testing of the beta JORUM OAI repository. Feedback was provided to the data provider and resulted in the resolution of some minor issues (e.g. use of vcard data in <dc:creator>, problem with <dc:identifier> in relation to access to downloadable learning object).
- Australian Digital Thesis (ADT) – PerX provided feedback on issues regarding their use of <dc:identifier>, which in some instances contained a full URI and in others did not. Some items in the collection were available as full text online (valid URI in the identifier), while for others the full text was not available online (‘abstract number’ in the identifier field). ADT elected to create an OAI set of only ‘Digital Copy’ items to address the issue.
- Oxford Journals - PerX provided feedback to Oxford Journals relating to empty OAI sets and a range of metadata issues (e.g. duplicate metadata records, un-resolvable DOIs, and use of the <dc:data> element).
- GRADE – PerX assisted with initial testing of the beta GRADE OAI repository. Feedback was provided to GRADE regarding numerous non-functional links in <dc:identifier>.
- Higher Education Academy Joint Engineering/Materials Subject Centre Resource Database - PerX assisted with the initial testing of a beta OAI repository. Feedback to the HE subject centre resulted in a number of minor fixes.
In other instances contact was attempted, sometimes via multiple routes, but no response was forthcoming from the data provider (e.g. OSTI, Public STINET).
- Advanced metadata enhancements.
No advanced metadata enhancements were possible within the timescales of the project.
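The basic collection specific transforms described above (building a working URI from a bare ID, merging <dc:subject> elements, and splitting a <dc:title> that also carries the creator) can be sketched as small per-collection functions. This is an illustrative sketch under stated assumptions, not the PerX code: the function names, the base URI http://repository.example.org/abstract/ and the " / " delimiter are hypothetical, and in practice each would have to be established collection by collection.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)

def is_http_url(value):
    """Syntactic check only: the value has an http(s) scheme and a host."""
    parts = urlparse(value.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)

def fix_identifier(record, base_uri):
    """Prefix bare IDs in dc:identifier with the collection's base URI."""
    for ident in record.iter(f"{{{DC}}}identifier"):
        value = (ident.text or "").strip()
        if value and not is_http_url(value):
            ident.text = base_uri + value

def merge_subjects(record, separator="; "):
    """Merge sibling dc:subject elements into the first one."""
    subjects = record.findall(f"{{{DC}}}subject")
    if len(subjects) < 2:
        return
    subjects[0].text = separator.join(
        s.text.strip() for s in subjects if s.text and s.text.strip()
    )
    for extra in subjects[1:]:
        record.remove(extra)

def split_title(record, delimiter=" / "):
    """Move text after the (hypothetical) delimiter into a new dc:creator."""
    title = record.find(f"{{{DC}}}title")
    if title is None or not title.text or delimiter not in title.text:
        return
    head, _, tail = title.text.partition(delimiter)
    title.text = head.strip()
    creator = ET.SubElement(record, f"{{{DC}}}creator")
    creator.text = tail.strip()

# Invented record; the functions expect the element that directly
# contains the dc:* children (here oai_dc:dc).
record = ET.fromstring(
    '<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/">'
    '<dc:title>Soil mechanics / A. Author</dc:title>'
    '<dc:subject>Geotechnics</dc:subject><dc:subject>Soils</dc:subject>'
    '<dc:identifier>ADA123456</dc:identifier></oai_dc:dc>'
)
fix_identifier(record, "http://repository.example.org/abstract/")
merge_subjects(record)
split_title(record)
print(ET.tostring(record, encoding="unicode"))
```

A syntactic URL check of this kind does not confirm that a link resolves. As noted above, a base URI may work for some records and not others, so constructed URIs would still need to be tested against the live service.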
A brief case study of the PerX experience of using OAI-PMH with Oxford Journals illustrates some of the above metadata augmentation issues.
The metadata augmentation implemented via the PerX project was relatively modest, and largely centred on resolving problems with harvested metadata rather than on more advanced enhancements. In a presentation describing the building of the National Science Digital Library (NSDL), Dean Krafft (4) notes that "Metadata is expensive - unlike traditional libraries, digital collections have very 'mixed quality' metadata, with unusual and inconsistent coding." These inconsistencies ultimately require a greater investment of staff time from service providers.
Although more advanced enhancements were effectively out of scope due to the time spent simply creating usable metadata, there is clearly much potential for other enhancements. A JISC-funded study, Metadata Generation for Resource Discovery, is currently under way; it aims to identify the most promising new techniques and approaches emerging from recent experimental research into automated metadata generation.
While there is clearly much interest in the potential of automated metadata generation and metadata enhancement, it is perhaps worth noting that JORUM's review of metadata automation systems (5) concluded that:
"the increased application of systems and process to automate metadata will not result – in the foreseeable future at least – in the obsolescence of the human in the metadata creation process.....much of the metadata which can be automatically generated relates to its technical properties, repository users will typically need more subjective metadata to enable them to assess their retrieval results."
(1) Dushay, N., Hillmann, D. (2003). Analyzing Metadata for Effective Use and Re-Use. [Available Online at
(2) Robertson, J.R. (2006). Stargate Final Report. [Available Online at
(3) ePrints UK - Web Service Interfaces. [Available Online at
(4) Krafft, D. (2006). Building a National Science Digital Library. Presentation. [Available Online at
(5) Baird, K. (2006). Automated Metadata - A review of existing and potential metadata automation within Jorum and an overview of other automation systems. [Available Online at