Pilot Engineering Repository Xsearch

PerX Setup & Maintenance Issues

S. Chumbe & M.Moffat (S.Chumbe@hw.ac.uk) - Ver 1.0 (28/03/07)

Contents

  1. Introduction
    1.1 Recording Setup and Maintenance Effort

  2. Setup & Maintenance of OAI Targets
    2.1 Brief Introduction to OAI targets
    2.2 Available Harvesting Tools (Review)
    2.3 Harvesting & Automated Harvesting via PAIN
    2.4 Quantified Estimate of OAI Setup & Maintenance Effort
    2.5 OAI-PMH Setup & Maintenance Challenges

  3. Setup & Maintenance of Z Targets
    3.1 Brief Introduction to Z targets
    3.2 Quantified Estimate of Z39.50 Setup & Maintenance Effort
    3.3 Z39.50 Setup & Maintenance Challenges

  4. Setup & Maintenance of Non Standard Targets
    4.1 Brief Introduction to Non Standard Targets
    4.2 Quantified Estimate of Non Standard Target Setup & Maintenance Effort
    4.3 Non Standard Target Setup & Maintenance Challenges

  5. Summary of Setup & Maintenance Effort

  6. Analysis and Conclusions
    6.1 Challenges for Subject Based Cross Search Services
    6.2 Possible Approaches to OAI-PMH Challenges
    6.3 Conclusions

  7. Appendices
    Appendix A. Forms for Recording Setup & Maintenance Effort
    Appendix B. Tabular Recording of Setup and Maintenance Effort
    Appendix C. Brief Review of OAI-PMH Tools (July 2005)
    Appendix D. Brief Review of OAI Service Providers (July 2005)
    Appendix E. Quantified Estimate of OAI Setup & Maintenance Effort
    Appendix F. Quantified Estimate of Z39.50 Setup & Maintenance Effort
    Appendix G. Quantified Estimate of Non Standard Target Setup & Maintenance Effort

  8. References

1. Introduction

An important aspect of the PerX project was to consider the setup and maintenance issues relating to the cross-search targets (OAI-PMH, Z39.50 and non-standard) included in the PerX pilot service, and to attempt to quantify the effort required. This document provides details of the setup and maintenance effort expended by the PerX team over a 12 month period between March 2006 and February 2007. Issues arising from setting up and maintaining PerX cross-search targets are discussed, and conclusions are drawn.

1.1. Recording Setup and Maintenance Effort

At the start of the project, standard forms were designed to attempt to record details of OAI and Z39.50 target setup and maintenance for the duration of the project (Appendix A). After initial usage of these forms, it became clear that the overhead involved in maintaining this level of detail was prohibitive, so a simpler record of time spent on each collection was designed, using a simple table format. Each time a collection was considered, set up or required maintenance, the table was updated accordingly with the approximate times spent. Expended effort was categorised as analysis time, setup time and maintenance time (Appendix B). Note that all times are approximate and are estimated on a best effort basis.

The analysis time for each collection included estimated effort for the following elements:

[Note that time spent initially identifying suitable collections is not included in the setup time. Within the PerX project this effort was undertaken earlier in the project as a work package investigating available Engineering Repository Sources which itself constituted considerable effort].

The setup time included the following elements:

  • Addition of collection details into the PerX Administrative Interface (PAIN). This included creation of a general collection description (name, description, URL, logo etc) and input of the basic technical details required to utilise the collection (base URL for OAI, target details for Z39.50).
  • Initial Set-up in PerX. (Harvesting/normalisation/indexing for OAI targets, setup of Z39.50 targets to permit appropriate query syntax construction and parsing of search results).
  • Contact with Data Providers relating to the initial set-up of the collection.

The maintenance time indicates the time spent maintaining the collection after the initial collection setup effort was complete.

2. Setup & Maintenance of OAI Targets

2.1. Brief Introduction to OAI targets

The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a simple protocol that allows data providers to expose their metadata for harvesting. It supports the regular gathering of metadata from one service to another. OAI-PMH is based on common underlying Web standards - HTTP, XML and XML schemas - which theoretically makes it a 'low barrier' approach for potential data providers who are already running a web server. However, as previously reported by PerX (Chumbe et al 2006), there are a number of obstacles which must be overcome when attempting to utilise this 'low barrier' OAI-PMH approach. Others have reported similar issues. For example, Lagoze et al (2006), commenting on the National Science Digital Library (NSDL) experience of using OAI-PMH, noted that it was "often harder to deploy than expected". The original expectations of the NSDL team, that the OAI-PMH approach would involve high levels of automation and low people costs, were contradicted by three years' experience, which instead led them to report that:

  • In a few cases the automated harvesting of metadata has "proceeded smoothly, but the vast majority of cases require significant manual intervention."
  • Harvesting failure rates for the NSDL project have "stubbornly hovered between 25-50%, necessitating constant and ongoing human intervention".
  • "the number of components and variables to be managed has frequently interfered with our efforts to handle the process automatically, forcing us to fall back on 'expensive' human intervention."

 

2.2. Available Harvesting Tools (Review)

Near the start of the PerX project (July 2005) a range of OAI-PMH tools were investigated as potential solutions for ongoing OAI harvesting and maintenance (Appendix C). These included OAI-PMH Pack, ARC, and the PKP OAI Harvester. OAI-PMH Pack was deemed unsuitable as it utilised a development environment (Python) that was inappropriate for PerX. From July 2005 to February 2006, ARC was utilised for harvesting as it was an integrated component of the Subject Portals Project (SPP) software initially used by PerX. Problems with ARC included incomplete error handling mechanisms, memory problems within the SPP context and a lack of ongoing support (ARC is no longer maintained). During March 2006 the PKP OAI Harvester version 1 was trialled. It proved to be a reasonable solution for small collections but was unable to deal adequately with larger collections or invalid XML. A decision was therefore taken in April 2006 to attempt to develop an in-house harvester as part of the PerX Administrative Interface (PAIN). This approach appeared to be vindicated at the time, because a number of OAI service providers (e.g. NSDL, OAIster, ARROW) were also investigated (Appendix D) and in a number of cases the software utilised for harvesting was developed in-house or supplied commercially.

It is perhaps salient to note that high profile OAI Service Providers such as NSDL or OAIster have clearly expended considerable effort to implement automated harvesting approaches. The core integration team at NSDL which is responsible for metadata aggregation consists of around fifteen staff and has an annual budget of 4 million US dollars. OAIster utilise their own in-house software for automated harvesting (DLXS) and employ two full time software developers for OAI harvesting and data normalisation. In contrast, the PerX project employs a single part time software developer (0.25 FTE) for all technical aspects of the project.

2.3. Harvesting & Automated Harvesting via PAIN

The PerX Administrative Interface (PAIN) was developed to allow administration of search targets for the cross search. PAIN allows the addition of new targets of a range of types, including Z39.50, OAI-PMH, SRU and non-standard databases (Figure 1). For OAI-PMH repositories, once the details have been added, facilities are provided to allow administrators to Identify, Harvest, Normalise and Index these repositories.


Figure 1. PerX Administrative Interface (PAIN)

Manual harvesting of OAI-PMH repositories using this interface was partially successful. In some instances administrators were able to complete a harvest, and progress to normalisation and indexing of the collection. In other instances manual harvesting via PAIN proved problematic (see section 2.5) and technical intervention was required.

Automating the harvesting process via PAIN proved to be a considerable challenge for the following reasons:

  1. Repository Errors - Repository failed to respond to OAI-PMH requests or delivered error messages (e.g. timing out or service unavailable (503) errors, empty or incomplete results returned).
  2. Inability to Implement Incremental Harvesting - Theoretically, OAI-PMH supports incremental harvesting, where a service provider performs a single initial full harvest followed by repeated smaller scale incremental harvests to keep the metadata up to date (e.g. added records, modified records, deleted records). However, incremental harvesting is often impossible because support for deleted records is inconsistently implemented by data providers. In practice, often the only reliable way to ensure that a data provider's metadata is up to date is to perform another full harvest.
  3. Problems with Resumption Tokens - Resumption token errors occur or the repository does not provide resumption tokens (e.g. at time of writing JORUM does not provide resumption tokens despite the collection being over 1000 items).
  4. XML Errors - XML errors which cause the harvesting process to fail.
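
The four failure points listed above can be seen in a minimal harvesting loop. The following Python sketch is illustrative only and is not the PAIN harvester: the function names (`harvest`, `http_fetch`) and the injectable `fetch` parameter are assumptions made for the example. It issues ListRecords requests, follows resumption tokens until the repository returns none, flags records marked as deleted, and lets malformed XML surface as a parse error.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 responses.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def http_fetch(url):
    """Default fetcher (hypothetical helper): return the response body as bytes."""
    with urllib.request.urlopen(url, timeout=60) as response:
        return response.read()


def harvest(base_url, metadata_prefix="oai_dc", from_date=None, fetch=http_fetch):
    """Yield (identifier, is_deleted) for every record header returned.

    from_date, if given, requests an incremental harvest. This is only
    trustworthy when the repository reports deleted records.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    while True:
        url = base_url + "?" + urllib.parse.urlencode(params)
        # ET.fromstring raises ET.ParseError if the XML is not well formed.
        root = ET.fromstring(fetch(url))
        error = root.find(OAI_NS + "error")
        if error is not None:
            if error.get("code") == "noRecordsMatch":
                return  # nothing new since from_date
            raise RuntimeError(f"OAI-PMH error {error.get('code')}: {error.text}")
        for header in root.iter(OAI_NS + "header"):
            identifier = header.findtext(OAI_NS + "identifier")
            yield identifier, header.get("status") == "deleted"
        token = next(root.iter(OAI_NS + "resumptionToken"), None)
        if token is None or not (token.text or "").strip():
            return  # last page, or the repository issues no tokens
        # A resumptionToken is an exclusive argument: send it with the verb only.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
```

Passing a replacement `fetch` function makes it possible to add retries for timeouts and 503 responses, or to exercise the loop against saved responses, without changing the paging logic.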

Despite investigating a number of third party harvesting tools and investing effort into automating the harvesting process, this part of the project proved difficult. Automatic harvesting was implemented for only a small number of PerX targets and this was achieved in the latter stages of the project. Basic automatic harvesting was enabled for five targets (ARC, GROW, Inderscience, Oxford Journals and SearchLT) which permitted these targets to be reharvested on a weekly basis.

2.4. Quantified Estimate of OAI Setup & Maintenance Effort

A detailed record of the effort required to set up and maintain OAI-PMH targets is provided in Appendix E.

Target Analysis & Setup. Twenty seven OAI targets were selected for inclusion in PerX. In total, approximately 231 hours were spent on the analysis and initial setup of these targets, an average of 8.5 hours per target. It is worth pointing out that a small number of OAI targets required a high level of setup effort, thus inflating the average figure. In addition, not all selected targets were successfully included in PerX due to various issues, and if this is taken into consideration the time spent per successfully added target rises to 10.5 hours.

Target Maintenance. Quantification of the maintenance effort required for OAI targets was problematic for the following reasons:

  • Automatic harvesting was not successfully implemented until the latter stages of the project. For the bulk of the project manual harvesting was therefore utilised.
  • The frequency of manual harvesting was generally low – most targets were re-harvested infrequently during the duration of the project although there were some exceptions (see Appendix E).
  • Automatic harvesting was implemented for only a minority of PerX targets (ARC, GROW, Inderscience, Oxford Journals and SearchLT). Some targets, especially some of the larger ones, tended to cause repeated problems, necessitating manual harvesting.
  • Additional targets were added in an ad-hoc nature throughout the project, as and when they became available.
  • The frequency of scheduled automated harvesting varied.

Bearing these limitations in mind, the approximate figure recorded by PerX staff averaged out at 3.4 hours per target. While this figure appears low, it is worth noting that in the PerX project emphasis was placed on achieving a ‘critical mass’ of engineering targets rather than ensuring all existing targets were up to date. Most targets were therefore harvested infrequently and clearly this would be inappropriate for live service provision.

2.5. OAI-PMH Setup & Maintenance Challenges

Examining the details of the recorded OAI setup and maintenance effort (Appendix E) reveals a number of challenges for OAI service providers:

  • Successful ongoing automated harvesting is difficult to implement and often requires manual intervention (see above).
  • Incremental harvesting via OAI-PMH is very often not possible.
  • OAI Servers may not support desirable features (e.g. JORUM does not support ListRecords or Resumption Tokens).
  • OAI Servers may not validate (e.g. when checked with the validator at http://re.cs.uct.ac.za/)
  • Repositories may not respond to OAI requests or may deliver only error messages
    (e.g. OSTI)
  • OAI Sets are often problematic. For example, sets are often not provided (e.g. NDLTD, ADT); sets are complex and require detailed analysis (e.g. Oxford Journals); or sets reflect the data provider's internal structure and make no sense to potential service providers.
  • Very commonly the XML file returned is not well formed
    (e.g. NASA)
  • Contact between data providers and service providers is frequently required
    (e.g. IOP, JORUM, ADT, Oxford Journals, IStructE)
  • OAI Administrators may not respond to queries
    (e.g. OSTI, STINet, Info Bridge)
  • Content exposed via a data provider's OAI-PMH interface may not match that available via the data provider's own web interface
    (e.g. ADT, DIVA, IOP)
  • Field content in the exposed metadata may vary enormously
    (e.g. even the dc_language element provides much variation)
  • Links to the documents in the exposed metadata may not be included or do not work (e.g. STINet)
  • Access to the full text may be restricted in some way
    (e.g. ADT, Info Bridge)
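
For the malformed XML problem in the list above, a harvested response can be checked before it is passed on to normalisation and indexing. The sketch below is an assumption about how such a check might look, not a description of PAIN; the function name `check_well_formed` is hypothetical. It tests only for well-formedness, not for validity against the OAI-PMH schema.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_bytes):
    """Return (True, None) if the XML parses, else (False, description).

    The description includes the line and column reported by the parser,
    which is useful when reporting the problem to a data provider.
    """
    try:
        ET.fromstring(xml_bytes)
    except ET.ParseError as exc:
        line, column = exc.position
        return False, f"line {line}, column {column}: {exc}"
    return True, None
```

Recording the reported position alongside the request URL gives the repository administrator enough information to locate the offending record.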

The OAI-PMH Dublin Core approach, which is perhaps 'low barrier' from the perspective of data providers, creates substantial challenges from the perspective of service providers. The high levels of flexibility in both the OAI-PMH specification and Dublin Core combine to create a series of obstacles for service providers: constant and ongoing consultation with data providers, difficulties in automating the harvesting process, and ultimately a need for a high level of expert human intervention. As Brogan (2006) notes "OAI-PMH's simplicity may translate into myriad problems for harvesters, especially in cases where data providers do not implement some of the 'optional features' that are most helpful to building aggregations. Consequently, service providers are often confronted by inconsistent, insufficient, or incomplete data that limit their ability to build meaningful aggregations for end-users".

OAI-PMH allows multiple forms of metadata to be exposed but mandates oai_dc as a minimum requirement.  While some of the metadata challenges which are evident could potentially be addressed via the exchange of more tightly constrained metadata formats, it is clear that many data providers have considerable difficulty dealing effectively with oai_dc (and perhaps the same could also be said for service providers).  The expectation of the provision of richer formats is simply unrealistic in many cases.  In the twenty-seven OAI-PMH repositories investigated for PerX, only five provided alternative metadata formats in addition to oai_dc.

3. Setup & Maintenance of Z Targets

3.1. Brief Introduction to Z targets

Z39.50 is a client/server protocol for searching and retrieving information from remote computer databases. It specifies procedures and formats for a client to search a database provided by a server, retrieve database records, and perform a number of related information retrieval functions such as sort and browse. Remote service providers, such as portals or aggregators, can query a Z39.50 interface and receive back search results in a standard reusable format. The metadata transferred in a Z39.50 search is not usually stored, and is used only transiently for the duration of a single search. Thus every search request generates a new query to the data provider database and the transfer of search results.

3.2. Quantified Estimate of Z39.50 Setup & Maintenance Effort

A detailed record of the effort required to set up and maintain Z39.50 targets is provided in Appendix F.

Target Analysis & Setup. Fourteen Z39.50 targets were selected for inclusion in PerX. In total, approximately 107 hours were spent on the analysis and initial setup of these targets, an average of 7.6 hours per target. This figure is slightly lower than the figure for analysis and setup of OAI targets (8.5 hours per target).

Target Maintenance. Quantification of maintenance effort for Z39.50 targets was relatively straightforward. In total, the 14 targets required 20 hours of maintenance, an average of 1.4 hours per target.

3.3. Z39.50 Setup & Maintenance Challenges

Examining the details of the recorded Z39.50 setup and maintenance effort (Appendix F) reveals a number of issues which required effort from the PerX team:

  • Complex Z39.50 implementations which on occasion require contact with the data provider
    (e.g. CISTI, ePrintsUK)
  • Z39.50 Targets which don’t respond or return errors
    (e.g. Aerade, NAGP, PSIgate)
  • Z39.50 Targets which change location/names
    (e.g. EEVL & PSIgate became Intute)
  • Z39.50 Targets which are withdrawn from service
    (Aerade, NAGP, EESE, EEVL Websearch)

In addition to these problems, it was evident that Z39.50 targets occasionally experience transient downtime, with the result that they failed to function in the PerX pilot service. Often this would involve no actual maintenance for PerX personnel, as these services simply come back up spontaneously. However, even in these cases, transient downtime does result in quality of service issues, as end users simply see that things are not working in the cross-search interface. In an attempt to inform PerX end users more clearly of these ‘transient problems’, warning messages and icons were utilised as follows:

  • [warning icon] Collection Not Responding - This collection is not responding to queries. The system administrator has been informed and will investigate further. Please try again later.
  • [warning icon] Collection Timed Out - This collection has timed out. The collection may be busy, please try again later.

To alert PerX service administrators of Z target downtime, a cron job was scheduled to automate testing of all Z targets and email the administrators when servers were not responding.
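
The monitoring approach described above could be sketched roughly as follows. This is not the PerX cron script, whose details are not given here. The function names and the target names, hosts and ports passed in are placeholders, and the check is a plain TCP connection attempt, which is an assumption: it shows only that the port is reachable, not that the server answers Z39.50 requests correctly.

```python
import socket
from email.message import EmailMessage


def is_reachable(host, port, timeout=10.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def unreachable_targets(targets, timeout=10.0, probe=is_reachable):
    """Return the names of targets whose (host, port) does not respond."""
    return [name for name, (host, port) in targets.items()
            if not probe(host, port, timeout)]


def build_alert(down, sender, recipient):
    """Build the notification email listing the targets that are down."""
    message = EmailMessage()
    message["Subject"] = f"{len(down)} Z39.50 target(s) not responding"
    message["From"] = sender
    message["To"] = recipient
    message.set_content("The following targets did not respond:\n" + "\n".join(down))
    return message
```

A scheduled job would call `unreachable_targets` with the list of configured targets and, if the result is non-empty, send the message from `build_alert` with `smtplib.SMTP(...).send_message(...)`. The sending step is left out of the sketch because it depends on the local mail setup.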

4. Setup & Maintenance of Non Standard Targets

4.1. Brief Introduction to Non Standard Targets

“The beauty of exposing metadata in a standard way is that little effort is required for third parties to reuse your metadata, and make it available to their visitors. Standard metadata is therefore an investment in current and future interoperability”
PerX Advocacy Material

While the focus of the PerX cross-search has always been on standard interoperable sources, it is clear that in practice some potential content providers are unable or unwilling to adopt this approach. Experience from the PerX advocacy work with a number of content providers indicates that many already share metadata with business partners via established proprietary means (e.g. FTP, XML gateways, custom XML schemas, etc). Persuading such players to invest in further mechanisms for standardised metadata exchange often proves to be an uphill struggle. While the concept of standardised metadata exchange sounds attractive, many already have real world solutions in place and are unwilling to expend what is seen to be additional effort to achieve 'standardised interoperability'. Often, PerX advocacy work was well received but ultimately resulted in content providers offering access to their metadata via non-standard means.

A small number of non standard targets with strong engineering content were therefore set up and maintained throughout the project. These included:

  • Institution of Civil Engineers (ICE) Virtual Library - Metadata obtained via email.
  • Emerald Engineering Journals - Metadata obtained via email.
  • Google - via Google API.
  • Pearson Education (Engineering Books) - Metadata obtained via FTP.

 

4.2. Quantified Estimate of Non Standard Target Setup & Maintenance Effort

A record of the effort required to set up and maintain non standard targets is provided in Appendix G.

Target Analysis & Setup. Four non standard targets were selected for inclusion in PerX. In total, approximately 38 hours were spent on the analysis and initial setup of these targets, an average of 9.5 hours per target. This figure is higher than the figures for analysis and setup of both OAI targets (8.5 hours per target) and Z39.50 targets (7.6 hours per target).

Target Maintenance. In total, the 4 non standard targets required 6 hours of maintenance, an average of 1.5 hours per target. Automatic updating of non standard targets was not achieved. Maintenance of the ICE target was not necessary due to the historical nature of the collection, and the Pearson target was manually updated four times via FTP. Plans to update the Emerald Engineering Journals collection via RSS are outstanding at the time of writing.

4.3. Non Standard Target Setup & Maintenance Challenges

No specific challenges relating to non standard targets were identified. Generally speaking, slightly more setup time was required for these targets and maintenance effort varied depending upon the approach taken.

5. Summary of Setup & Maintenance Effort

Table 1 summarises the overall set-up and maintenance effort for the 12 month period where details were recorded.

Target Type    Number of   Total Analysis &   Average Analysis &   Total Maintenance   Average Maintenance
               Targets     Setup Effort       Setup Effort         Effort              Effort
                           (hours)            (hours)              (hours)             (hours)
OAI-PMH        27          231                8.5                  92*                 3.4*
Z39.50         14          107                7.6                  20                  1.4
Non-Standard   4           38                 9.5                  6                   1.5
Totals         45          376                8.4                  118*                2.6*

Table 1. Overall Set-up and Maintenance Effort over 12 Month Period.
(*Note: OAI-PMH targets were infrequently maintained during the project, see Section 2)

  • Overall, 45 targets were investigated and approximately 494 hours were spent solely setting up and maintaining targets.
  • The average target analysis and setup effort for OAI and Z39.50 targets was broadly similar (8.5hrs versus 7.6hrs respectively). Non Standard targets required marginally greater setup effort (9.5hrs).
  • Recording of maintenance effort was problematic for OAI targets (see section 2) but was straightforward for Z39.50 targets. The average maintenance effort figures for OAI and Z39.50 targets (3.4hrs versus 1.4hrs respectively) disguise the fact that the OAI targets were poorly maintained and harvested infrequently throughout the pilot. The experience gained from the PerX project suggests that successful ongoing maintenance of OAI targets would require a mixture of automated and manual approaches, and that the level of ongoing maintenance required for OAI targets in a live service would be relatively high. The PerX Technical Officer estimates that a more realistic figure for adequately maintaining OAI-PMH targets is likely to be in excess of the average analysis and setup effort (i.e. greater than 8.5hrs per target per annum).
  • Substantial effort was expended on targets which ultimately could not be included in the cross-search (e.g. STINet, OSTI) or targets which were included, but subsequently withdrawn (e.g. Aerade, EEVL, ePrintsUK, NAGP, PSIgate).
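
The totals and averages in Table 1 follow directly from the per-type figures, and the short Python sketch below recomputes them. The input numbers are taken from the table; the names `EFFORT` and `summarise` are introduced only for this example. Because the recorded hours are approximate, a recomputed average can differ from the reported one in the last digit: 231 hours over 27 targets is about 8.56, which rounds to 8.6 rather than the 8.5 reported.

```python
# (number of targets, analysis and setup hours, maintenance hours) from Table 1
EFFORT = {
    "OAI-PMH": (27, 231, 92),
    "Z39.50": (14, 107, 20),
    "Non-Standard": (4, 38, 6),
}


def summarise(effort):
    """Return per-type rows plus a Totals row.

    Each row is (targets, setup hours, average setup hours,
    maintenance hours, average maintenance hours).
    """
    rows = {}
    for name, (targets, setup, maintenance) in effort.items():
        rows[name] = (targets, setup, round(setup / targets, 1),
                      maintenance, round(maintenance / targets, 1))
    targets = sum(row[0] for row in effort.values())
    setup = sum(row[1] for row in effort.values())
    maintenance = sum(row[2] for row in effort.values())
    rows["Totals"] = (targets, setup, round(setup / targets, 1),
                      maintenance, round(maintenance / targets, 1))
    return rows


if __name__ == "__main__":
    summary = summarise(EFFORT)
    for name, row in summary.items():
        print(name, *row)
    # Combined setup and maintenance effort in hours.
    print("Overall hours:", summary["Totals"][1] + summary["Totals"][3])
```

The combined figure printed at the end is the sum of the two totals, 376 + 118 = 494 hours, which matches the overall figure quoted in the text.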

For service providers such as PerX, the effort required for set-up and maintenance of targets is therefore not insubstantial. The total figure of approximately 494 hours over a 12 month period, solely for set-up and maintenance, is likely to be a conservative estimate as some issues have undoubtedly been missed in the recording process. A relatively limited number of targets have been utilised for the pilot – many others are potentially available. Inclusion of more and more targets clearly escalates the set-up and maintenance effort required, which needs to be borne in mind in relation to the possible funding of actual cross-search services. It is also worth noting that the PerX technical team had considerable prior experience of both Z39.50 and OAI-PMH standards and that the recorded effort does not include identification of suitable targets or software development time. The high investment in staff time required perhaps goes some way to explaining the relative lack of similar subject based services which are currently in existence.

6. Analysis and Conclusions

6.1. Challenges for Subject Based Cross Search Services

An array of challenges exists for services which aim to provide subject based cross-search services utilising harvested data, distributed searching and non standard means. Furthermore, these challenges differ depending upon the means by which targets are cross searched.

  • For harvested sources (OAI-PMH), challenges typically include: automation of harvesting, malfunctioning repositories, inability to contact repository administrators, issues regarding sets, malformed XML, and metadata quality issues [See Oxford Journals Case Study]. From our experience it would be fair to conclude that while a minority of OAI-PMH services are professionally set up and maintained, many others are relatively immature, inadequately tested and not well supported. Thus service providers must expend considerable effort consulting with data providers to overcome problems.
  • For distributed search targets (e.g. Z39.50), challenges include: dealing with complex Z39.50 implementations, target downtime and errors, target name changes and withdrawn targets. In addition, issues around transient target downtime, which themselves require no actual maintenance, do result in quality of service issues for service providers.

By including different types of targets (i.e. a hybrid cross search approach) end users clearly benefit from a wider range of content sources being available. Establishment of a 'critical mass' of relevant subject based materials is likely to be an important aspect in the creation of viable cross-search services, as discussed in the PerX Engineering Landscape Analysis. However, from a service provider's viewpoint, each target type brings its own array of challenges which must be addressed in order to provide a reasonable level of service. In addition to the issues raised above, such a hybrid approach clearly raises issues for service providers relating to the subsequent search functionality which can be offered to end users. Perhaps the most obvious example relates to the fact that harvested data is held locally whereas distributed search data is not. The harvested data from multiple sources can be easily combined and manipulated, enabling functionality such as merging of result sets and relevance ranking. This, although not impossible, is much more difficult to achieve with distributed search targets and is a potential drawback of the hybrid cross-search approach.

6.2. Possible Approaches to OAI-PMH Challenges

There are many limitations with the OAI-PMH approach as noted in section 2.5. Possible approaches to these challenges include:

1. Recognise the limitations of OAI-PMH and continue to use 'as is' accepting the need for a high level of human investment. While subject based services such as the PerX pilot appear to have merits and have received substantial praise from end users, the decision as to whether this approach is sustainable in the long term is open to question. The perennial question is whether such subject based services can provide a real alternative to freely available web search engines such as Google or Google Scholar. One PerX user commented "Can this service better Google? If not you will struggle to make it worthwhile". While such comparisons are inevitable, they are perhaps somewhat misleading and unhelpful. Google and its ilk are firmly established as the first port of call for the vast majority of web users on the planet, period. As such they are now an integral part of people's online information seeking behaviour for any query in any area ranging from academic endeavor, to personal pursuits, leisure, news and beyond. Subject based cross-search services can at best provide niche services to a relatively small user base - these can never aspire to be an integral part of everyday online usage but instead may provide a useful additional resource discovery tool in certain circumstances.

2. Attempt to address some of the OAI-PMH issues. Shreeves et al (2005) suggest that "If OAI is going to mature into its full potential collections as well as item records need further development, and we need richer mechanisms of creating dialog between harvesters and providers". For example, application profiles could be used to tighten up the metadata that is exchanged via OAI-PMH. The ePrints DC Application Profile has been created for use by Institutional Repositories, and JISC has funded similar application profiles for still images and time based media and is scoping further application profiles for learning materials. Developing such application profiles and 'best practices' for the provision of metadata via OAI-PMH may help to alleviate some of the issues service providers have to overcome. However, while these are interesting developments, it is likely that considerable effort will be required to encourage the uptake and use of these profiles by Data Providers. Other OAI-PMH problems relating to the high flexibility in the OAI-PMH specification itself are perhaps even more problematic to address retrospectively. Overall, whilst it seems plausible that OAI-PMH may work well in the future in particular niche areas such as Institutional Repositories, it seems likely to remain an effort-intensive option for subject based services which aim to aggregate content of many varied types.

3. Abandon OAI-PMH harvesting as a practical option and invest effort in other approaches to provide subject based cross searching. Alternative options include distributed searching (e.g. Z39.50, SRU/SRW) or automated web crawling and indexing.

While distributed searching approaches have some advantages (e.g. lower maintenance effort) they also present their own challenges (see section 3.3). In addition, the PerX Engineering Digital Repositories Landscape Analysis has indicated that, in contrast to OAI-PMH, very few SRU/SRW targets currently exist. Automated web crawling approaches are certainly worthy of further investigation. As indicated by Lagoze et al (2006) there is a recognition "that the future of collection development in the NSDL relies upon deploying these technologies [automated web crawling and indexing] as a supplement and, in many cases, a replacement for the harvesting model".

While OAI-PMH undoubtedly creates many challenges, Brogan (2006) asserts that there is "little doubt that the Open Archives Initiative Protocol for Metadata Harvesting has witnessed remarkable international adoption and growth since 2003...More than 1,000 OAI compliant archives are active across at least 46 countries with an estimated seven million links to full digital object representations...Adoption is likely to accelerate as more countries view OAI implementation as a fast-track to bringing increased visibility to indigenous scholarship." Bearing in mind that a large and growing corpus of metadata is currently available via OAI-PMH, it remains likely that potential service providers will continue to utilise this approach, 'warts and all'.

4. Engage in the development of emerging specifications aimed at exchanging digital objects. "There is growing awareness of the limitations of OAI-PMH and the Dublin Core metadata standard that underpin much of the current repository activity along with a call to develop a model and mechanism to handle complex objects held in repositories...in a more fully automated and interoperable way" (Brogan 2006). Initiatives have recently begun to emerge to address these limitations and explore possible solutions. In the UK, a Common Repositories Interfaces Working Group has been established to identify interfaces critical to the development of interoperable networks of repositories. In the US, the OAI-ORE Project (Object Reuse and Exchange) is a recently funded initiative to develop specifications that allow distributed repositories to exchange information about their constituent digital objects. It is envisaged that the ORE framework will "permit fluid reuse, refactoring, and aggregation of scholarly digital objects and their constituent parts - including text, images, data, and software. This framework would include new forms of citation, allow the creation of virtual collections of objects regardless of their location, and facilitate new workflows that add value to scholarly objects by distributed registration, certification, peer review, and preservation services." While these various initiatives are undoubtedly promising, they are in their early stages, with concrete specifications likely to be some years away.
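The difference between the harvesting model and the distributed searching model discussed under option 3 can be seen in the requests each one issues. The following Python sketch is illustrative only: the endpoint addresses and function names are hypothetical and are not taken from the PerX service. It builds an OAI-PMH ListRecords request, which a service provider would run periodically to copy metadata into a local index, and an SRU searchRetrieve request, which would be sent to the remote target each time a user searches.

```python
from urllib.parse import urlencode

# Hypothetical endpoints, used purely for illustration.
OAI_BASE = "https://repository.example.org/oai"
SRU_BASE = "https://catalogue.example.org/sru"


def oai_list_records_url(base, metadata_prefix="oai_dc", from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request (harvesting model).

    'from' restricts the harvest to records changed since a given date,
    which is how an incremental re-harvest would normally be requested.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    if set_spec:
        params["set"] = set_spec
    return base + "?" + urlencode(params)


def sru_search_url(base, cql_query, maximum_records=10):
    """Build an SRU 1.1 searchRetrieve request (distributed searching model).

    The user's query, expressed in CQL, is passed to the remote target
    at search time; nothing is stored locally.
    """
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,
        "maximumRecords": maximum_records,
    }
    return base + "?" + urlencode(params)


harvest_url = oai_list_records_url(OAI_BASE, from_date="2007-01-01")
search_url = sru_search_url(SRU_BASE, 'dc.title = "fluid dynamics"', maximum_records=5)
print(harvest_url)
print(search_url)
```

In the harvesting case, the quality of the local index depends on how often such requests are re-run and on how consistent the returned metadata is. In the distributed case, no local copy has to be maintained, but every search depends on the remote target being available and responsive.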

 

6.3. Conclusions

  1. Overall, 45 potential targets were investigated for the PerX Pilot Service: 27 OAI-PMH, 14 Z39.50 and 4 non standard targets. Approximately 494 hours were spent solely on setting up and maintaining search targets during the 12 month period between March 2006 and February 2007.

  2. The average target analysis and setup effort for OAI and Z39.50 targets was broadly similar (8.5 hrs versus 7.6 hrs respectively). Non Standard targets required marginally greater setup effort (9.5 hrs).

  3. Maintenance of Z39.50 targets was relatively straightforward (1.4 hrs per target).

  4. Maintenance of OAI-PMH targets was problematic, and harvested targets were, in general, reharvested infrequently during the period of investigation.

  5. Successful ongoing maintenance of OAI targets would require a mixture of automated and manual approaches, and the level of ongoing maintenance required for OAI targets in a live service would be relatively high.

  6. An array of challenges exists for services which aim to provide subject based cross-search services utilising harvested data, distributed searching and non standard means. Furthermore, these challenges differ depending upon the means by which each type of target is cross searched.

  7. Setup and maintenance of subject based cross-search services is complex and time intensive, even when the sources used utilise ‘standardised’ interoperability mechanisms. In particular, setup and maintenance of OAI-PMH sources present a number of challenges for service providers.

  8. Building subject based resource discovery services using technologies such as OAI-PMH is technologically feasible, but is likely to involve considerable and ongoing human effort to set up and maintain.

  9. Initiatives are underway which aim to address the limitations of OAI-PMH and repository interoperability; however, concrete specifications are likely to be some years away.

7. Appendices

  • Appendix A. Forms for Recording Setup & Maintenance Effort
    [MS Word doc 64kb]
  • Appendix B. Tabular Recording of Setup and Maintenance Effort
    [MS Word doc 29kb]
  • Appendix C. Brief Review of OAI-PMH Tools (July 2005)
    [MS Word doc 28kb]
  • Appendix D. Brief Review of OAI Service Providers (July 2005)
    [MS Word doc 38kb]
  • Appendix E. Quantified Estimate of OAI Setup & Maintenance Effort
    [MS Word doc 105kb]
  • Appendix F. Quantified Estimate of Z39.50 Setup & Maintenance Effort
    [MS Word doc 72kb]
  • Appendix G. Quantified Estimate of Non Standard Target Setup & Maintenance Effort
    [MS Word doc 41kb]

8. References