Pilot Engineering Repository Xsearch
Investigating resource discovery issues in engineering digital repositories
S.Chumbe@hw.ac.uk) - Ver 1.0 (28/03/07)
An important aspect of the PerX project was to consider the setup and maintenance issues relating to cross-search targets (including OAI-PMH, Z39.50 and non-standard) which were included in the PerX pilot service, and also to attempt to quantify the effort required. This document provides details of the set-up and maintenance effort expended by the PerX team for a 12 month period between March 2006 and February 2007. Issues arising from setting up and maintaining PerX cross-search targets are discussed, and conclusions are drawn.
At the start of the project, standard forms were designed to attempt to record details of OAI and Z39.50 target setup and maintenance for the duration of the project (Appendix A). After initial usage of these forms, it became clear that the overhead involved in maintaining this level of detail was prohibitive, so a simpler record of time spent on each collection was designed, using a simple table format. Each time a collection was considered, set up or required maintenance, the table was updated accordingly with the approximate times spent. Expended effort was categorised as analysis time, setup time and maintenance time (Appendix B). Note that all times are approximate and are estimated on a best effort basis.
The analysis time for each collection included estimated effort for the following elements:
[Note that time spent initially identifying suitable collections is not included in the setup time. Within the PerX project this effort was undertaken earlier in the project, as a work package investigating available Engineering Repository Sources, which itself constituted considerable effort.]
The setup time included the following elements:
The maintenance time indicates the time spent maintaining the collection after the initial collection setup effort was complete.
The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a simple protocol that allows data providers to expose their metadata for harvesting. It supports the regular gathering of metadata from one service to another. OAI-PMH is based on common underlying Web standards (HTTP, XML and XML schemas), which theoretically makes it a 'low barrier' approach for potential data providers who are already running a web server. However, as previously reported by PerX (Chumbe et al 2006), there are a number of obstacles which must be overcome when attempting to utilise this 'low barrier' OAI-PMH approach. Others have reported similar issues. For example, Lagoze et al (2006), commenting on the National Science Digital Library (NSDL) experience of using OAI-PMH, noted that it was "often harder to deploy than expected". The original expectations of the NSDL team, that the OAI-PMH approach would involve high levels of automation and low people costs, were contradicted by three years' experience, which instead led them to report that:
Near the start of the PerX project (July 2005) a range of OAI-PMH tools were investigated as potential solutions for ongoing OAI harvesting and maintenance (Appendix C). These included OAI-PMH Pack, ARC, and the PKP OAI Harvester. OAI-PMH Pack was deemed unsuitable as it utilised a development environment (Python) that was inappropriate for PerX. From July 2005 to February 2006, ARC was utilised for harvesting as it was an integrated component of the Subject Portals Project (SPP) software initially used by PerX. Problems with ARC included incomplete error handling mechanisms, memory problems within the SPP context and a lack of ongoing support (ARC is no longer maintained). During March 2006 the PKP OAI Harvester version 1 was trialled. It proved to be a reasonable solution for small collections but was unable to deal adequately with larger collections or invalid XML. A decision was therefore taken in April 2006 to attempt to develop an in-house harvester as part of the PerX Administrative Interface (PAIN). This approach appeared to be vindicated at the time, because a number of OAI service providers (e.g. NSDL, OAIster, ARROW) were also investigated (Appendix D) and in a number of cases the software utilised for harvesting was developed in-house or supplied commercially.
It is perhaps salient to note that high profile OAI Service Providers such as NSDL or OAIster have clearly expended considerable effort to implement automated harvesting approaches. The core integration team at NSDL which is responsible for metadata aggregation consists of around fifteen staff and has an annual budget of 4 million US dollars. OAIster utilise their own in-house software for automated harvesting (DLXS) and employ two full time software developers for OAI harvesting and data normalisation. In contrast, the PerX project employs a single part time software developer (0.25 FTE) for all technical aspects of the project.
The PerX Administrative Interface (PAIN) was developed to allow administration of search targets for the cross search. PAIN allows the addition of new targets of a range of types, including Z39.50, OAI-PMH, SRU and non-standard databases [Fig1]. For OAI-PMH repositories, once the details have been added, facilities are provided to allow administrators to Identify, Harvest, Normalise and Index these repositories.
Manual harvesting of OAI-PMH repositories using this interface was partially successful. In some instances administrators were able to complete a harvest, and progress to normalisation and indexing of the collection. In other instances manual harvesting via PAIN proved problematic (see section 2.5) and technical intervention was required.
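The Harvest step can be illustrated with a minimal sketch of an OAI-PMH harvesting loop. This is not the PAIN implementation, which is not reproduced in this document; the function names `fetch` and `harvest` and the `fetch_page` parameter are hypothetical. The `ListRecords` verb, the `metadataPrefix` argument and the `resumptionToken` element come from the OAI-PMH 2.0 specification. Note that invalid XML from a data provider makes `ET.fromstring` raise an exception, which is one of the points at which technical intervention becomes necessary.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def fetch(base_url, params):
    """Issue one OAI-PMH request over HTTP and return the raw XML bytes."""
    with urlopen(base_url + "?" + urlencode(params), timeout=60) as response:
        return response.read()


def harvest(base_url, fetch_page=fetch, metadata_prefix="oai_dc"):
    """Yield every <record> element, following resumption tokens page by page.

    fetch_page is injectable (a hypothetical design choice) so that the loop
    can be exercised against canned responses without network access.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        # Raises xml.etree.ElementTree.ParseError if the response is not valid XML.
        root = ET.fromstring(fetch_page(base_url, params))
        error = root.find(OAI_NS + "error")
        if error is not None:
            raise RuntimeError(f"OAI-PMH error {error.get('code')}: {error.text}")
        yield from root.iter(OAI_NS + "record")
        token = root.find(f"{OAI_NS}ListRecords/{OAI_NS}resumptionToken")
        if token is None or not (token.text or "").strip():
            break  # a missing or empty token marks the final page
        # Follow-up requests carry only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
```

A caller would iterate over `harvest("http://repository.example.org/oai")` (a placeholder URL) and pass each record on to normalisation and indexing. A production harvester would also need retry logic, incremental harvesting by date, and tolerance of malformed responses, none of which is attempted here.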
Automating the harvesting process via PAIN proved to be a considerable challenge for the following reasons:
Despite investigating a number of third party harvesting tools and investing effort into automating the harvesting process, this part of the project proved difficult. Automatic harvesting was implemented for only a small number of PerX targets and this was achieved in the latter stages of the project. Basic automatic harvesting was enabled for five targets (ARC, GROW, Inderscience, Oxford Journals and SearchLT) which permitted these targets to be reharvested on a weekly basis.
A detailed record of the effort required to setup and maintain OAI-PMH targets is provided in Appendix E.
Target Analysis & Setup. Twenty-seven OAI targets were selected for inclusion in PerX. In total, approximately 231 hours were spent on the analysis and initial setup of these targets, an average of 8.5 hours per target. It is worth pointing out that a small number of OAI targets required a high level of setup effort, thus inflating the average figure. In addition, not all selected targets were successfully included in PerX due to various issues. If this is taken into consideration, the time spent per successfully added target rises to 10.5 hours.
Target Maintenance. Quantification of the maintenance effort required for OAI targets was problematic for the following reasons:
Bearing these limitations in mind, the approximate figure recorded by PerX staff averaged out at 3.4 hours per target. While this figure appears low, it is worth noting that in the PerX project emphasis was placed on achieving a ‘critical mass’ of engineering targets rather than ensuring all existing targets were up to date. Most targets were therefore harvested infrequently and clearly this would be inappropriate for live service provision.
Examining the details of the recorded OAI setup and maintenance effort (Appendix E) reveals a number of challenges for OAI service providers:
The OAI-PMH Dublin Core approach, which is perhaps 'low barrier' from the perspective of data providers, creates substantial challenges from the perspective of service providers. The high level of flexibility in both the OAI-PMH specification and Dublin Core creates a series of obstacles for service providers: constant and ongoing consultation with data providers is needed, the harvesting process is difficult to automate successfully, and a high level of expert human intervention is ultimately required. As Brogan (2006) notes "OAI-PMH's simplicity may translate into myriad problems for harvesters, especially in cases where data providers do not implement some of the 'optional features' that are most helpful to building aggregations. Consequently, service providers are often confronted by inconsistent, insufficient, or incomplete data that limit their ability to build meaningful aggregations for end-users".
OAI-PMH allows multiple forms of metadata to be exposed but mandates oai_dc as a minimum requirement. While some of the metadata challenges which are evident could potentially be addressed via the exchange of more tightly constrained metadata formats, it is clear that many data providers have considerable difficulty dealing effectively with oai_dc (and perhaps the same could also be said for service providers). The expectation of the provision of richer formats is simply unrealistic in many cases. In the twenty-seven OAI-PMH repositories investigated for PerX, only five provided alternative metadata formats in addition to oai_dc.
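Whether a repository offers anything beyond oai_dc can be checked with the OAI-PMH `ListMetadataFormats` verb. The following is a minimal sketch of such a check, not the procedure PerX actually used; the function name `alternative_formats` is hypothetical, and the sample response is made up for illustration rather than taken from any PerX target.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def alternative_formats(list_formats_xml):
    """Return the metadata prefixes a repository offers besides oai_dc."""
    root = ET.fromstring(list_formats_xml)
    prefixes = [el.text.strip()
                for el in root.iter(OAI_NS + "metadataPrefix") if el.text]
    return sorted(p for p in prefixes if p != "oai_dc")


# Made-up ListMetadataFormats response, trimmed to the elements used here.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListMetadataFormats>
    <metadataFormat><metadataPrefix>oai_dc</metadataPrefix></metadataFormat>
    <metadataFormat><metadataPrefix>marc21</metadataPrefix></metadataFormat>
  </ListMetadataFormats>
</OAI-PMH>"""

print(alternative_formats(SAMPLE))  # ['marc21']
```

An empty list from a repository means that oai_dc is the only format on offer.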
Z39.50 is a client/server protocol for searching and retrieving information from remote computer databases. It specifies procedures and formats for a client to search a database provided by a server, retrieve database records, and perform a number of related information retrieval functions such as sort and browse. Remote service providers, such as portals or aggregators, can query a Z39.50 interface and receive back search results in a standard reusable format. The metadata transferred in a Z39.50 search is not usually stored, and is used only transiently for the duration of a single search. Thus every search request generates a new query to the data provider database and the transfer of search results.
A detailed record of the effort required to setup and maintain Z39.50 targets is provided in Appendix F.
Target Analysis & Setup. Fourteen Z39.50 targets were selected for inclusion in PerX. In total, approximately 107 hours were spent on the analysis and initial setup of these targets, an average of 7.6 hours per target. This figure is slightly lower than the figure for analysis and setup of OAI targets (8.5 hours per target).
Target Maintenance. Quantification of maintenance effort for Z39.50 targets was relatively straightforward. In total, the 14 targets required 20 hours of maintenance, an average of 1.4 hours per target.
Examining the details of the recorded Z39.50 setup and maintenance effort (Appendix F) reveals a number of issues which required effort from the PerX team:
In addition to these problems, it was evident that Z39.50 targets occasionally experienced transient downtime, with the result that they failed to function in the PerX pilot service. Often this involved no actual maintenance for PerX personnel, as these services simply came back up spontaneously. However, even in these cases, transient downtime does result in quality of service issues, as end users simply see that things are not working in the cross-search interface. In an attempt to inform PerX end users more clearly of these ‘transient problems’, warning messages and icons were utilised as follows:
To alert PerX service administrators of Z target downtime, a cron job was scheduled to automate testing of all Z targets and email the administrators when servers were not responding.
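The PerX monitoring script itself is not reproduced in this document. As a rough sketch of the idea, the check below simply tests whether each target accepts a TCP connection and prepares an alert email if any do not. This is an assumption and a coarse substitute: a real check would issue a Z39.50 Init request through a Z39.50 client library, since a server can accept connections while failing to answer searches. The function names and the `(name, host, port)` tuple layout are hypothetical.

```python
import socket
from email.message import EmailMessage


def is_reachable(host, port, timeout=10.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def build_alert(targets, sender, recipient):
    """Return an EmailMessage naming unreachable targets, or None if all respond.

    targets is a list of (name, host, port) tuples (hypothetical layout).
    """
    down = [f"{name} ({host}:{port})" for name, host, port in targets
            if not is_reachable(host, port)]
    if not down:
        return None
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = f"{len(down)} Z39.50 target(s) not responding"
    message.set_content("\n".join(down))
    return message
```

A script run from cron would call `build_alert` with the configured targets and, if a message is returned, hand it to a mail server, for example with `smtplib.SMTP(...).send_message(message)`.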
“The beauty of exposing metadata in a standard way is that little effort is required for third parties to reuse your metadata, and make it available to their visitors. Standard metadata is therefore an investment in current and future interoperability”
While the focus of the PerX cross-search has always been on standard interoperable sources, it is clear that in practice some potential content providers are unable or unwilling to adopt this approach. Experience from the PerX advocacy work with a number of content providers indicates that many already share metadata with business partners via established proprietary means (e.g. FTP, XML gateways, custom XML schemas, etc). Persuading such players to invest in further mechanisms for standardised metadata exchange often proves to be an uphill struggle. While the concept of standardised metadata exchange sounds attractive, many already have real world solutions in place and are unwilling to expend what is seen to be additional effort to achieve 'standardised interoperability'. Often, PerX advocacy work was well received but ultimately resulted in content providers offering access to their metadata via non-standard means.
A small number of non standard targets with strong engineering content were therefore set up and maintained throughout the project. These included:
A record of the effort required to setup and maintain non standard targets is provided in Appendix G.
Target Analysis & Setup. Four non standard targets were selected for inclusion in PerX. In total, approximately 38 hours were spent on the analysis and initial setup of these targets, an average of 9.5 hours per target. This figure is higher than the figures for analysis and setup of both OAI targets (8.5 hours per target) and Z39.50 targets (7.6 hours per target).
Target Maintenance. In total, the 4 non standard targets required 6 hours of maintenance, an average of 1.5 hours per target. Automatic updating of non standard targets was not achieved. Maintenance of the ICE target was not necessary due to the historical nature of the collection, and the Pearson target was manually updated four times via FTP. Plans to update the Emerald Engineering Journals collection via RSS are outstanding at the time of writing.
No specific challenges relating to non standard targets were identified. Generally speaking, slightly more setup time was required for these targets and maintenance effort varied depending upon the approach taken.
Table 1 summarises the overall set-up and maintenance effort for the 12 month period where details were recorded.
(*Note: OAI-PMH targets were infrequently maintained during the project, see Section 2)
For service providers such as PerX, the effort required for set-up and maintenance of targets is therefore not insubstantial. The total figure of approximately 494 hours over a 12 month period, solely for set-up and maintenance, is likely to be a conservative estimate as some issues have undoubtedly been missed in the recording process. A relatively limited number of targets have been utilised for the pilot – many others are potentially available. Inclusion of more and more targets clearly escalates the set-up and maintenance effort required, which needs to be borne in mind in relation to the possible funding of actual cross-search services. It is also worth noting that the PerX technical team had considerable prior experience of both Z39.50 and OAI-PMH standards and that the recorded effort does not include identification of suitable targets or software development time. The high investment in staff time required perhaps goes some way to explaining the relative lack of similar subject based services which are currently in existence.
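The total of approximately 494 hours can be cross-checked against the per-type figures quoted in Sections 2 to 4. The short tally below is a sketch for that purpose only. The total OAI-PMH maintenance time is not stated in the text; the value of 92 hours used here is an assumption, back-calculated from the reported average of 3.4 hours across 27 targets. The recomputed OAI-PMH setup average comes out marginally above the quoted 8.5 hours (about 8.56), which is unsurprising given that all recorded times are approximate.

```python
# (number of targets, setup hours, maintenance hours) per target type.
# The OAI-PMH maintenance total of 92 hours is an assumption derived from
# 27 targets x 3.4 hours; the other figures are quoted in the text.
EFFORT = {
    "OAI-PMH": (27, 231, 92),
    "Z39.50": (14, 107, 20),
    "Non-standard": (4, 38, 6),
}


def per_target_averages(effort):
    """Return {type: (average setup hours, average maintenance hours)}."""
    return {
        name: (round(setup / targets, 1), round(maintenance / targets, 1))
        for name, (targets, setup, maintenance) in effort.items()
    }


def total_hours(effort):
    """Return setup plus maintenance hours summed over all target types."""
    return sum(setup + maintenance for _, setup, maintenance in effort.values())


print(per_target_averages(EFFORT))
print(total_hours(EFFORT))  # 494
```

On these figures, setup accounts for 376 of the 494 hours and maintenance for the remaining 118.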
An array of challenges exist for services which aim to provide subject based cross-search services utilising harvested data, distributed searching and non standard means. Furthermore, these challenges are different depending upon the means by which targets are cross searched.
Including different types of targets (i.e. a hybrid cross-search approach) clearly benefits end users, who gain access to a wider range of content sources. Establishment of a 'critical mass' of relevant subject based materials is likely to be an important aspect in the creation of viable cross-search services, as discussed in the PerX Engineering Landscape Analysis. However, from a service provider's viewpoint, each target type brings its own array of challenges which must be addressed in order to provide a reasonable level of service. In addition to the issues raised above, such a hybrid approach clearly raises issues for service providers relating to the subsequent search functionality which can be offered to end users. Perhaps the most obvious example relates to the fact that harvested data is held locally whereas distributed search data is not. The harvested data from multiple sources can be easily combined and manipulated, enabling functionality such as merging of result sets and relevance ranking. This, although not impossible, is much more difficult to achieve with distributed search targets and is a potential drawback of the hybrid cross-search approach.
There are many limitations with the OAI-PMH approach as noted in section 2.5. Possible approaches to these challenges include:
1. Recognise the limitations of OAI-PMH and continue to use it 'as is', accepting the need for a high level of human investment. While subject based services such as the PerX pilot appear to have merits and have received substantial praise from end users, whether this approach is sustainable in the long term is open to question. The perennial question is whether such subject based services can provide a real alternative to freely available web search engines such as Google or Google Scholar. One PerX user commented "Can this service better Google? If not you will struggle to make it worthwhile". While such comparisons are inevitable, they are perhaps somewhat misleading and unhelpful. Google and its ilk are firmly established as the first port of call for the vast majority of web users. As such they are now an integral part of people's online information seeking behaviour for any query in any area, ranging from academic endeavour to personal pursuits, leisure, news and beyond. Subject based cross-search services can at best provide niche services to a relatively small user base. They can never aspire to be an integral part of everyday online usage, but may instead provide a useful additional resource discovery tool in certain circumstances.
2. Attempt to address some of the OAI-PMH issues. Shreeves et al (2005) suggest that "If OAI is going to mature into its full potential collections as well as item records need further development, and we need richer mechanisms of creating dialog between harvesters and providers". For example, application profiles could be used to tighten up the metadata that is exchanged via OAI-PMH. The ePrints DC Application Profile has been created for use by Institutional Repositories, and JISC has funded similar application profiles for still images and time based media and is scoping further application profiles for learning materials. Developing such application profiles and 'best practices' for the provision of metadata via OAI-PMH may help to alleviate some of the issues service providers have to overcome. However, while these are interesting developments, it is likely that considerable effort will be required to encourage the uptake and use of these profiles by Data Providers. Other OAI-PMH problems, relating to the high flexibility in the OAI-PMH specification itself, are perhaps even more problematic to address retrospectively. Overall, whilst it seems plausible that OAI-PMH may work well in the future in particular niche areas such as Institutional Repositories, it seems likely to remain an effort-intensive option for subject based services which aim to aggregate content of many varied types.
3. Abandon OAI-PMH harvesting as a practical option and invest effort in other approaches to provide subject based cross searching. Alternative options include distributed searching (e.g. Z39.50, SRU/SRW) or automated web crawling and indexing.
While distributed searching approaches have some advantages (e.g. lower maintenance effort) they also present their own challenges (see section 3.3). In addition, the PerX Engineering Digital Repositories Landscape Analysis has indicated that, in contrast to OAI-PMH, very few SRU/SRW targets currently exist. Automated web crawling approaches are certainly worthy of further investigation. As indicated by Lagoze et al (2006) there is a recognition "that the future of collection development in the NSDL relies upon deploying these technologies [automated web crawling and indexing] as a supplement and, in many cases, a replacement for the harvesting model".
While OAI-PMH undoubtedly creates many challenges, Brogan (2006) asserts that there is "little doubt that the Open Archives Initiative Protocol for Metadata Harvesting has witnessed remarkable international adoption and growth since 2003...More than 1,000 OAI compliant archives are active across at least 46 countries with an estimated seven million links to full digital object representations...Adoption is likely to accelerate as more countries view OAI implementation as a fast-track to bringing increased visibility to indigenous scholarship." Bearing in mind that a large and growing corpus of metadata is currently available via OAI-PMH, it remains likely that potential service providers will continue to utilise this approach, 'warts and all'.
4. Engage in the development of emerging specifications aimed at exchanging digital objects. "There is growing awareness of the limitations of OAI-PMH and the Dublin Core metadata standard that underpin much of the current repository activity along with a call to develop a model and mechanism to handle complex objects held in repositories...in a more fully automated and interoperable way" (Brogan 2006). Initiatives have recently begun to emerge to address these limitations and explore possible solutions. In the UK a Common Repositories Interfaces Working Group has been established to identify interfaces critical to the development of interoperable networks of repositories. In the US the OAI-ORE Project (Object Reuse and Exchange) is a recently funded initiative to develop specifications that allow distributed repositories to exchange information about their constituent digital objects. It is envisaged that the ORE framework will "permit fluid reuse, refactoring, and aggregation of scholarly digital objects and their constituent parts - including text, images, data, and software. This framework would include new forms of citation, allow the creation of virtual collections of objects regardless of their location, and facilitate new workflows that add value to scholarly objects by distributed registration, certification, peer review, and preservation services." While these various initiatives are undoubtedly promising, they are in their early stages, with concrete specifications likely to be some years away.