Santiago's Page - ELF

Collecting and searching OAI repositories

Collecting data from OAI data providers involves metadata harvesting, metadata normalization and metadata enhancement. These three processes use different software components that are run separately, but in a coordinated way, from the PAIN admin interface. Please refer to the paper "Overcoming the obstacles of harvesting and searching digital repositories from federated searching toolkits, and embedding them in VLEs" for more detailed information on the issues found when harvesting and using OAI metadata.

Searching with PerX does not require a relational database. Instead, the open source software Lucene is used to index and search the OAI metadata. A MySQL database is required only for search session management and for displaying full records.

Metadata harvesting

After trying several open source OAI harvesters, we decided to develop a software tool capable of harvesting any type of metadata (**). Currently, purely for convenience, the PerX OAI harvester only harvests records formatted with the Dublin Core (DC) metadata schema. The PerX OAI-PMH harvester is included in the admin interface (PAIN) and needs to be run from a browser, with minimal human involvement. It can be adapted to deal with low or high volumes of data. It also allows harvesting by sets. The PerX harvester will always try to complete the harvesting of a repository, regardless of whether the XML involved is valid or invalid, well-formed or not. The harvester displays logging information on-screen about the harvesting processes to help with maintenance and debugging. It also logs relevant information to a MySQL database.

(**) At the start of the project, we considered Old Dominion's ARC OAI-PMH harvester, because it was already included in the SPP software. After finding a number of issues with the SPP code, we eventually tried the PKP Harvester. However, as of March 2006, the PKP software did not support important features of the OAI protocol, such as OAI flow control or harvesting by OAI sets (without flow control, it is impossible to harvest from archives like arXiv, and without sets, we would have to harvest all of the Oxford database). In addition, the PKP mechanism for dealing with errors was poor (which made it difficult to diagnose problems) and it was not suitable for harvesting large databases such as the NASA repository. That is why we decided to develop a new harvester.
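The two protocol features mentioned in the note above, flow control and harvesting by sets, can be sketched as follows. This is a minimal illustration in Python and not the PerX harvester itself: the function names, the injectable fetch parameter and the error handling are assumptions made for the example. Only the request arguments (verb, metadataPrefix, set, resumptionToken) and the XML namespace come from the OAI-PMH protocol.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def http_fetch(url):
    """Default fetcher (hypothetical helper): return the response body as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def harvest(base_url, set_spec=None, fetch=http_fetch):
    """Yield every <record> element, following resumption tokens (flow control)."""
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    if set_spec:
        params["set"] = set_spec  # selective harvesting by OAI set
    while True:
        xml_text = fetch(base_url + "?" + urlencode(params))
        try:
            root = ET.fromstring(xml_text)
        except ET.ParseError as err:
            # Simplification: this sketch stops at a response it cannot parse.
            print("stopping at unparsable response:", err)
            return
        for record in root.iter(OAI_NS + "record"):
            yield record
        token = root.find(f"{OAI_NS}ListRecords/{OAI_NS}resumptionToken")
        if token is None or not (token.text or "").strip():
            return  # no token, or an empty one: the list is complete
        # A resumption request carries only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
```

Calling list(harvest(base_url, set_spec="physics")) against a real repository would issue one request per partial list; the set name here is only a placeholder. Unlike the PerX harvester described above, this sketch gives up on XML that is not well-formed instead of trying to continue.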

Metadata normalization

Metadata retrieved from OAI repositories needs to be made consistent and mapped to the single, common XML structure used by the PerX MetaSearch Engine to render the search results. As we have observed, the data harvested from OAI data providers cannot be relied upon. Without a normalization process, our federated searching service is exposed to all sorts of malfunctions. The job of the PerX Metadata Normalizer is to stop a wide variety and a significant number of errors from propagating from OAI data providers to the PerX database. The normalizer is integrated in and used from PAIN.
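As an illustration of what such a normalization step might look like, the sketch below maps a Dublin Core record to one fixed structure. It is written in Python under stated assumptions: the output keys (title, creators, url, dates), the "[untitled]" placeholder and the function names are hypothetical and do not describe the actual PerX XML structure. Only the Dublin Core namespace is standard.

```python
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"


def _values(root, name):
    """Collect the non-empty, whitespace-collapsed, de-duplicated values of a DC element."""
    out = []
    for element in root.iter(DC + name):
        text = " ".join((element.text or "").split())
        if text and text not in out:
            out.append(text)
    return out


def normalize(dc_xml):
    """Map one Dublin Core record to a fixed dictionary (hypothetical target structure)."""
    root = ET.fromstring(dc_xml)
    titles = _values(root, "title")
    identifiers = _values(root, "identifier")
    # Treat the first identifier that looks like a web address as the record URL.
    urls = [i for i in identifiers if i.lower().startswith(("http://", "https://"))]
    return {
        "title": titles[0] if titles else "[untitled]",
        "creators": _values(root, "creator"),
        "url": urls[0] if urls else None,
        "dates": _values(root, "date"),
    }
```

The point of the design is that every record leaves this step with the same keys, whatever was missing, repeated or padded with stray whitespace in the source, so the rendering code never has to guess.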

Metadata enhancement

The PerX Metadata Enhancer can enhance the metadata harvested from OAI repositories in different ways to fulfil PerX requirements. For example, it adds fields (e.g. the electronic document type), completes fields (e.g. the full bibliographic reference), "cleans" fields (e.g. removes vCard tags), groups fields (e.g. description with notes) or splits fields (e.g. authors' names into their constituent parts for OpenURL construction). The enhancer is integrated in and used from PAIN.
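The field-splitting case can be sketched as follows. This is a simplified Python illustration, not the PerX enhancer: the function names and the splitting rules are assumptions, and real author strings are far messier than the two patterns handled here. The keys aulast and aufirst are the author keys commonly used in OpenURL queries.

```python
from urllib.parse import urlencode


def split_author(name):
    """Split 'Last, First' or 'First Last' into OpenURL-style author parts."""
    name = " ".join(name.split())  # collapse stray whitespace
    if "," in name:
        last, first = (part.strip() for part in name.split(",", 1))
    else:
        parts = name.rsplit(" ", 1)
        first, last = (parts[0], parts[1]) if len(parts) == 2 else ("", parts[0])
    return {"aulast": last, "aufirst": first}


def openurl_author_query(name):
    """Build the author part of an OpenURL query string, skipping empty parts."""
    parts = split_author(name)
    return urlencode({key: value for key, value in parts.items() if value})
```

For example, openurl_author_query("Smith, John") produces a query fragment with separate last-name and first-name keys, which a link resolver can match more reliably than a single free-text author string.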

Metadata indexing

The harvested metadata is indexed by the Lucene software. We only index/classify data by unique ID, title and URL. The records in the Lucene index files are mapped to MySQL tables containing the full metadata. For full information on indexing XML files with Lucene, please visit the Lucene web page.
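The split between a small search index and a separate store of full records can be sketched as follows. Note the substitutions: this Python example uses a plain list in place of the Lucene index and an in-memory SQLite database in place of MySQL, purely so that it is self-contained. The table, column and function names are hypothetical.

```python
import sqlite3


def build_index(records):
    """Return (index, db): brief index entries plus a table of full records."""
    db = sqlite3.connect(":memory:")  # stand-in for the MySQL database
    db.execute("CREATE TABLE full_records (uid TEXT PRIMARY KEY, metadata TEXT)")
    index = []  # stand-in for the Lucene index: unique ID, title and URL only
    for record in records:
        db.execute(
            "INSERT INTO full_records (uid, metadata) VALUES (?, ?)",
            (record["uid"], record["metadata"]),
        )
        index.append(
            {"uid": record["uid"], "title": record["title"], "url": record["url"]}
        )
    db.commit()
    return index, db


def search(index, term):
    """Return the brief entries whose title contains the term (case-insensitive)."""
    term = term.lower()
    return [entry for entry in index if term in entry["title"].lower()]


def full_record(db, uid):
    """Look up the full metadata for one index entry by its unique ID."""
    row = db.execute(
        "SELECT metadata FROM full_records WHERE uid = ?", (uid,)
    ).fetchone()
    return row[0] if row else None
```

A search touches only the brief entries; the full metadata is fetched by unique ID when a user asks to display a complete record.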

Metadata searching

For OAI targets, the search capabilities of the PerX toolkit are basically the same as those provided by Lucene. When a user queries an OAI database, the only function of the toolkit is to receive, format and pass the query to Lucene. Lucene supports a wide array of searches, including AND, OR and NOT, fuzzy searches, proximity searches, wildcard searches and range searches. Please take a look at Apache Lucene - Query Parser Syntax for a full description of the search options supported by PerX via Lucene.
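To make the search types listed above concrete, the following Python sketch builds query strings in the Lucene query parser syntax. The helper functions are hypothetical and are not part of PerX or Lucene. The field names used in the usage note (title, date) are assumptions about how an index might be laid out.

```python
def phrase(text):
    """Quote a phrase, escaping embedded double quotes."""
    return '"' + text.replace('"', '\\"') + '"'


def fuzzy(term):
    """Fuzzy search: a single term followed by a tilde."""
    return term + "~"


def proximity(text, distance):
    """Proximity search: the words of a phrase within `distance` words of each other."""
    return f"{phrase(text)}~{distance}"


def field_range(field, low, high):
    """Inclusive range search on a field."""
    return f"{field}:[{low} TO {high}]"


def all_of(*clauses):
    """Combine clauses with the boolean AND operator."""
    return " AND ".join(clauses)
```

For instance, all_of("title:harvest*", fuzzy("repositry")) combines a wildcard search on a title field with a fuzzy match that tolerates the misspelling, and proximity("metadata harvesting", 5) matches the two words when they occur within five words of each other.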

Automatic re-harvesting tool implementation investigation

At the end of the PerX Project, we have no choice but to accept that fully automatic re-harvesting of OAI repositories is still a research topic. The reasons have been presented and discussed at length in the above paper, as well as in the PerX project deliverables.