Santiago's Page - PerX - Z39.50 Implementation


1. Building queries for searching Z39.50 targets

Z39.50 supports several types of query formats, including CQL and even SQL. However, the most common form of query is Reverse Polish Notation (RPN). When it is sent across the wire, it is encoded into the structure of the request rather than as a string, and it seems that there has never been an official string representation. Prefix Query Format (PQF), also known as Prefix Query Notation (PQN), has nevertheless become the de facto standard for string encoding. YAZ, the software API used by PerX for querying Z39.50 targets, uses PQF/PQN for constructing the queries sent to Z39.50 servers.

Below there is first a quick introduction to the basic concepts and then some examples of building PQF queries, taken from the algorithms used by the PerX MetaSearch Engine.

Introduction

Firstly, RPN does not have a concept of an 'index', 'table', 'column' or any physical representation of a collection of data. Instead it abstracts this into a vector of 'attributes'. Each attribute has a numeric identifier. These attributes are collected together in 'attribute sets'. (This is where CQL context sets have been derived from.)

The most commonly supported attribute set is BIB1 -- the bibliographic attribute set version 1. BIB1 has 6 attribute types:

1 Use: the main semantics of the content, e.g. title, author, date
2 Relation: how to compare the term, e.g. equal, less than, within
3 Position: the position of the term in the field, e.g. first, anywhere
4 Structure: the structure of the term, e.g. date, phrase, string, numeric
5 Truncation: how to truncate, e.g. right truncation, regexp, none
6 Completeness: complete field versus incomplete

As per the examples, each type has several values, the most populated being, unsurprisingly, the Use attributes. There is a very long list of Use attributes available, in theory, on the Bib1 reference page. Common ones are:

1 Personal Name
4 Title
7 ISBN
8 ISSN
12 Local Number
21 Subject
30 Date
1003 Author
1010 Body of Text
1016 Any
1018 Publisher
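
The two lists above can be kept as simple lookup tables when queries are built in code. The following is a minimal Python sketch, not the PerX implementation: the names ATTRIBUTE_TYPES, USE and pqf_clause are hypothetical, and the sketch assumes the PQF form "@attr type=value" used by YAZ and in the examples further down.

```python
# Bib-1 attribute types, numbered as they appear after "@attr" in PQF.
ATTRIBUTE_TYPES = {
    "use": 1,
    "relation": 2,
    "position": 3,
    "structure": 4,
    "truncation": 5,
    "completeness": 6,
}

# A few Bib-1 Use attribute values, copied from the list above.
USE = {
    "personal_name": 1,
    "title": 4,
    "isbn": 7,
    "issn": 8,
    "local_number": 12,
    "subject": 21,
    "date": 30,
    "author": 1003,
    "body_of_text": 1010,
    "any": 1016,
    "publisher": 1018,
}


def pqf_clause(term, **attributes):
    """Return one PQF clause, e.g. '@attr 1=4 @attr 2=3 "XML"'.

    Keyword arguments name an attribute type and give its numeric value.
    They are emitted in the order in which they were passed.
    """
    if '"' in term:
        # Escaping rules are left out of this sketch on purpose.
        raise ValueError("terms containing double quotes are not handled")
    parts = [
        f"@attr {ATTRIBUTE_TYPES[name]}={value}"
        for name, value in attributes.items()
    ]
    parts.append(f'"{term}"')
    return " ".join(parts)


if __name__ == "__main__":
    # Use = Title (4), Relation = equal (3)
    print(pqf_clause("XML", use=USE["title"], relation=3))
```

An unknown attribute type name raises KeyError, which is usually preferable to sending a malformed query to a target.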

Examples

To start, let's put everything together into a simple query using PQF.
A search query for the word XML in the title field
could be written as:
@attr 1=4 @attr 2=3 "XML"

Clauses as above can be linked with a prefixed boolean operator.
Thus, title = xml and author = sanderson
could be written as:
@and @attr 1=4 @attr 2=3 "XML" @attr 1=1003 @attr 2=3 "Sanderson"

Available booleans, with the expected semantics, are: @and, @or, @not. Each takes exactly two operands, so @not means "and not".
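
Because every boolean operator is binary and prefixed, joining more than two clauses means nesting the operator. A small Python sketch of that idea follows; the function name combine is hypothetical and is not part of YAZ or PerX.

```python
def combine(operator, clauses):
    """Join PQF clauses with a binary prefix operator.

    operator is 'and', 'or' or 'not'. With more than two clauses the
    result is nested to the left: combine('and', [a, b, c]) gives
    '@and @and a b c', i.e. ((a and b) and c).
    """
    if operator not in ("and", "or", "not"):
        raise ValueError(f"unknown boolean operator: {operator}")
    if not clauses:
        raise ValueError("at least one clause is required")
    query = clauses[0]
    for clause in clauses[1:]:
        query = f"@{operator} {query} {clause}"
    return query


if __name__ == "__main__":
    title = '@attr 1=4 @attr 2=3 "XML"'
    author = '@attr 1=1003 @attr 2=3 "Sanderson"'
    print(combine("and", [title, author]))
```

A single clause is returned unchanged, so the same function can be used whether the user filled in one search field or several.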

The above queries let the server decide whether the query is a keyword or an exact search, amongst other possibilities. To specify this, we need to add another attribute, the Structure attribute (type 4), into the vector. Thus
@attr 1=4 @attr 2=3 @attr 4=2 "xml"
is a keyword search.

In theory, that's all that is necessary to know for bibliographic searching using Z39.50. However, for a stable and reliable service we cannot work with such a limited syntax and assume that the default settings of all Z39.50 servers will be right for us. Below we have included some slightly more advanced queries that use less common facilities, in the hope that they will be useful to implementers dealing with Z39.50.

When a server completes a search, it creates a named result set containing references to the matched records. This result set may be referenced in a later query, for example to merge result sets. Using PQF:
@and @set resultSet1 @set resultSet2
would perform the intersection of resultSet1 and resultSet2.

More advanced sample queries include the groupings of Boolean operators and combinations of attribute types, for example:
@attrset bib-1 @or @and @attr 1=4 @attr 2=3 @attr 3=1 @attr 6=1 "XML with Java" @attr 1=1016 @attr 2=104 @attr 3=3 @attr 6=1 "Morrinson" @and @attr 1=62 @attr 2=104 @attr 3=3 "XML" @attr 1=62 @attr 2=104 @attr 3=3 "Java"
which will search for records whose title is "XML with Java" AND which contain "Morrinson" in any field (Use attribute 1016 is Any; an author search would use 1003), OR for records with the keywords XML AND Java in the abstract.
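
Long prefix queries like the one above are easier to check once they are read back into a tree. The following Python sketch parses the subset of PQF used on this page (@attrset, @attr, @set and the three booleans). It is only an illustration under those assumptions, not a full PQF parser and not the YAZ one; parse_pqf and the tuple layout are hypothetical.

```python
import shlex

BOOLEANS = {"@and", "@or", "@not"}


def _parse(tokens):
    """Consume one query node from the front of tokens and return it."""
    token = tokens.pop(0)
    if token == "@attrset":
        tokens.pop(0)  # skip the attribute set name, e.g. bib-1
        return _parse(tokens)
    if token in BOOLEANS:
        left = _parse(tokens)
        right = _parse(tokens)
        return (token[1:], left, right)
    if token == "@set":
        return ("set", tokens.pop(0))
    attributes = []
    while token == "@attr":
        attributes.append(tokens.pop(0))  # e.g. "1=4"
        token = tokens.pop(0)
    return ("term", attributes, token)


def parse_pqf(query):
    """Parse a PQF string into nested tuples.

    Nodes are ('and' | 'or' | 'not', left, right), ('set', name) or
    ('term', [attribute strings], term).
    """
    tokens = shlex.split(query)  # keeps "quoted phrases" as one token
    tree = _parse(tokens)
    if tokens:
        raise ValueError(f"unexpected trailing tokens: {tokens}")
    return tree


if __name__ == "__main__":
    query = (
        '@attrset bib-1 @or '
        '@and @attr 1=4 @attr 2=3 @attr 3=1 @attr 6=1 "XML with Java" '
        '@attr 1=1016 @attr 2=104 @attr 3=3 @attr 6=1 "Morrinson" '
        '@and @attr 1=62 @attr 2=104 @attr 3=3 "XML" '
        '@attr 1=62 @attr 2=104 @attr 3=3 "Java"'
    )
    operator, left, right = parse_pqf(query)
    print(operator)  # the top-level boolean
    print(left)      # the title / "Morrinson" branch
    print(right)     # the abstract keywords branch
```

The tree makes the grouping explicit: the top-level @or has two @and branches, each with two terms.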

While BIB1 is the most common attribute set, there are others. For example a more advanced specification is the Attribute Architecture which adds further dimensionality to the request, including notions of semantic and functional qualifiers. An Author might be a name, qualified with 'personal' and 'creator'. Attributes within a single clause may come from different attribute sets, so you might see:

@attrset XD @attr 1=3 @attrset BIB2 @attr 2=3 @attrset UTIL @attr 12=2 Rob
The three attributes express name, personal and creation, in this order.

While the Attribute Architecture is technically superior, only the most advanced Z39.50 implementations actively support it, which does not include any (to my knowledge) of the commonly used library systems. None of the Z39.50 targets of interest to PerX are using the Attribute Architecture specification.

One thing to take into account is proximity searching. Boolean searches are all very well, but if you want to find two keywords which are near each other but not adjacent, then you need to use proximity. PQF's representation of proximity leaves a lot to be desired in terms of expressiveness. Below is a simplified example showing where the @prox operator goes; a complete query must also give the proximity parameters (distance, ordering and so on) after @prox. We have not used proximity in PerX.
@or bridge concrete @prox construction
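
For reference, in the PQF grammar documented for YAZ the @prox operator is followed by six parameters (exclusion, distance, ordered, relation, which-code, unit-code) and then its two operands. The Python sketch below builds such a clause; the function name prox and its defaults are hypothetical choices, and PerX itself does not issue proximity queries.

```python
def prox(left, right, distance, ordered=False, exclusion=False,
         relation=2, unit=2):
    """Build a PQF proximity clause.

    Layout: @prox exclusion distance ordered relation which-code unit-code
    followed by the two operands.
    relation 2 means "less than or equal"; unit 2 means "word".
    The which-code is fixed to k (known, i.e. standard unit codes).
    """
    return (
        f"@prox {int(exclusion)} {distance} {int(ordered)} "
        f"{relation} k {unit} {left} {right}"
    )


if __name__ == "__main__":
    # "concrete" within three words of "construction", in any order,
    # or-ed with the plain term "bridge".
    near = prox("concrete", "construction", 3)
    print(f"@or bridge {near}")
```

Whether a given target honours proximity at all has to be checked per server, which is one reason it is easy to leave out of a metasearch engine.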

2. Z39.50 targets error detection study

Each time we query a Z39.50 database, we need to sequentially execute four operations:
- Connection
- Initialisation
- Search
- Retrieval
An error can occur in any of the above operations. Thus, in order to be able to take action as soon as an error happens, we introduced an "error interceptor" mechanism in the code handling Z39.50 queries. The interceptor was able to detect errors in each operation. However, after a number of trials with the interceptor, we noticed the following points:

- The "negotiation of connection" process between the PerX code (YAZ) and the remote Z-server involves an indefinite number n ≥ 0 of unsuccessful connection attempts before the connection is either established, at attempt n+1, or times out. Identifying a suitable value of n a priori, for each Z39.50 database, was impracticable. Sometimes a value of n picked heuristically worked for some searches but did not work for others.
- Once a connection is established, the chances of errors in the subsequent operations are minimal. Initialisation normally does not produce errors if the database has been properly identified and is available on the Z-server during the query. Note that during initialisation the Z-server is in control: whatever may or may not occur between PerX and the Z39.50 target is up to the Z-server. Any optional authentication process can also produce errors here.
- Errors in the search and retrieval operations are mainly produced by network problems (timing out) or loss of connection. These operations can involve an indefinite number of unsuccessful requests between PerX and the Z-server. Again, identifying when to abort the operation and produce an error notification is challenging, for the reasons presented above.

In conclusion, it is not feasible to introduce an error interceptor mechanism to detect errors in "real time" or immediately when an error is detected in one of the four operations involved in a Z39.50 query. What is advisable is to use the "Test Z server" facility, available from the PAIN admin interface, to monitor the status of Z39.50 targets from time to time.
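
As an illustration of what periodic monitoring can look like, the Python sketch below tries to open a TCP connection to a target a bounded number of times and reports which attempt succeeded. It is not the PAIN "Test Z server" facility and it does not perform a Z39.50 Init; it only checks that the port is reachable. The function name probe_target, the host name and the default values are all hypothetical.

```python
import socket
import time


def probe_target(host, port, attempts=3, timeout=5.0, pause=1.0,
                 connect=socket.create_connection, sleep=time.sleep):
    """Return the 1-based attempt number that connected, or None.

    connect and sleep are parameters so that the function can be
    exercised without a network (see the fake connector below).
    """
    for attempt in range(1, attempts + 1):
        try:
            connection = connect((host, port), timeout=timeout)
        except OSError:
            # Refused, unreachable or timed out: wait, then try again.
            if attempt < attempts:
                sleep(pause)
            continue
        connection.close()
        return attempt
    return None


if __name__ == "__main__":
    # Offline demonstration with a fake connector that fails twice.
    class FakeConnection:
        def close(self):
            pass

    failures = [OSError("refused"), OSError("timed out")]

    def flaky_connect(address, timeout):
        if failures:
            raise failures.pop(0)
        return FakeConnection()

    result = probe_target("z3950.example.org", 210, attempts=5,
                          connect=flaky_connect, sleep=lambda s: None)
    print(result)
```

Run from a scheduler, a probe of this kind records which targets were reachable and after how many attempts, without adding delay to user searches.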

3. Further information

- http://www.unt.edu/wmoen/Z3950/GIZMO/section1.htm
- http://eprints.rclis.org/archive/00006394
- http://www.loc.gov/z3950/agency