Registry Working Group


International Virtual Observatory Alliance

An Evaluation of the Open Archives Initiative for VO Registries

Ray Plante
Last modified: Mon Feb 17 16:42:16 2003

Abstract

The Open Archives Initiative is an effort within the digital library community to develop and promote interoperability standards that enable the dissemination of electronic content. At the core of the Initiative is the Protocol for Metadata Harvesting (PMH) standard, through which a data repository can publish metadata about its holdings. While its heritage is in supporting "e-print" repositories, the intended market is clearly broader, and the protocol could be used to publish metadata about any kind of data. In this document, I give an overview of the Harvesting Protocol and evaluate it in terms of VO registry requirements. In conclusion, I make some recommendations about how OAI may be applied to VO registry applications. The OAI PMH does not address all of our requirements; however, it does have some nice features that could be made part of our overall registry solution.

Motivation for looking at the OAI Harvesting Protocol

The potential benefits of adopting the Protocol for Metadata Harvesting (PMH) for use in federating VO resources are as follows:

The reuse of an existing framework with experience supporting real applications will save us time designing and debugging our own.
We can reuse existing software.
We can more easily integrate resources outside our community if they support the same OAI standard.
Other communities can become more aware of astronomical data resources.

In addition to examining the match of the PMH to our requirements, we should consider to what extent these advantages can be realized.

The OAI-PMH Model

The Protocol for Metadata Harvesting (PMH) (Lagoze et al. 2002) is a standard interface that data repositories can implement to expose metadata about their holdings to the outside world. As a standard, public interface, it is open to a variety of applications; however, it is designed in particular to support use by automated agents that collect the metadata to a central site, which in turn can offer custom cross-collection searching (see Example Applications).

In the PMH model metadata are exposed in the form of records, in which each record describes a single data item. In the digital library community, a data item has typically been a book or other hard medium, an electronic paper or abstract, a graphical image, etc. However, the data item need not be so fine-grained. The record could describe a larger data collection, and there is an effort within the OAI community to build applications that handle records at this level. The other important component of the model is the unique identifier associated with each record. PMH requires identifiers to conform with the IETF standard for URIs (RFC2396, Berners-Lee 1998).

The interface definition has much in common with that of the Simple Image Access Protocol (Tody et al. 2002). Both define a related set of web-based services that accept URL-encoded (HTTP GET) queries and return results in XML. In PMH, there are six such services, referred to as verbs, that must be implemented by a compliant repository. Three of them form the core of the interface and are used to retrieve the metadata. ListIdentifiers essentially returns all identifiers that match some coarse-grained criteria. GetRecord returns a single record containing the metadata description of a data item given an identifier. ListRecords is like ListIdentifiers, except that it returns the full records for the given criteria. The remaining verbs aid harvesters in making use of the core verbs.
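As a rough illustration of the URL-encoded request style described above, the following sketch builds GET URLs for PMH verbs. The verb and argument names (verb, identifier, metadataPrefix) come from the PMH v2.0 specification; the base URL, the record identifier, and the helper name pmh_request_url are made up for this example.

```python
from urllib.parse import urlencode

# The six verbs defined by OAI-PMH v2.0.
VERBS = {
    "Identify", "ListMetadataFormats", "ListSets",
    "ListIdentifiers", "ListRecords", "GetRecord",
}


def pmh_request_url(base_url, verb, **arguments):
    """Build the HTTP GET URL for a single PMH request."""
    if verb not in VERBS:
        raise ValueError(f"unknown PMH verb: {verb}")
    query = {"verb": verb}
    query.update(arguments)
    return base_url + "?" + urlencode(query)


# Hypothetical repository base URL and identifier, for illustration only.
BASE = "http://repository.example.org/oai"

print(pmh_request_url(BASE, "Identify"))
print(pmh_request_url(BASE, "GetRecord",
                      identifier="oai:example.org:item-1",
                      metadataPrefix="oai_dc"))
```

A harvester would fetch each URL and parse the XML that comes back; the network step is left out here so that the sketch stays self-contained.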

The ListRecords and ListIdentifiers services support a fairly coarse-grained set of filtering criteria, enabling what the specification refers to as selective harvesting. This includes filtering on the update time, allowing harvesters to retrieve only records updated after a certain date. The other major form of filtering is by Set; sets can be thought of as categories that records can be collected into. Defined by the data provider, sets can group records by collection name, topic, data item type, etc. A record may be associated with multiple sets. The ListSets verb describes the different sets supported by the provider.
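The semantics of selective harvesting can be sketched from the repository's side as a simple filter over datestamps and set membership. This is only an illustration under simplifying assumptions: the records are made up, the function name select is hypothetical, date bounds are treated as inclusive, and sets are treated as flat labels rather than the hierarchical set names the specification allows.

```python
from datetime import date

# Made-up records: (identifier, datestamp of last update, sets it belongs to).
RECORDS = [
    ("oai:example.org:r1", date(2002, 11, 5), {"collections"}),
    ("oai:example.org:r2", date(2003, 1, 20), {"collections", "radio"}),
    ("oai:example.org:r3", date(2003, 2, 10), {"services"}),
]


def select(records, from_date=None, until_date=None, set_spec=None):
    """Return the identifiers that match the selective-harvesting criteria."""
    matches = []
    for identifier, datestamp, sets in records:
        if from_date is not None and datestamp < from_date:
            continue  # last updated before the requested window
        if until_date is not None and datestamp > until_date:
            continue  # last updated after the requested window
        if set_spec is not None and set_spec not in sets:
            continue  # not a member of the requested set
        matches.append(identifier)
    return matches


# Records updated since the start of 2003, then only those in one set.
print(select(RECORDS, from_date=date(2003, 1, 1)))
print(select(RECORDS, from_date=date(2003, 1, 1), set_spec="collections"))
```

Note that nothing finer than update time and set membership can be expressed; any richer criterion has to be applied by the harvester after the records are collected.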

Responses from all of the PMH services are encoded in the PMH XML Schema. For the GetRecord and ListRecords verbs, the response format is essentially an XML envelope around a metadata description in some other text-based format. The data provider can for the most part choose what metadata formats it wishes to support; the supported formats are advertised through the ListMetadataFormats verb. All compliant implementations are required to support the OAI Dublin Core format, however. Another common format supported in the digital library community is the MARC record format. It is expected that other communities will want to define their own metadata formats that support richer metadata descriptions.
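To make the envelope idea concrete, here is a minimal sketch that parses a GetRecord response carrying an OAI Dublin Core description. The three namespace URIs are those used by PMH v2.0 and Dublin Core; the sample response itself (identifier, title, set name) is invented for illustration, and parse_record is a hypothetical helper rather than part of any library.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# A made-up response: the PMH envelope wraps a Dublin Core description.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2003-02-17T16:42:16Z</responseDate>
  <request verb="GetRecord">http://repository.example.org/oai</request>
  <GetRecord>
    <record>
      <header>
        <identifier>oai:example.org:r1</identifier>
        <datestamp>2003-01-20</datestamp>
        <setSpec>collections</setSpec>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example Sky Survey</dc:title>
          <dc:type>collection</dc:type>
        </oai_dc:dc>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>
"""


def parse_record(xml_text):
    """Pull the header fields and the Dublin Core title out of the envelope."""
    root = ET.fromstring(xml_text)
    record = root.find("oai:GetRecord/oai:record", NS)
    header = record.find("oai:header", NS)
    dc = record.find("oai:metadata/oai_dc:dc", NS)
    return {
        "identifier": header.findtext("oai:identifier", namespaces=NS),
        "datestamp": header.findtext("oai:datestamp", namespaces=NS),
        "sets": [e.text for e in header.findall("oai:setSpec", NS)],
        "title": dc.findtext("dc:title", namespaces=NS),
    }


print(parse_record(SAMPLE))
```

A community-defined format would replace only the content of the metadata element; the header and the surrounding envelope would be parsed the same way.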

Rounding out the six basic services is the Identify verb. This describes the data provider as a whole, and gives some extra information about its interface (e.g. time range, compression, deletion mechanism).

The OAI site ( http://www.openarchives.org) maintains a registry of compliant repositories. When a repository administrator registers, she enters the base URL for the interface into a form at the OAI site. The registry will then conduct a series of conformance tests on the implementation. If the interface passes, its Identify response is stored in the registry. The registry periodically retests the interface and removes the repository if the tests ever fail. The contact email that is included in the Identify record allows the registry to return helpful information about non-conformance issues.

Example Applications

In addition to the registry of compliant repositories, OAI also maintains a registry of services that operate on data harvested via PMH ( http://www.openarchives.org/service/listproviders.html). These range from general-purpose interfaces that support searches of metadata from every known PMH-compliant repository to gateways to a subset of repositories that serve a specific community. Most applications are of the typical type that collects metadata to a central site for fast searching; these applications must periodically revisit sites to update the central database. Some applications, however, allow search criteria to be given to a web crawler that will retrieve records and search them on-the-fly.
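The "collect centrally, revisit periodically" pattern might be sketched as below. Everything here is hypothetical: fetch_updates stands in for a ListRecords request carrying a from argument, the site URL and records are invented, and datestamps are plain ISO date strings so that they compare correctly as text.

```python
def harvest(repositories, fetch_updates, last_harvest, now, central_db):
    """Revisit each repository and merge records changed since the last visit.

    fetch_updates(base_url, since) is a hypothetical callable; it returns
    {identifier: record} for records updated on or after `since`
    (None means "everything").
    """
    for base_url in repositories:
        since = last_harvest.get(base_url)
        central_db.update(fetch_updates(base_url, since))
        last_harvest[base_url] = now  # remember when this site was last visited
    return central_db


# A made-up site standing in for a real PMH repository.
FAKE_SITES = {
    "http://a.example.org/oai": {
        "oai:a:1": {"title": "Survey A", "datestamp": "2003-01-10"},
        "oai:a:2": {"title": "Service A", "datestamp": "2003-02-12"},
    },
}


def fake_fetch(base_url, since):
    """Return the site's records whose datestamp is on or after `since`."""
    return {
        identifier: record
        for identifier, record in FAKE_SITES[base_url].items()
        if since is None or record["datestamp"] >= since
    }


central_db, last_harvest = {}, {}
harvest(FAKE_SITES, fake_fetch, last_harvest, "2003-02-01", central_db)  # first visit
harvest(FAKE_SITES, fake_fetch, last_harvest, "2003-02-17", central_db)  # revisit
print(sorted(central_db))
```

On the revisit only oai:a:2 is fetched again, which is the saving that date-based selective harvesting is meant to provide.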

An example of a general interface is my.OAI (http://www.myoai.com/). Its interface allows users to choose the databases to search and offers text-based searching against the various Dublin Core metadata fields. Dublin Core-based searching is by far the most common type of searching.

An example of a more focused application is the Gateway to Cultural Heritage Materials, a student project developed at the University of Illinois Urbana-Champaign ( http://oai.grainger.uiuc.edu/CandI303/search). This gateway has selected 40 collections related to cultural heritage that it harvests metadata from. In all, it collects approximately 1.5 million records (and produces an additional million records for added value) that can be searched based on Dublin Core metadata. The interface illustrates some specialization for the topic in its handling of time ranges and types of media that are described.

A more sophisticated project is the Open Language Archives Community ( http://www.language-archives.org). This community has defined its own metadata format for describing linguistic resources and supporting advanced searching beyond the Dublin Core. They have developed supporting software, including, for example, an SQL schema for loading their specialized metadata into a relational database. In many ways, this community could serve as a model for the VO.

The Fit to Requirements

When we compare this technology with the requirements for registries draft from the NVO project (Plante et al. 2003), it's clear that the PMH does not address the entire problem of registries. It mainly addresses who manages the information that goes into registries--i.e. the originating repositories--and how that information gets into the registry (section C of Plante et al. 2003). It cannot address how users query the registry (section B) since the protocol does not support complex criteria (e.g. freq > 400.0 nm) as part of its selective harvesting mechanism. As described above, however, it is not part of PMH's intention to address how the data are searched once they are collected.
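Since complex criteria fall outside the protocol, they would have to be evaluated by the registry against records it has already harvested. The sketch below shows one way that could look, using an in-memory SQLite table. The table layout, column names, and records are all assumptions made for illustration, as is reading the example criterion as a cut on wavelength in nm.

```python
import sqlite3

# Made-up harvested records: (identifier, title, minimum wavelength in nm).
HARVESTED = [
    ("oai:example.org:r1", "Optical survey", 380.0),
    ("oai:example.org:r2", "Near-infrared survey", 1100.0),
    ("oai:example.org:r3", "Ultraviolet atlas", 120.0),
]


def build_registry(rows):
    """Load harvested records into an in-memory relational table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE resource ("
        "identifier TEXT PRIMARY KEY, title TEXT, min_wavelength_nm REAL)"
    )
    db.executemany("INSERT INTO resource VALUES (?, ?, ?)", rows)
    return db


def search(db, min_wavelength_nm):
    """Answer a numeric range query that PMH itself cannot express."""
    cursor = db.execute(
        "SELECT identifier FROM resource "
        "WHERE min_wavelength_nm > ? ORDER BY identifier",
        (min_wavelength_nm,),
    )
    return [row[0] for row in cursor]


registry = build_registry(HARVESTED)
print(search(registry, 400.0))
```

The division of labour is the point: PMH moves the records into the registry, and the registry's own query interface handles the searching.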

On the other hand, the PMH is a very good match to our requirements for Registry Contents (section A). First, there is no real restriction in the protocol as to what a record can describe. A resource can describe child resources, services, data collections, or--if desired--individual catalogs and images. Second, we have the ability to define our own XML-based metadata record format, and thus we can ensure that our application-specific information, at whatever complexity level we need, can be included. Simplicity in the metadata, of course, is also allowed.

One of the things that makes the OAI model attractive is that the registration information is controlled by the repository. This means that curators can update their descriptions at will. Updates could include editing descriptions, adding new records, or removing others. It is ultimately the registry's responsibility to update its data accordingly. Furthermore, if we concentrate on registering high-level things like resources, collections, and services--things that don't change much with time--it would be very easy to develop generic PMH implementations (based on static XML documents) that could be distributed to curators. Such a tool could transparently handle sharing of common metadata between many records. All this adds up to making it easy for curators to register their resources and services.
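A generic implementation backed by static descriptions could be quite small. The sketch below answers two of the verbs from an in-memory dictionary that stands in for a directory of static XML files. It is deliberately incomplete: it omits the OAI-PMH envelope, datestamps, and the other four verbs, and the identifiers, metadata fragments, and function name handle are hypothetical. The error codes badVerb and idDoesNotExist are taken from the PMH v2.0 specification.

```python
from xml.sax.saxutils import escape

# Hypothetical static store: identifier -> metadata XML fragment that a
# curator would edit as a file on disk.
STATIC_RECORDS = {
    "oai:example.org:archive": "<title>Example Archive</title>",
    "oai:example.org:survey": "<title>Example Survey</title>",
}


def handle(params, records=STATIC_RECORDS):
    """Answer a PMH request (a dict of GET arguments) from the static store."""
    verb = params.get("verb")
    if verb == "ListIdentifiers":
        headers = "".join(
            f"<header><identifier>{escape(i)}</identifier></header>"
            for i in sorted(records)
        )
        return f"<ListIdentifiers>{headers}</ListIdentifiers>"
    if verb == "GetRecord":
        identifier = params.get("identifier")
        if identifier not in records:
            return '<error code="idDoesNotExist"/>'
        return (
            "<GetRecord><record>"
            f"<header><identifier>{escape(identifier)}</identifier></header>"
            f"<metadata>{records[identifier]}</metadata>"
            "</record></GetRecord>"
        )
    return '<error code="badVerb"/>'


print(handle({"verb": "GetRecord", "identifier": "oai:example.org:survey"}))
```

Because the store is just a set of documents, a curator updates a description by editing a file; no database or custom service code is involved.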

One complication to our potential use of the PMH is our need to support a hierarchical notion of resources. The flexibility afforded by the metadata format allows us latitude to encode hierarchical relationships between records with references. A generic implementation could hide this complication from the curator such that they only deal with a hierarchical description of their holdings.
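One way a generic tool might hide this complication is to let the curator write a nested description and derive the flat, cross-referenced records from it. The following is a sketch of that idea only; the field names (id, title, children, parent), the identifiers, and the function name flatten are invented for the example.

```python
def flatten(node, parent_id=None):
    """Turn a nested description into flat records with parent references."""
    record = {key: value for key, value in node.items() if key != "children"}
    record["parent"] = parent_id
    records = [record]
    for child in node.get("children", []):
        records.extend(flatten(child, parent_id=node["id"]))
    return records


# What the curator edits: a hierarchical description of the holdings.
HOLDINGS = {
    "id": "oai:example.org:archive",
    "title": "Example Archive",
    "children": [
        {
            "id": "oai:example.org:survey",
            "title": "Example Survey",
            "children": [
                {"id": "oai:example.org:survey-images", "title": "Image service"},
            ],
        },
    ],
}

# What the PMH interface serves: one flat record per item.
for record in flatten(HOLDINGS):
    print(record["id"], "<-", record["parent"])
```

Each flat record can then be exposed individually through GetRecord, while a registry that harvests all of them can rebuild the tree from the parent references.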

In summary, for those requirements where it is relevant, I feel that OAI is a good match to our registry requirements. The final decision whether to adopt a PMH-based model needs to address not only how it compares to other solutions, but also the ease with which we can integrate solutions that address the remaining requirements.

Conclusions and Recommendations

The overall flexibility of the OAI-PMH model makes exploring a PMH-based specification of VO registries worthwhile. The active use of OAI in a variety of digital library communities makes it quite likely that we can leverage existing experience, resources, and software.

In developing a specification of the use of PMH within the VO registry framework, I would make the following recommendations:

We should build the framework assuming that the most common types of records that will be harvested into VO registries are descriptions of high-level things--namely, resources/repositories, data collections, and services. It should be possible to register more fine-grained things, but at least initially, this will be much less common. This is based on the idea that registries are used to support coarse searching for resources that might have what the user wants; a definitive search would be accomplished by querying the candidate resources directly.
We should define standard VO PMH Sets for resources/repositories, data collections, and services as a way of distinguishing between these types of records.
We should plan to develop our own XML record format that can include our application-specific metadata, such as coverage information. This schema should, as a start, include the resource and service metadata concepts (Hanisch et al. 2002).
Our VO XML record format should allow a record to refer to a parent resource via that resource's identifier. For an item A to be a parent of item B implies that item A logically contains item B. This would allow us to indicate, for example, that a sub-resource (say HEASARC) is part of another resource (say NASA). Furthermore, the relationship should imply that item B shares all of its parent's resource metadata unless explicitly overridden in the XML description of item B. This would allow us to minimize the repetition of metadata in the various descriptions.
Our hierarchical model should be based on that presented in Hanisch et al. 2002. In particular, a resource can contain:
- other resources
- data collections
- services
Nothing other than resources can contain other resources. Data collections may contain other collections (curated by the same resource). Services include those that access data from a data collection (e.g. SIA).
We should prototype a generic implementation of the PMH services. It should include tools that make it easy to enter in new description information. This could be easily accomplished using a model in which the XML records are stored and edited as static files.
We should consider defining as part of the Registry interface two services:
- Register: request a namespace and attempt to register a new resource. The inputs would include the requested identifier namespace, and the PMH base URL. The consequence of registering a PMH implementation would be similar to the consequence of registering with OAI.
- Update: request an update of harvested records in the registry. This could be called by a resource to indicate that its records had changed. (Calling this service, of course, would be optional, as the registry should periodically check for updates.)
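The parent reference, metadata inheritance, and containment rules recommended above can be sketched together as follows. This is a minimal illustration under stated assumptions: the record layout, identifiers, and metadata values are invented; the parent chain is assumed to contain no cycles; and the requirement that nested collections be curated by the same resource is not checked.

```python
# Which record types may contain which, following the recommendations above.
ALLOWED_CHILDREN = {
    "resource": {"resource", "collection", "service"},
    "collection": {"collection"},
    "service": set(),
}

# Made-up records; each one names its parent by identifier.
RECORDS = {
    "oai:example.org:nasa": {
        "type": "resource", "parent": None,
        "metadata": {"publisher": "NASA", "contact": "help@nasa.example.org"},
    },
    "oai:example.org:heasarc": {
        "type": "resource", "parent": "oai:example.org:nasa",
        "metadata": {"title": "HEASARC", "contact": "help@heasarc.example.org"},
    },
    "oai:example.org:heasarc-sia": {
        "type": "service", "parent": "oai:example.org:heasarc",
        "metadata": {"title": "Image service"},
    },
}


def containment_errors(records):
    """List (child, parent) pairs that break the containment rules."""
    errors = []
    for identifier, record in records.items():
        parent_id = record["parent"]
        if parent_id is None:
            continue
        parent_type = records[parent_id]["type"]
        if record["type"] not in ALLOWED_CHILDREN[parent_type]:
            errors.append((identifier, parent_id))
    return errors


def resolve_metadata(records, identifier):
    """Merge metadata down the parent chain; a child's values override its parent's."""
    chain = []
    current = identifier
    while current is not None:
        chain.append(records[current])
        current = records[current]["parent"]
    merged = {}
    for record in reversed(chain):  # start at the top-level ancestor
        merged.update(record["metadata"])
    return merged


print(containment_errors(RECORDS))
print(resolve_metadata(RECORDS, "oai:example.org:heasarc-sia"))
```

In this example the service record states only its title, yet resolves to a full description: the publisher comes from the top-level resource and the contact from the intermediate one, which overrides its own parent's value.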

References

Berners-Lee et al. 1998, "Uniform Resource Identifiers (URI): Generic Syntax", http://www.ietf.org/rfc/rfc2396.txt?number=2396.

Hanisch et al. 2002, "Resource and Service Metadata for the Virtual Observatory (v5)", http://bill.cacr.caltech.edu/cfdocs/usvo-pubs/files/ResourceServiceMetadataV5.pdf.

Lagoze et al. 2002, "The Open Archives Initiative Protocol for Metadata Harvesting (v2.0)", http://www.openarchives.org/OAI/openarchivesprotocol.html.

Plante and the NVO Metadata Working Group 2003, "Requirements for Registries", http://rai.ncsa.uiuc.edu/~rplante/VO/metadata/registryreq.txt.

Tody et al. 2002, "Simple Image Access Prototype Specification (v1.0)", http://www.aoc.nrao.edu/~dtody/sim.html.

Williams, Roy 2003, "OAI for Virtual Observatory", http://archives.us-vo.org/metadata/0534.html.

Document Maintainer: Ray Plante