International Virtual Observatory Alliance
The Open Archives Initiative (OAI) is an effort within the digital library community to develop and promote interoperability standards that enable the dissemination of electronic content. At the core of the Initiative is the Protocol for Metadata Harvesting (PMH) standard, through which a data repository can publish metadata about its holdings. While its heritage is in supporting "e-print" repositories, the intended market is clearly broader, and the protocol could be used to publish metadata about any kind of data. In this document, I give an overview of the Harvesting Protocol, evaluating it in terms of VO registry requirements. In conclusion, I make some recommendations about how OAI may be applied to VO registry applications. The OAI PMH does not address all of our requirements; however, it does have some nice features that could be made part of our overall registry solution.
In the PMH model, metadata are exposed in the form of records, in which each record describes a single data item. In the digital library community, a data item has typically been a book or other hard medium, an electronic paper or abstract, a graphical image, etc. However, the data item need not be so fine-grained. The record could describe a larger data collection, and there is an effort within the OAI community to build applications that handle records at this level. The other important component of the model is the unique identifier associated with each record. PMH requires identifiers to conform to the IETF standard for URIs (RFC 2396, Berners-Lee 1998).
The interface definition has much in common with that of the Simple Image Access Protocol (Tody et al. 2002). Both define a related set of web-based services that accept URL-encoded (HTTP GET) queries and return results in XML. In PMH, there are six such services, referred to as verbs, that must be implemented by a compliant repository. Three of them form the core of the interface and are used to retrieve the metadata. ListIdentifiers essentially returns all identifiers that match some coarse-grained criteria. GetRecord returns a single record containing the metadata description of a data item given an identifier. ListRecords is like ListIdentifiers, except that it returns the full records for the given criteria. The remaining verbs aid harvesters in making use of the core verbs.
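To make the query style concrete, here is a minimal sketch of how a harvester might assemble a PMH request URL. The base URL and the record identifier are hypothetical placeholders, and build_request is an illustrative helper of my own, not part of any OAI library; the verb names and the identifier and metadataPrefix arguments are those of the PMH v2.0 specification.

```python
from urllib.parse import urlencode

# The six verbs defined by PMH v2.0.
VERBS = {
    "Identify", "ListMetadataFormats", "ListSets",
    "ListIdentifiers", "ListRecords", "GetRecord",
}


def build_request(base_url, verb, **args):
    """Return the HTTP GET URL for a PMH request (illustrative helper)."""
    if verb not in VERBS:
        raise ValueError(f"unknown PMH verb: {verb}")
    query = {"verb": verb}
    query.update(args)
    # urlencode percent-escapes reserved characters such as ':'.
    return base_url + "?" + urlencode(query)


# Hypothetical repository and identifier, for illustration only.
BASE_URL = "http://repository.example.org/oai"
url = build_request(
    BASE_URL,
    "GetRecord",
    identifier="oai:example.org:item-1",
    metadataPrefix="oai_dc",
)
print(url)
```

Because every verb is just a value of the same query parameter, a single helper of this kind covers the whole interface; only the accompanying arguments differ from verb to verb.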
The ListRecords and ListIdentifiers services support a fairly coarse-grained set of filtering criteria, enabling what the specification refers to as selective harvesting. It includes filtering on the update time, allowing harvesters to retrieve only records updated after a certain date. The other major form of filtering is by Set, which can be thought of as categories that records can be collected into. Defined by the data provider, sets can group records by collection name, topic, data item type, etc. A record may be associated with multiple sets. The ListSets verb describes the different sets supported by the provider.
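The semantics of selective harvesting can be sketched from the provider's side as a simple filter over records. The records, dates, and set names below are invented for illustration, and the select function is a hypothetical helper; it treats the date bounds as inclusive and matches set names exactly, which is a simplification of what a real repository would do.

```python
from datetime import date

# Hypothetical records: (identifier, last-update date, set names).
RECORDS = [
    ("oai:example.org:1", date(2002, 11, 3), {"images"}),
    ("oai:example.org:2", date(2003, 1, 15), {"catalogs", "radio"}),
    ("oai:example.org:3", date(2003, 2, 20), {"images", "radio"}),
]


def select(records, from_date=None, until_date=None, set_name=None):
    """Return identifiers of records matching the harvesting criteria."""
    selected = []
    for identifier, updated, sets in records:
        if from_date is not None and updated < from_date:
            continue  # last updated before the requested window
        if until_date is not None and updated > until_date:
            continue  # last updated after the requested window
        if set_name is not None and set_name not in sets:
            continue  # record is not a member of the requested set
        selected.append(identifier)
    return selected


# Records in the "radio" set updated on or after 1 January 2003.
recent_radio = select(RECORDS, from_date=date(2003, 1, 1), set_name="radio")
print(recent_radio)
```

Note that record 3 belongs to two sets at once, so it would be returned for a request on either "images" or "radio".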
Responses from all of the PMH services are encoded in the PMH XML Schema. For the GetRecord and ListRecords verbs, the response format is essentially an XML envelope around a metadata description in some other text-based format. The data provider can for the most part choose what metadata formats it wishes to support; the supported formats are advertised through the ListMetadataFormats verb. All compliant implementations are required to support the OAI Dublin Core format, however. Another common format supported in the digital library community is the MARC record format. It is expected that other communities will want to define their own metadata formats that support richer metadata descriptions.
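The envelope structure is easiest to see in a parsed example. The sketch below unpacks a hand-written GetRecord response carrying an OAI Dublin Core description; the identifier, dates, and titles are invented, and the response is abbreviated to the elements needed for the illustration. The namespace URIs are those used by PMH v2.0 and Dublin Core.

```python
import xml.etree.ElementTree as ET

# Abbreviated, hand-written response; all values are hypothetical.
RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2003-02-20T12:00:00Z</responseDate>
  <request verb="GetRecord">http://repository.example.org/oai</request>
  <GetRecord>
    <record>
      <header>
        <identifier>oai:example.org:item-1</identifier>
        <datestamp>2003-01-15</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example Sky Survey</dc:title>
          <dc:creator>Example Observatory</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>
"""

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

root = ET.fromstring(RESPONSE.strip())
record = root.find("oai:GetRecord/oai:record", NS)

# The header belongs to the PMH envelope ...
identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
datestamp = record.findtext("oai:header/oai:datestamp", namespaces=NS)

# ... while everything under <metadata> is in the provider's chosen format.
metadata = record.find("oai:metadata", NS)
titles = [element.text for element in metadata.findall(".//dc:title", NS)]

print(identifier, datestamp, titles)
```

A community-defined format would simply replace the oai_dc:dc element inside the metadata element, leaving the envelope and the harvesting code around it unchanged.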
Rounding out the six basic services is the Identify verb. This describes the data provider as a whole, and gives some extra information about its interface (e.g. time range, compression, deletion mechanism).
The OAI site (http://www.openarchives.org) maintains a registry of compliant
repositories. When a repository administrator registers, she enters
the base URL for the interface into a form at the OAI site. The
registry will then conduct a series of conformance tests on the
implementation. If the interface passes, its Identify response is
stored in the registry. The registry periodically retests the
interface and removes the repository if the tests ever fail. The
contact email that is included in the Identify record allows the
registry to return helpful information about non-conformance issues.
In addition to the registry of compliant repositories, OAI also
maintains a registry of services that operate on data harvested via
PMH. These services range from general-purpose interfaces that support
searches of metadata from every known PMH-compliant repository to
gateways to a subset of repositories that serve a specific community.
Most applications follow the typical pattern of collecting metadata at
a central site for fast
searching; these applications must periodically revisit sites to
update the central database. Some applications, however, allow
search criteria to be given to a web crawler that will retrieve
records and search them on-the-fly.
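The periodic revisit performed by a central-database application can be sketched as an incremental merge. Everything here is hypothetical: the repository is modeled as an in-memory dictionary rather than a live PMH endpoint, and harvest_updates is an illustrative helper that stands in for a date-filtered ListRecords request followed by a database update.

```python
from datetime import date


def harvest_updates(central, repository, last_harvest):
    """Merge records updated on or after last_harvest into the central index.

    Returns the most recent update date seen, which the harvester would
    remember and use as the starting point for its next visit.
    """
    newest = last_harvest
    for identifier, (updated, metadata) in repository.items():
        if updated >= last_harvest:
            central[identifier] = metadata
            newest = max(newest, updated)
    return newest


# Hypothetical repository contents: identifier -> (update date, metadata).
repository = {
    "oai:example.org:1": (date(2002, 11, 3), {"title": "Old Plates"}),
    "oai:example.org:2": (date(2003, 1, 15), {"title": "Radio Catalog"}),
    "oai:example.org:3": (date(2003, 2, 20), {"title": "Image Survey"}),
}

# Central index as left by an earlier harvest.
central = {"oai:example.org:1": {"title": "Old Plates"}}

next_start = harvest_updates(central, repository, date(2003, 1, 1))
print(sorted(central), next_start)
```

Only the two records touched since the previous visit are transferred, which is what keeps routine revisits cheap for both the harvester and the repository.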
An example of a general interface is my.OAI (http://www.myoai.com/). Its interface allows users to choose the databases to search and offers text-based searching against the various Dublin Core metadata fields. Dublin Core-based searching is by far the most common type of searching.
An example of a more focused application is the Gateway to Cultural Heritage Materials, a student project developed at the University of Illinois Urbana-Champaign (http://oai.grainger.uiuc.edu/CandI303/search). This gateway has selected 40 collections related to cultural heritage that it harvests metadata from. In all, it collects approximately 1.5 million records (and produces an additional million records for added value) that can be searched based on Dublin Core metadata. The interface illustrates some specialization for the topic in its handling of time ranges and types of media that are described.
A more sophisticated project is the Open Language Archives Community (http://www.language-archives.org). This community has defined its own metadata format for describing linguistic resources and supporting advanced searching beyond the Dublin Core. They have developed supporting software, including, for example, an SQL schema for loading their specialized metadata into a relational database. In many ways, this community could serve as a model for the VO.
On the other hand, the PMH is a very good match to our requirements for Registry Contents (section A). First, there is no real restriction in the protocol as to what a record can describe. A resource can describe child resources, services, data collections, or--if desired--individual catalogs and images. Second, we have the ability to define our own XML-based metadata record format, and thus we can ensure that our application-specific information, at whatever complexity level we need, can be included. Simplicity in the metadata, of course, is also allowed.
One of the things that makes the OAI model attractive is that the registration information is controlled by the repository. This means that curators can update their descriptions at will. Updates could include editing descriptions, adding new records, or removing others. It is ultimately the registry's responsibility to update its data accordingly. Furthermore, if we concentrate on registering high-level things like resources, collections, and services (things that don't change that much with time), it would be very easy to develop generic PMH implementations (based on static XML documents) that could be distributed to curators. Such a tool could transparently handle sharing of common metadata between many records. All this adds up to making it easy for curators to register their resources and services.
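As a rough indication of how small such a generic implementation could be, the sketch below answers two of the verbs from a dictionary of static XML descriptions. It is an assumption-laden toy rather than a compliant repository: the identifiers, the resource element, and its example.org namespace are hypothetical stand-ins for a VO-defined format, the handle function is my own illustration, and required response elements such as responseDate, request, and the record datestamp are omitted for brevity.

```python
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"

# Hypothetical static descriptions that a curator might maintain by hand.
STATIC_RECORDS = {
    "oai:example.org:archive": (
        '<resource xmlns="http://example.org/vo-resource">'
        "<title>Example Archive</title></resource>"
    ),
    "oai:example.org:survey": (
        '<resource xmlns="http://example.org/vo-resource">'
        "<title>Example Survey</title></resource>"
    ),
}


def handle(params, records=STATIC_RECORDS):
    """Answer a PMH-style request given as a dict of query parameters."""
    root = ET.Element("OAI-PMH", xmlns=OAI_NS)
    verb = params.get("verb")
    if verb == "ListIdentifiers":
        body = ET.SubElement(root, verb)
        for identifier in sorted(records):
            header = ET.SubElement(body, "header")
            ET.SubElement(header, "identifier").text = identifier
    elif verb == "GetRecord":
        identifier = params.get("identifier")
        if identifier not in records:
            error = ET.SubElement(root, "error", code="idDoesNotExist")
            error.text = "no record with the requested identifier"
        else:
            record = ET.SubElement(ET.SubElement(root, verb), "record")
            header = ET.SubElement(record, "header")
            ET.SubElement(header, "identifier").text = identifier
            # Wrap the curator's static description in the PMH envelope.
            metadata = ET.SubElement(record, "metadata")
            metadata.append(ET.fromstring(records[identifier]))
    else:
        # Only two verbs are sketched here; all others are rejected.
        ET.SubElement(root, "error", code="badVerb")
    return ET.tostring(root, encoding="unicode")


listing = handle({"verb": "ListIdentifiers"})
print(listing)
```

In this arrangement the curator's only task is to edit the static descriptions; the envelope, the error handling, and the verb dispatch are supplied by the distributed tool.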
One complication to our potential use of the PMH is our need to support a hierarchical notion of resources. The flexibility afforded by the metadata format allows us latitude to encode hierarchical relationships between records with references. A generic implementation could hide this complication from the curator such that they only deal with a hierarchical description of their holdings.
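One way to picture this is to store a parent reference in each flat record and let a tool rebuild the hierarchy from those references. The sketch below does this for a handful of invented records; the identifiers, titles, and the idea of a single parent reference per record are assumptions made for illustration, not a proposed metadata design.

```python
# Hypothetical flat records: identifier -> (title, parent identifier or None).
FLAT = {
    "oai:example.org:archive": ("Example Archive", None),
    "oai:example.org:survey": ("Example Survey", "oai:example.org:archive"),
    "oai:example.org:cutout": ("Cutout Service", "oai:example.org:survey"),
    "oai:example.org:catalog": ("Source Catalog", "oai:example.org:archive"),
}


def build_tree(flat):
    """Group record identifiers under their parents.

    Returns (roots, children), where children maps each identifier to the
    list of identifiers that reference it as their parent.
    """
    children = {identifier: [] for identifier in flat}
    roots = []
    for identifier, (_title, parent) in flat.items():
        if parent is None:
            roots.append(identifier)
        else:
            children[parent].append(identifier)
    return roots, children


def outline(flat, identifiers, children, depth=0):
    """Render the hierarchy as indented titles, one per line."""
    lines = []
    for identifier in identifiers:
        lines.append("  " * depth + flat[identifier][0])
        lines.extend(outline(flat, children[identifier], children, depth + 1))
    return lines


roots, children = build_tree(FLAT)
hierarchy = outline(FLAT, roots, children)
print("\n".join(hierarchy))
```

The harvester sees four independent records, while the curator, working through such a tool, sees a single nested description of the archive.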
In summary, for those requirements where it is relevant, I feel that OAI is a good match to our registry requirements. The final decision whether to adopt a PMH-based model needs to address not only how it compares to other solutions, but also the ease with which we can integrate solutions that address the remaining requirements.
In developing a specification of the use of PMH within the VO registry framework, I would make the following recommendations:
Hanisch et al. 2002, "Resource and Service Metadata for the Virtual Observatory (v5)", http://bill.cacr.caltech.edu/cfdocs/usvo-pubs/files/ResourceServiceMetadataV5.pdf.
Lagoze et al. 2002, "The Open Archives Initiative Protocol for Metadata Harvesting (v2.0)", http://www.openarchives.org/OAI/openarchivesprotocol.html.
Plante and the NVO Metadata Working Group 2003, "Requirements for Registries", http://rai.ncsa.uiuc.edu/~rplante/VO/metadata/registryreq.txt.
Tody et al. 2002, "Simple Image Access Prototype Specification (v1.0)", http://www.aoc.nrao.edu/~dtody/sim.html.
Williams, Roy 2003, "OAI for Virtual Observatory", http://archives.us-vo.org/metadata/0534.html.