IVOA Web>IvoaResReg>RegistryOperations>RegisteringBestPracticesDisc (2013-10-23, RayPlante)

Discussion of Best Practices for Registering Resources

Please add your comments:
	Please add your discussion at the end of the Comments/Discussion section; be sure to identify it with your signature (e.g. -- RayPlante - 2013-10-22). You may also insert brief annotations into the preceding sections; label these as well (e.g. "(RP)").

Discussion of Best Practices for Registering Resources
- Background
  - A Problem with Early Proposals
  - Search Requirements and Use Cases
- A Revised Proposal: a single resource per collection
Comments/Discussion

Background

Over several Interops, we have discussed some best practices regarding how to register scientific data collections and the services that access them. The motivations for such practices include:

making resources easier to find in targeted searches. We want to make sure that resource descriptions include the important information that is likely to be used in queries.
making search responses more comprehensible by, for example:
- avoiding displays of search results that appear to contain multiple occurrences of the same resource (when they are actually subtly different)
- making it clearer what the individual resources represent
- distinguishing between "original" published data and "mirrored" or "re-published" data

One of the ways we have discussed doing this is by promoting a uniform pattern for registering resources. The idea is that if collections and their services were registered according to a uniform convention (and that convention could be recognized when in use), client applications (such as the VAO DDT) could provide a more meaningful and easier-to-understand display of search results.

A Problem with Early Proposals

One convention that has been proposed, and which is already in some use, involves splitting the description of the underlying collection and the description of the services that access it into separate resources (e.g. see slide 5, "A Common Registration Pattern" of Plante's May2013 presentation). That is,

a data collection would be registered as a DataCollection resource; this description would include all of the science-related information (including table column descriptions, if applicable).
services that access the collection would be registered collectively but separately (as a DataService or CatalogService).
relationship links would connect the collection with its services.

At the Sept. 2013 Interop, Markus Demleitner reported on some important disadvantages this approach presents for some simple but important search use cases using the TAP interface (see slide 6, "Uneasy Relationships", and the slides that follow in his presentation for details). In particular, separating the access metadata (in the Service resource) from the science metadata (in the DataCollection resource) requires some fairly complex joining using relationship information. It is easy to argue that under this convention, certain simple queries cannot be expressed simply.

Search Requirements and Use Cases

A well-annotated description of requirements and use cases for searching for resources is available via the RestfulRegistryInterfaceReq page. There are a few use cases that are particularly relevant to this discussion:

1. We want to be able to find resources based on scientific topics. This requires that resource descriptions include specific words about the science behind the data.
  - RayPlante notes: this suggests that data collections be registered at a somewhat fine-grained level (a la Vizier catalogs).
2. We want to be able to find all services supporting a particular data access protocol (e.g. all SIA services or all TAP services).
3. Combination of 1 and 2: find all collections related to a particular science topic that support a particular protocol (e.g. find molecular cloud observations accessible via SSA); return the accessURLs.

It is the third use case, in which we need the accessURL (access metadata) along with the science metadata (e.g. description, column information), that makes queries messy when the science and access metadata are in different resources.
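To make the difference concrete, here is a minimal sketch of the third use case against a toy relational model. The table names, column names, identifiers and URLs are all invented for illustration and are not the actual registry schema; the SSA standard identifier string is also used only as an example value. The sketch simply contrasts the query needed when science and access metadata sit in one record with the query needed when they are linked through a relationship.

```python
import sqlite3

# Toy schema for illustration only: these table and column names are
# hypothetical and do not reproduce any real registry schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE resource (ivoid TEXT PRIMARY KEY, res_type TEXT, description TEXT);
CREATE TABLE capability (ivoid TEXT, standard_id TEXT, access_url TEXT);
CREATE TABLE relationship (ivoid TEXT, rel_type TEXT, related_id TEXT);
""")

SSA = "ivo://ivoa.net/std/SSA"  # example protocol identifier

conn.executemany("INSERT INTO resource VALUES (?, ?, ?)", [
    # Separated convention: science metadata and access metadata in two records.
    ("ivo://example/clouds", "DataCollection", "spectra of molecular cloud cores"),
    ("ivo://example/clouds/svc", "DataService", "access services"),
    # Unified convention: one record carries both.
    ("ivo://example/clouds2", "CatalogService", "spectra of molecular cloud filaments"),
])
conn.executemany("INSERT INTO capability VALUES (?, ?, ?)", [
    ("ivo://example/clouds/svc", SSA, "http://example.org/clouds/ssa"),
    ("ivo://example/clouds2", SSA, "http://example.org/clouds2/ssa"),
])
conn.execute("INSERT INTO relationship VALUES (?, ?, ?)",
             ("ivo://example/clouds/svc", "service-for", "ivo://example/clouds"))

# Unified record: topic and protocol are constrained on the same ivoid.
unified_sql = """
SELECT r.ivoid, c.access_url
FROM resource r JOIN capability c ON c.ivoid = r.ivoid
WHERE r.description LIKE '%molecular cloud%' AND c.standard_id = ?
"""

# Separated records: the capability must be reached through the relationship.
separated_sql = """
SELECT r.ivoid, c.access_url
FROM resource r
JOIN relationship rel ON rel.related_id = r.ivoid AND rel.rel_type = 'service-for'
JOIN capability c ON c.ivoid = rel.ivoid
WHERE r.description LIKE '%molecular cloud%' AND c.standard_id = ?
"""

unified = conn.execute(unified_sql, (SSA,)).fetchall()
separated = conn.execute(separated_sql, (SSA,)).fetchall()
print(unified)
print(separated)
```

Note that a client that does not know which convention a given provider followed would have to run both forms and merge the results, which is part of what makes the separated pattern awkward.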

A Revised Proposal: a single resource per collection

We can avoid the joining mess described above if the science and access metadata are included in the same resource description. Doing so would also make it easier to interpret search results. An apt proposal, then, would be to combine the role of the DataCollection and DataService (or rather CatalogService) into a single resource type.

In detail the proposal would be described as follows:

Every data provider is registered via an Organisation resource (and their authorities registered separately as well)
Each data collection published by the provider is registered with a resource type^* that includes both science metadata describing what is in the collection and capability elements for each of the services that access the collection.
relationship elements would be used to connect it to other related collections. In particular, if a collection is a mirror of, or derived from, another collection, it would include a relationship that points back to the source collection.
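The shape of such a unified record can be sketched as a small data structure. This is only an illustration of the proposal, not a schema: the class and field names are hypothetical, and the relationship type strings "mirror-of" and "derived-from" are assumed here as the labels for the mirror and derivation links mentioned above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Capability:
    standard_id: str   # identifies the access protocol
    access_url: str    # where the service can be reached


@dataclass
class Relationship:
    rel_type: str      # assumed labels, e.g. "mirror-of" or "derived-from"
    related_id: str    # identifier of the related collection


@dataclass
class CollectionResource:
    """One record per collection: science metadata plus access metadata."""
    ivoid: str
    title: str
    description: str
    publisher_id: str  # points at the provider's Organisation record
    capabilities: List[Capability] = field(default_factory=list)
    relationships: List[Relationship] = field(default_factory=list)

    def access_urls(self, standard_id: str) -> List[str]:
        """Access URLs of this collection for one protocol."""
        return [c.access_url for c in self.capabilities
                if c.standard_id == standard_id]

    def source_collections(self) -> List[str]:
        """Collections this one mirrors or is derived from."""
        return [r.related_id for r in self.relationships
                if r.rel_type in ("mirror-of", "derived-from")]


# Hypothetical example: a mirrored collection with a single service.
example = CollectionResource(
    ivoid="ivo://example/clouds-mirror",
    title="Molecular cloud spectra (mirror)",
    description="spectra of molecular cloud cores",
    publisher_id="ivo://example/org",
    capabilities=[Capability("ivo://ivoa.net/std/SSA",
                             "http://example.org/mirror/ssa")],
    relationships=[Relationship("mirror-of", "ivo://example/clouds")],
)
print(example.access_urls("ivo://ivoa.net/std/SSA"))
print(example.source_collections())
```

A record with an empty capabilities list is still valid in this sketch, which matches the case of a collection registered before any service exists.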

^*The Collection Resource Type and the Evolution of a Resource

There is some question as to the "proper" resource type we should use--that is, whether we can use an existing type or would need to define a new one. We observe that the existing CatalogService resource type provides all of the metadata required to describe a collection and its services. Re-using this type would obviously have no impact on the TAPReg schema. Still, the semantics of this type, one could argue, do not quite match those of a "collection".

For example, we allow (and in fact encourage) providers to register collections even before there are any services available to access them. Perhaps there is only a web page (which would be given via the referenceURL element). Perhaps the collection is a simple catalog that is downloadable as a single file but not yet searchable via ConeSearch. Semantically, such a resource has not yet risen to the status of a "service". Nevertheless, a CatalogService resource record is not required to have any capability elements, so syntactically, a CatalogService resource would work just fine.

There is one feature of a DataCollection that is not strictly present in a CatalogService: a DataCollection provides a special accessURL element for accessing the collection as a whole (e.g. as a single file or, say, a directory of files). However, this would be simple to capture in a generic capability if need be. (It would be worth defining a standardID URI to identify it; a capability extension would not be necessary.)

If we felt that semantics were important (or we had other reasons not to use the CatalogService type in this role), defining a new type need not be difficult since no new metadata elements need to be defined. (It might look as simple as this.) For this reason, the impact on existing registries would likely be minimal. One reason to define a new type (that otherwise looks just like CatalogService)--let's call it DataCollection2--would be to provide an explicit signal to clients that a resource complies with the unified convention. In particular, if one found a DataCollection2 resource, one could be certain that there are no separate resources describing its services, whether it has capability elements or not.

So, we conclude that we have two choices for registering data collections and their services together:

use CatalogService, ignoring the semantic inaccuracies
define a new resource type (e.g. DataCollection2) with the correct semantics but the same contents as CatalogService.

For purposes of the discussion below, we will refer to a collection resource registered according to this proposal as a DataCollection2 resource, regardless of whether it is represented as a CatalogService or as some new resource type.

Services accessing Multiple Collections

There exist several examples in which a provider puts up a single service to access many collections. For most of our DAL services, this fact is not important to clients. An exception, however, is TAP: a TAP service that can access several catalogs means the user can make ADQL joins across these tables. This is a fact that is potentially useful to users.

Related to this is an important search use case: find me all TAP services (i.e. with no other qualifications). This might be used by a specialized client application that can query an arbitrary TAP service; it may wish to list all available TAP services as a "pick-list". This is one place where you might want to see that a single service can access many collections.

To address this case, we propose:

Each TAP-accessible catalog is registered separately as described above to segregate the science metadata. Each will include a description of the tables and columns in the particular catalog.
A separate CatalogResource record is provided to represent the TAP service as a whole. It includes its TAP capability entry along with the "service-for" relationship element pointing to each of the individual catalog resources.
The TAP capability entry from the aggregating TAP service is replicated into all of the individual catalog resources.
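The three steps above can be sketched as follows. This is a rough illustration using plain dictionaries; the function names, dictionary keys, identifiers and URLs are hypothetical, and collapsing results by access URL to build the "pick-list" is just one possible client strategy, not something the proposal prescribes.

```python
import copy

TAP_ID = "ivo://ivoa.net/std/TAP"  # example protocol identifier


def build_tap_records(service_id, tap_capability, catalogs):
    """Return (aggregating service record, updated catalog records).

    `catalogs` is a list of dicts, each with at least an 'ivoid' key.
    The input records are not modified.
    """
    # The service-as-a-whole record points at every catalog it serves.
    service = {
        "ivoid": service_id,
        "capabilities": [copy.deepcopy(tap_capability)],
        "relationships": [
            {"type": "service-for", "related_id": cat["ivoid"]}
            for cat in catalogs
        ],
    }
    # The TAP capability is replicated into each individual catalog record.
    updated = []
    for cat in catalogs:
        rec = copy.deepcopy(cat)
        rec.setdefault("capabilities", []).append(copy.deepcopy(tap_capability))
        updated.append(rec)
    return service, updated


def distinct_tap_services(records, tap_id=TAP_ID):
    """Build a pick-list of TAP endpoints, collapsing replicated entries."""
    urls = {
        cap["access_url"]
        for rec in records
        for cap in rec.get("capabilities", [])
        if cap["standard_id"] == tap_id
    }
    return sorted(urls)


tap = {"standard_id": TAP_ID, "access_url": "http://example.org/tap"}
catalogs = [{"ivoid": "ivo://example/cat1"}, {"ivoid": "ivo://example/cat2"}]
service, catalog_records = build_tap_records("ivo://example/tap", tap, catalogs)
print(distinct_tap_services([service] + catalog_records))
```

Because the capability is replicated, a plain "all TAP services" search would match the aggregating record and every catalog record; a client would need some way, such as the de-duplication shown here, to present one entry per service.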

Archives Hosting Multiple Collections

There are many cases in which a single archive hosts multiple collections. In this section, we describe how to apply the proposed convention to such archives. We consider three cases:

A data center that hosts archives for several missions or observatories (e.g. the IPAC case)
A repository containing many astronomer-published collections of processed data (e.g. the DataVerse case).
An observatory archive containing raw or pipeline-processed products from many guest-observer observations (e.g. the Chandra or HST case)

In practice, these cases may overlap, and so the publisher would use judgement as to how to combine them appropriately.

How much detail is provided in the resource registrations is another choice of the provider. Below, we describe the most detailed representations; however, in practice a publisher may start more simply and add more resources/detail as person-resources allow.

Case 1: The Data Center

A canonical example of this case might be NASA IPAC, which hosts archives for several NASA missions as well as the heterogeneous information system, NED. In the spirit of the proposed convention, the center's assets would be registered as follows:

the data center itself would be registered as an Organisation; this resource would be referred to as the publisher of all of the data resources.
each mission archive would be registered as a separate DataCollection2. It would include a capability for each standard and archive-specific service that can access that collection.
- any NED- or SIMBAD-like resource would also be represented as a DataCollection2; its custom services could be enumerated as capabilities.
if there is some service that can search across all of the archives, it could be represented by either an "uber" DataCollection2 resource (representing the union of all of the collections) or simply as a CatalogService. The individual mission resources would have "part-of" relationship entries that point to this "uber" resource.

Case 2: The Data Repository

The canonical example of this case might be a DataVerse repository or the ADIL where individual astronomers or research groups might deposit data for publishing into the VO. Their data would be related by some common science investigation we might refer to as a "project".

This would be represented in a manner similar to the Data Center case where we substitute "mission" with "project":

The institution running the repository would have an Organisation record that would serve as the repository's publisher.
The repository would be represented by a DataCollection2 resource.
- It would include a Type element set to "Archive".
- Any service that is capable of searching across the entire repository would be listed as a capability.
Each collection within the repository would be represented by a DataCollection2 resource.
- Each would contain science metadata specific to that collection.
- It would include a capability element for each service that accesses the collection (or part thereof)
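As a rough sketch of the resulting set of records for the repository case, the layout above could be generated like this. The function name, dictionary keys and identifiers are hypothetical; "DataCollection2" is used only as the placeholder label adopted in this discussion, whatever concrete resource type is eventually chosen.

```python
def repository_records(org_id, repo_id, repo_services, projects):
    """List the records suggested for a data repository.

    `repo_services` are capabilities that search the whole repository;
    `projects` is a list of dicts with 'ivoid', 'description' and,
    optionally, 'capabilities'.
    """
    # The institution running the repository acts as the publisher.
    records = [{"ivoid": org_id, "res_type": "Organisation"}]

    # The repository as a whole, flagged as an archive.
    records.append({
        "ivoid": repo_id,
        "res_type": "DataCollection2",
        "type": "Archive",
        "publisher": org_id,
        "capabilities": list(repo_services),
    })

    # One record per deposited collection, with its own science metadata.
    for proj in projects:
        records.append({
            "ivoid": proj["ivoid"],
            "res_type": "DataCollection2",
            "publisher": org_id,
            "description": proj["description"],
            "capabilities": list(proj.get("capabilities", [])),
        })
    return records


records = repository_records(
    "ivo://example/org",
    "ivo://example/repo",
    [{"standard_id": "ivo://ivoa.net/std/TAP",
      "access_url": "http://example.org/repo/tap"}],
    [{"ivoid": "ivo://example/repo/proj1",
      "description": "processed images of a molecular cloud survey"}],
)
for rec in records:
    print(rec["res_type"], rec["ivoid"])
```

A project with no services yet simply ends up with an empty capabilities list, which is consistent with registering a collection before its services exist.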

Case 3: The Guest Observatory Archive

A "Guest Observer" facility is an observatory that grants observing time to astronomers via some proposal mechanism and carries out heterogeneous observing plans according to the approved proposals. The associated archive may contain the raw data and/or pipeline-processed data. If the data archive has access to the proposal database, the archive could be treated just like a data repository described in Case 2; the "proposal" would represent the "project" and would be the source of the science metadata.

Reasons that a publisher would not want to represent the archive like a data repository might include:

by policy or due to technical reasons, the proposal information is not available.
the number of proposals is too large to effectively curate/maintain as separate resources
the dynamic nature of the archive--the fact that new projects are being added all the time--makes it impractical to curate/maintain the proposal-projects as separate resources

Assuming that the publisher does not want to register each proposal project separately, the recommended way to register would be as follows:

The observatory is registered as an Organisation to serve as the publisher of the archive.
The archive is represented as a DataCollection2 resource.
- It would include a Type element set to "Archive".
- Each service for accessing data in the archive is listed as a capability.

-- RayPlante - 2013-10-22