Discussion of Best Practices for Registering Resources
Background
Over several Interops, we have discussed some best practices regarding how to register scientific data collections and the services that access them. The motivations for such practices include:
A Problem with Early Proposals
One convention that has been proposed, and which is already in some use, involves separating the description of the underlying collection from the descriptions of the services that access it, placing them in separate resources (e.g. see slide 5, "A Common Registration Pattern", of Plante's May 2013 presentation). That is,
At the Sept. 2013 Interop, Markus Demleitner reported on some important disadvantages this approach presents for some simple but important search use cases using the TAP interface (see slide 6, "Uneasy Relationships", and the slides that follow in his presentation for details). In particular, separating the access metadata (in the Service resource) from the science metadata (in the DataCollection resource) requires some fairly complex joining using relationship information. It is easy to argue that under this convention, certain simple queries cannot be expressed simply.
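To make the joining problem concrete, here is a minimal sketch using an in-memory SQLite database. The three tables and their columns are simplified, hypothetical stand-ins for a relational registry and are not the TAPReg schema; the identifiers and URLs are invented. It shows that, when the collection and its service are separate resources, finding the access URL for a collection matched on its science metadata takes a hop through the relationship table.

```python
import sqlite3


def build_split_registry():
    """Create a toy registry in which collection and service are separate resources."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE resource (ivoid TEXT PRIMARY KEY, res_type TEXT, description TEXT);
        CREATE TABLE interface (ivoid TEXT, access_url TEXT);
        CREATE TABLE relationship (ivoid TEXT, relationship_type TEXT, related_id TEXT);
        """
    )
    # Science metadata lives on the collection; access metadata on the service.
    conn.executemany(
        "INSERT INTO resource VALUES (?, ?, ?)",
        [
            ("ivo://example/quasars", "DataCollection", "A catalog of quasars"),
            ("ivo://example/quasars/scs", "CatalogService", "Cone search service"),
        ],
    )
    conn.execute(
        "INSERT INTO interface VALUES (?, ?)",
        ("ivo://example/quasars/scs", "http://example.org/scs?"),
    )
    # The service declares that it is a service for the collection.
    conn.execute(
        "INSERT INTO relationship VALUES (?, ?, ?)",
        ("ivo://example/quasars/scs", "service-for", "ivo://example/quasars"),
    )
    return conn


def find_access_urls(conn, keyword):
    """Match on the collection's description, then hop via the relationship to the service."""
    return conn.execute(
        """
        SELECT coll.ivoid, intf.access_url
        FROM resource AS coll
        JOIN relationship AS rel
          ON rel.related_id = coll.ivoid AND rel.relationship_type = 'service-for'
        JOIN interface AS intf
          ON intf.ivoid = rel.ivoid
        WHERE coll.description LIKE ?
        """,
        (f"%{keyword}%",),
    ).fetchall()
```

Calling `find_access_urls(build_split_registry(), "quasar")` pairs the collection identifier with the service's access URL. The query only works if the provider recorded the relationship, and a real query would also have to allow for the relationship being declared in the opposite direction.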
Search Requirements and Use Cases
A well-annotated description of requirements and use cases for searching for resources is available via the RestfulRegistryInterfaceReq page. There are a few use cases that are particularly relevant to this discussion:
It is the third use case, in which we need the accessURL (access metadata) along with the science metadata (e.g. description, column information), that makes queries messy when the science and access metadata are in different resources.
A Revised Proposal: a single resource per collection
We can avoid the joining mess described above if the science and access metadata are included in the same resource description. Doing so would also make it easier to interpret search results. An apt proposal, then, would be to combine the roles of the DataCollection and DataService (or rather CatalogService) into a single resource type.
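As an illustration of what the unified convention buys, here is a minimal sketch using an in-memory SQLite database. The two tables and their columns are simplified, hypothetical stand-ins and are not the TAPReg schema; the identifier and URL are invented. Because the science and access metadata share one resource identifier, the search needs only a single join on that identifier and no relationship information.

```python
import sqlite3


def build_unified_registry():
    """Create a toy registry in which one resource carries science and access metadata."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE resource (ivoid TEXT PRIMARY KEY, res_type TEXT, description TEXT);
        CREATE TABLE interface (ivoid TEXT, access_url TEXT);
        """
    )
    # One record: the description and the interface hang off the same identifier.
    conn.execute(
        "INSERT INTO resource VALUES (?, ?, ?)",
        ("ivo://example/quasars", "CatalogService", "A catalog of quasars"),
    )
    conn.execute(
        "INSERT INTO interface VALUES (?, ?)",
        ("ivo://example/quasars", "http://example.org/scs?"),
    )
    return conn


def find_access_urls_unified(conn, keyword):
    """Match on the description and read the access URL from the same resource."""
    return conn.execute(
        """
        SELECT res.ivoid, intf.access_url
        FROM resource AS res
        JOIN interface AS intf ON intf.ivoid = res.ivoid
        WHERE res.description LIKE ?
        """,
        (f"%{keyword}%",),
    ).fetchall()
```

Each row returned by `find_access_urls_unified` is also easier to interpret: the identifier, the description that matched, and the access URL all belong to the same registry record.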
In detail, the proposal would be described as follows:
The Collection Resource Type and the Evolution of a Resource
There is some question as to the "proper" resource type we should use--that is, whether we can use an existing type or would need to define a new one. We observe that the existing CatalogService resource type provides all of the metadata required to describe a collection and its services. Re-using this type would obviously have no impact on the TAPReg schema. Still, one could argue that the semantics of this type do not quite match those of a "collection".
For example, we allow (and in fact encourage) providers to register collections even before there are any services available to access them. Perhaps there is only a web page (which would be given via the referenceURL element). Perhaps the collection is a simple catalog that is downloadable as a single file but not yet searchable via ConeSearch. Semantically, this resource has not yet risen to the status of a "service". Nevertheless, a CatalogService resource record is not required to have any capability elements, so syntactically, a CatalogService resource would work just fine.
There is one feature of a DataCollection that is not strictly present in a CatalogService: a DataCollection provides a special accessURL element for accessing the collection as a whole (e.g. as a single file or, say, a directory of files). However, this would be simple to capture in a generic capability if need be. (It would be worth defining a standardID URI to identify it; a capability extension would not be necessary.)
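One way to picture such a generic capability is the sketch below, which builds and reads a simplified resource record with Python's standard ElementTree module. The standardID value is purely hypothetical (no such URI has been defined), the record omits namespaces and type attributes, and it is not meant to be schema-valid; only the nesting of capability, interface and accessURL follows the usual resource layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical identifier, invented for this sketch; a real one would need to be agreed on.
HYPOTHETICAL_STANDARD_ID = "ivo://example.org/std/whole-collection"


def add_whole_collection_capability(resource, url):
    """Attach a generic capability whose accessURL points at the collection as a whole."""
    cap = ET.SubElement(resource, "capability", {"standardID": HYPOTHETICAL_STANDARD_ID})
    intf = ET.SubElement(cap, "interface")
    access = ET.SubElement(intf, "accessURL")
    access.text = url
    return cap


def whole_collection_url(resource):
    """Return the whole-collection URL if the record advertises one, else None."""
    for cap in resource.findall("capability"):
        if cap.get("standardID") == HYPOTHETICAL_STANDARD_ID:
            return cap.findtext("interface/accessURL")
    return None
```

A client would recognize the capability by its standardID alone, which is why no capability extension is needed in this sketch.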
If we felt that semantics were important (or we had other reasons not to use the CatalogService type in this role), defining a new type need not be difficult, since no new metadata elements need to be defined. (It might look as simple as this.) For this reason, the impact on existing registries would likely be minimal. One reason to define a new type (that otherwise looks just like CatalogService)--let's call it DataCollection2--would be to provide an explicit signal to clients that a resource complies with the unified convention. In particular, if one found a DataCollection2 resource, one could be certain that there are no separate resources describing its services, whether it has capability elements or not.
So, we conclude that we have two choices for registering data collections and their services together:
DataCollection2 resource, regardless of whether it is represented as CatalogService or as some new resource type.
Services Accessing Multiple Collections
There exist several examples in which a provider puts up a single service to access many collections. For most of our DAL services, this fact is not important to clients. An exception, however, is TAP: a TAP service that can access several catalogs means the user can make ADQL joins across these tables, which is potentially useful to users. Related to this is an important search use case: find me all TAP services (i.e. with no other qualifications). This might be used by a specialized client application that can query an arbitrary TAP service; it may wish to list all available TAP services as a "pick-list". This is one place where you might want to see that a single service can access many collections. To address this case, we propose:
Archives Hosting Multiple Collections
There are many cases in which a single archive hosts multiple collections. In this section, we describe how to apply the proposed convention to such archives. We consider three cases:
Case 1: The Data Center
A canonical example of this case might be NASA IPAC, which hosts archives for several NASA missions as well as the heterogeneous information system, NED. In the spirit of the proposed convention, the center's assets would be registered as follows:
Case 2: The Data Repository
The canonical example of this case might be a DataVerse repository or the ADIL, where individual astronomers or research groups might deposit data for publishing into the VO. Their data would be related by some common science investigation we might refer to as a "project". This would be represented in a manner similar to the Data Center case, where we substitute "mission" with "project":
Case 3: The Guest Observatory Archive
A "Guest Observer" facility is an observatory that grants observing time to astronomers via some proposal mechanism and carries out heterogeneous observing plans according to the approved proposals. The associated archive may contain the raw data and/or pipeline-processed data. If the data archive has access to the proposal database, the archive could be treated just like a data repository described in Case 2; the "proposal" would represent the "project" and would be the source of the science metadata. Reasons that a publisher would not want to represent the archive like a data repository might include:
-- RayPlante - 2013-10-22
Comments/Discussion