Discussion of Best Practices for Registering Resources

Background

Over several Interops, we have discussed some best practices regarding how to register scientific data collections and the services that access them. The motivations for such practices include:

  • making resources easier to find under targeted searches: resource descriptions should include the important information that is likely to be used in queries
  • making search responses more comprehensible by, for example,
    • avoiding displays of search results that appear to have multiple occurrences of the same resource (when they are actually subtly different)
    • making it clearer what the individual resources represent
    • distinguishing between "original" published data and "mirrored" or "re-published" data

One of the ways we have discussed doing this is by promoting a uniform pattern for registering resources. The idea is that if collections and their services were registered according to a uniform convention (and that convention could be recognized when in use), client applications (such as the VAO DDT) could provide a more meaningful and easier-to-understand display of search results.

A Problem with Early Proposals

One convention that has been proposed, and which is already in some use, involves separating the description of the underlying collection from the description of the services that access it, registering each as its own resource (e.g. see the "A Common Registration Pattern" slide of Plante's May 2013 presentation). That is,

  • a data collection would be registered as a DataCollection resource; this description would include all of the science-related information (including table column descriptions, if applicable).
  • services that access the collection would be registered collectively but separately (as a DataService or CatalogService).
  • relationship links would connect the collection with its services.

At the Sept. 2013 Interop, Markus Demleitner reported on some important disadvantages this approach presents for some simple but important search use cases using the TAP interface (see slide 6, "Uneasy Relationships", and beyond of his presentation for details). In particular, separating the access metadata (in the Service resource) from the science metadata (in the DataCollection resource) requires some fairly complex joining using relationship information. It is easy to argue that, under this convention, certain simple queries cannot be expressed simply.
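
To make the joining problem concrete, here is a minimal sketch in Python using an in-memory SQLite database. The three-table layout, the column names, the identifiers, and the values such as "conesearch" and "service-for" are all assumptions made up for illustration; this is not the actual TAP registry schema discussed in the presentation. It registers one collection and one service under the separated convention and then asks for the access URLs of cone search services for collections about quasars.

```python
import sqlite3


def build_demo_registry() -> sqlite3.Connection:
    """Build a toy registry that follows the separated convention.

    All table names, columns, and values are hypothetical.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE resource (ivoid TEXT PRIMARY KEY, res_type TEXT, subject TEXT);
        CREATE TABLE capability (ivoid TEXT, standard_id TEXT, access_url TEXT);
        CREATE TABLE relationship (ivoid TEXT, rel_type TEXT, related_ivoid TEXT);
        """
    )
    # Science metadata lives on the collection; the service record has none.
    conn.executemany(
        "INSERT INTO resource VALUES (?, ?, ?)",
        [
            ("ivo://example/coll", "DataCollection", "quasars"),
            ("ivo://example/svc", "CatalogService", None),
        ],
    )
    # Access metadata lives on the service record only.
    conn.execute(
        "INSERT INTO capability VALUES (?, ?, ?)",
        ("ivo://example/svc", "conesearch", "http://example.org/scs?"),
    )
    # A relationship is the only link between the two records.
    conn.execute(
        "INSERT INTO relationship VALUES (?, ?, ?)",
        ("ivo://example/svc", "service-for", "ivo://example/coll"),
    )
    return conn


# The query one would like to write: subject and capability on one resource.
SIMPLE_QUERY = """
    SELECT c.access_url
    FROM resource AS r
    JOIN capability AS c ON c.ivoid = r.ivoid
    WHERE r.subject = ? AND c.standard_id = ?
"""

# The query the separated convention requires: hop through relationships.
JOINED_QUERY = """
    SELECT c.access_url
    FROM resource AS coll
    JOIN relationship AS rel
      ON rel.related_ivoid = coll.ivoid AND rel.rel_type = 'service-for'
    JOIN capability AS c ON c.ivoid = rel.ivoid
    WHERE coll.subject = ? AND c.standard_id = ?
"""


def find_access_urls(conn, query, subject, standard_id):
    """Run one of the queries above and return the matching access URLs."""
    rows = conn.execute(query, (subject, standard_id)).fetchall()
    return [row[0] for row in rows]
```

In this toy registry the simple query finds nothing, because no single resource carries both the subject and the capability; only the version that joins through the relationship table finds the service.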

A Revised Proposal: a single resource per collection

We can avoid the joining mess described above if the science and access metadata are included in the same resource description. Doing so would also make it easier to interpret search results. I propose, then, that we combine the role of the DataCollection and DataService (or rather CatalogService) into a single resource type.

In detail the proposal would be described as follows:

  • Every data provider is registered via an Organisation resource (and their authorities registered separately as well)
  • Each data collection published by the provider is registered with a resource type* that includes both science metadata describing what is in the collection and capability elements for each of the services that access the collection.
  • Relationship elements would be used to connect the collection to other related collections. In particular, if a collection is mirrored or derived from another collection, it would include a relationship that points back to the source collection.
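
As a rough illustration of the shape of such a unified record, the following Python sketch models a collection that carries its science metadata, its capabilities, and its relationships together. The class names, field names, and the relationship type strings "mirror-of" and "derived-from" are assumptions chosen for this example, not definitions taken from any schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Capability:
    """One service that accesses the collection (hypothetical fields)."""
    standard_id: str
    access_url: str


@dataclass
class Relationship:
    """A link to another registered resource (hypothetical fields)."""
    rel_type: str
    related_ivoid: str


@dataclass
class CollectionResource:
    """A single record holding both science and access metadata."""
    ivoid: str
    title: str
    publisher_ivoid: str  # points at the provider's Organisation resource
    subjects: List[str] = field(default_factory=list)
    capabilities: List[Capability] = field(default_factory=list)
    relationships: List[Relationship] = field(default_factory=list)

    def is_original(self) -> bool:
        """True if no relationship points back to a source collection."""
        # The type strings here are assumed labels for this sketch.
        source_types = {"mirror-of", "derived-from"}
        return not any(r.rel_type in source_types for r in self.relationships)
```

With everything in one record, a client can distinguish "original" from "mirrored" data, and list the available services, without looking up any other resource.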

*The Collection Resource Type and the Evolution of a Resource

There is some question as to the "proper" resource type we should use--that is, whether we can use an existing type or would need to define a new one. We observe that the existing CatalogService resource type provides all of the metadata required to describe a collection and its services. Re-using this type would obviously have no impact on the TAPReg schema. Still, the semantics of this type, one could argue, do not quite match those of a "collection".

For example, we allow (and in fact encourage) providers to register collections even before there are any services available to access them. Perhaps there is only a web page (which would be given via the referenceURL element). Perhaps the collection is a simple catalog that is downloadable as a single file but not yet searchable via ConeSearch. Semantically, this resource has not yet risen to the status of a "service". Nevertheless, a CatalogService resource record is not required to have any capability elements, so syntactically, a CatalogService resource would work just fine.

There is one feature of a DataCollection that is not strictly present in a CatalogService: a DataCollection provides a special accessURL element for accessing the collection as a whole (e.g. as a single file or, say, a directory of files). However, this would be simple to capture in a generic capability if need be.

If we felt that semantics were important (or we had other reasons not to use the CatalogService type in this role), defining a new type need not be difficult since no new metadata elements need to be defined. (This is discussed further below.) For this reason, the impact on existing registries would likely be minimal. One reason to define a new type (that otherwise looks just like CatalogService)--let's call it DataCollection2--would be to provide an explicit signal to clients that a resource complies with the unified convention. In particular, if one found a DataCollection2 resource, one could be certain that there are no separate resources describing its services, whether it has capability elements or not.
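
The value of such a signal to a client can be sketched in a few lines of Python. The function name and the set of type names below are hypothetical; the only point is that a dedicated type lets a client skip the relationship lookup, whereas a plain CatalogService record does not reveal which convention its publisher followed.

```python
# Resource types assumed, for this sketch, to follow the unified convention.
UNIFIED_TYPES = {"DataCollection2"}


def needs_relationship_lookup(res_type: str) -> bool:
    """Return True if a client should still follow relationships
    to look for separately registered services.

    A resource of a unified type is guaranteed to describe all of its
    services itself. Any other type, including CatalogService, might
    have been registered under the separated convention.
    """
    return res_type not in UNIFIED_TYPES
```

Under the first option listed below (re-using CatalogService), this check could never return False for a collection, so clients would always have to consider the relationship lookup.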

So, we conclude that we have two choices for registering data collections and their services together:

  • use CatalogService, ignoring the semantic inaccuracies
  • define a new resource type (e.g. DataCollection2) with the correct semantics but the same contents as CatalogService.

Regarding Services accessing Multiple Collections

Topic revision: r2 - 2013-10-22 - RayPlante
 