IVOA Web>IvoaResReg>RegistryOperations>RegisteringBestPracticesDisc (2013-10-23, RayPlante) (raw view)
---+ Discussion of Best Practices for Registering Resources

| *Please add your comments:* ||
|| Please add your discussion at the end of the Comments/Discussion section; be sure to identify it with your signature (e.g. -- RayPlante - 2013-10-22).  You may also insert brief annotations into the preceding sections; label these as well (e.g. "([[RayPlante][RP]])"). |

%TOC%

---++ Background

Over several Interops, we have discussed some best practices regarding how to register scientific data collections and the services that access them.  The motivations for such practices include:

   * making resources easier to find under targeted searches.  We want to make sure that resource descriptions include important information that will likely be used in queries
   * making search responses more comprehensible by, for example,
      * avoiding displays of search results that appear to have multiple occurrences of the same resource (when they are actually subtly different)
      * making it clearer what the individual resources represent
      * distinguishing between "original" published data and "mirrored" or "re-published" data

Among the ways we have discussed doing this is by promoting a uniform pattern for registering resources.  The idea is that if collections and their services were registered according to a uniform convention (and that convention could be recognized when in use), client applications (such as the [[http://vao.stsci.edu/discover][VAO DDT]]) could provide a more meaningful and easier to understand display of search results.  

---+++ A Problem with Early Proposals

One convention that has been proposed, and which is already in some use, involves splitting the description of the underlying collection and the description of the services that access it into separate resources (e.g. [[%PUBURL%/%WEB%/InterOpMay2013Registry/IVOA-RWG-upgrades-rofr.pdf][see slide 5, "A Common Registration Pattern" of Plante's May2013 presentation]]).  That is,
   * a data collection would be registered as a =DataCollection= resource; this description would include all of the science-related information (including table column descriptions, if applicable).  
   * services that access the collection would be registered collectively but separately (as a =DataService= or =CatalogService=)  
   * relationship links would connect the collection with its services.

At the [[InterOpSep2013Registry][Sept. 2013 Interop]], Markus Demleitner reported on some important disadvantages this approach presents for some simple but important search use cases using the TAP interface (see [[%PUBURL%/%WEB%/InterOpSep2013Registry/regtap.pdf][slide 6, "Uneasy Relationships", and beyond of his presentation]] for details).  In particular, separating the access metadata (in the Service resource) from the science metadata (in the =DataCollection= resource) requires some fairly complex joining using relationship information.  It is easy to argue that under this convention, certain simple queries cannot be expressed simply.

---+++ Search Requirements and Use Cases

A well-annotated description of requirements and use cases for searching for resources is available via the RestfulRegistryInterfaceReq page.  There are a few use cases that are particularly relevant to this discussion:

   1. We want to be able to find resources based on scientific topics.  This requires that resource descriptions include specific words about the science behind the data.  
      * _RayPlante notes_:  this suggests that data collections be registered at a somewhat fine-grained level (a la Vizier catalogs).
   2. We want to be able to find all services supporting a particular data access protocol (e.g. all SIA services or all TAP services)
   3. Combination of 1. and 2.:  find all collections related to a particular science topic that support a particular protocol (e.g. find molecular cloud observations accessible via SSA); _return the accessURLs_.  

It is the third use case, in which we need the accessURL (access metadata) along with the science metadata (e.g. description, column information), that makes queries messy when the science and access metadata are in different resources.  
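To make the cost of the split concrete, here is a minimal Python sketch of use case 3 against toy in-memory records.  The record layout, the field names, the identifiers and the URLs are all assumptions made for illustration; they are not the VOResource schema, and a real registry would be queried through its search interface rather than filtered in memory.

```python
# Toy registry records. Field names and identifiers are illustrative
# assumptions, not the VOResource schema.
SPLIT = [
    {"id": "ivo://example/mc", "type": "DataCollection",
     "subjects": ["molecular clouds"], "capabilities": [], "relationships": []},
    {"id": "ivo://example/mc/svc", "type": "DataService",
     "subjects": [],
     "capabilities": [{"protocol": "SSA", "accessURL": "http://example.org/ssa"}],
     "relationships": [("service-for", "ivo://example/mc")]},
]

UNIFIED = [
    {"id": "ivo://example/mc", "type": "CatalogService",
     "subjects": ["molecular clouds"],
     "capabilities": [{"protocol": "SSA", "accessURL": "http://example.org/ssa"}],
     "relationships": []},
]


def find_unified(records, subject, protocol):
    """Science and access metadata share a record: one pass is enough."""
    return [cap["accessURL"]
            for rec in records if subject in rec["subjects"]
            for cap in rec["capabilities"] if cap["protocol"] == protocol]


def find_split(records, subject, protocol):
    """Science and access metadata are in different records: join via relationships."""
    # Step 1: collections matching the science topic.
    collections = {rec["id"] for rec in records if subject in rec["subjects"]}
    urls = []
    # Step 2: services whose "service-for" links point at one of those collections.
    for rec in records:
        targets = {target for role, target in rec["relationships"]
                   if role == "service-for"}
        if targets & collections:
            urls += [cap["accessURL"] for cap in rec["capabilities"]
                     if cap["protocol"] == protocol]
    return urls
```

Both functions return the same accessURL for their respective layouts, but only the split layout needs the extra relationship-following step.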

---++ A Revised Proposal: a single resource per collection

We can avoid the joining mess described above if the science and access metadata are included in the same resource description.  Doing so would also make it easier to interpret search results.  An apt proposal, then, would be to combine the role of the =DataCollection= and =DataService= (or rather =CatalogService=) into a single resource type.  

In detail the proposal would be described as follows:
   * Every data provider is registered via an =Organisation= resource (and their authorities registered separately as well)
   * Each data collection published by the provider is registered with a resource type<sup>*</sup> that includes both science metadata describing what is in the collection and =capability= elements for each of the services that access the collection.  
   * =relationship= elements would be used to connect it to other related collections.  In particular, if a collection is a mirror of, or derived from, another collection, it would include a relationship that points back to the source collection. 
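The shape of such a combined record can be sketched with Python's standard =xml.etree.ElementTree= module.  This is a simplified illustration only: namespaces, curation metadata and =xsi:type= are omitted, the element layout is abbreviated and not schema-valid, and the function name, identifiers and URLs are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_collection_record(ivoid, title, subjects, capabilities, derived_from=None):
    """Assemble one simplified record holding science and access metadata together."""
    res = ET.Element("Resource")
    ET.SubElement(res, "identifier").text = ivoid
    ET.SubElement(res, "title").text = title

    # Science metadata: subject keywords describing what is in the collection.
    content = ET.SubElement(res, "content")
    for subject in subjects:
        ET.SubElement(content, "subject").text = subject

    # Optional link back to the source collection.
    if derived_from is not None:
        rel = ET.SubElement(content, "relationship")
        ET.SubElement(rel, "relationshipType").text = "derived-from"
        ET.SubElement(rel, "relatedResource").text = derived_from

    # Access metadata: one capability per service that accesses the collection.
    for standard_id, access_url in capabilities:
        cap = ET.SubElement(res, "capability", {"standardID": standard_id})
        iface = ET.SubElement(cap, "interface")
        ET.SubElement(iface, "accessURL").text = access_url
    return res


record = build_collection_record(
    "ivo://example/mc-survey",
    "Example Molecular Cloud Survey",
    ["molecular clouds"],
    [("ivo://ivoa.net/std/SSA", "http://example.org/ssa")],
    derived_from="ivo://example/original-survey",
)
xml_text = ET.tostring(record, encoding="unicode")
```

A client holding this one record has both the subject keywords and the accessURL, so no second lookup is required.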

---+++ <sup>*</sup>The Collection Resource Type and the Evolution of a Resource

There is some question as to the "proper" resource type we should use--that is, whether we can use an existing type or we would need to define a new one. We observe that the existing =CatalogService= resource type provides all of the metadata required to describe a collection and its services.  Re-using this type would obviously have no impact on the TAPReg schema.  Still, the semantics of this type, one could argue, do not quite match that of a "collection".  

For example, we allow (and in fact encourage) providers to register collections even before there are any services available to access them.  Perhaps there is only a web page (which would be given via the =referenceURL= element).  Perhaps the collection is a simple catalog that is downloadable as a single file but not yet searchable via ConeSearch.  Semantically, this resource has not yet risen to the status of a "service".  Nevertheless, a =CatalogService= resource record is not _required_ to have any =capability= elements, so syntactically, a =CatalogService= resource would work just fine.  

There is one feature of a =DataCollection= that is not strictly present in a =CatalogService=:  a =DataCollection= provides a special =accessURL= element for accessing the collection as a whole (e.g. as a single file or, say, a directory of files).  However, this would be simple to capture in a generic =capability= if need be.  (It would be worth defining a standardID URI to identify it; a capability extension would not be necessary.)
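As a rough sketch of how a generic capability could carry the whole-collection URL, consider the following Python fragment.  The standardID value here is invented purely for illustration (no such identifier has been defined), and the dictionary layout and function names are likewise assumptions.

```python
# Hypothetical identifier, invented for this sketch; no such standardID
# has actually been defined.
WHOLE_COLLECTION_ID = "ivo://example/std/whole-collection"


def add_whole_collection_access(record, access_url):
    """Attach a generic capability pointing at the collection as a whole."""
    record.setdefault("capabilities", []).append(
        {"standardID": WHOLE_COLLECTION_ID, "accessURL": access_url})
    return record


def whole_collection_url(record):
    """Return the whole-collection URL, or None if the record does not offer one."""
    for cap in record.get("capabilities", []):
        if cap.get("standardID") == WHOLE_COLLECTION_ID:
            return cap["accessURL"]
    return None


record = add_whole_collection_access(
    {"id": "ivo://example/simple-catalog"},
    "http://example.org/catalog.fits",
)
```

A client would recognize the download capability by its standardID alone, which is why no capability extension type would be needed.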

If we felt that semantics were important (or we had other reasons not to use the =CatalogService= type in this role), defining a new type need not be difficult since no new metadata elements need to be defined.  (It might look as simple as [[%ATTACHURL%/DataCollection-v1.0.xsd][this]].)  For this reason, the impact on existing registries would likely be minimal.  One reason to define a new type (that otherwise looks just like =CatalogService=)--let's call it =DataCollection2= --would be to provide an explicit signal to clients that a resource complies with the unified convention.  In particular, if one found a =DataCollection2= resource, one could be certain that there are no separate resources describing its services, whether it has =capability= elements or not. 

So, we conclude that we have two choices for registering data collections and their services together:
   * use =CatalogService=, ignoring the semantic inaccuracies
   * define a new resource type (e.g. =DataCollection2=) with the correct semantics but the same contents  as =CatalogService=.

_For purposes of discussion below, we will refer to a collection resource registered according to this proposal as a =DataCollection2= resource, regardless of whether it is represented as =CatalogService= or as some new resource type._

---+++ Services accessing Multiple Collections

There exist several examples in which a provider puts up a single service to access many collections.  For most of our DAL services, this fact is not important to clients.  An exception, however, is TAP:  a TAP service that can access several catalogs means the user can make ADQL joins across these tables.  This is a fact that is potentially useful to users.  

Related to this is an important [[RestfulRegistryInterfaceReq][search use case]]: find me all TAP services (i.e. with no other qualifications).  This might be used by a specialized client application that can query an arbitrary TAP service; it may wish to list all available TAP services as a "pick-list".  This is one place where you might want to see that a single service can access many collections.

To address this case, we propose:
   * Each TAP-accessible catalog is registered separately as described above to segregate the science metadata.  Each will include a description of the tables and columns in the particular catalog.
   * A separate =CatalogService= record is provided to represent the TAP service as a whole.  It includes its TAP =capability= entry along with a "service-for" =relationship= element pointing to each of the individual catalog resources.
   * The TAP =capability= entry from the aggregating TAP service is replicated into all of the individual catalog resources.  
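The three steps above can be sketched in Python as follows.  The dictionary layout, function names, identifiers and URLs are assumptions for illustration.  The =tap_pick_list= helper shows one possible way (not part of the proposal itself) for a client to build a pick-list in which the replicated capabilities collapse back to a single entry per endpoint.

```python
import copy

TAP_STANDARD_ID = "ivo://ivoa.net/std/TAP"


def register_tap_service(service_id, access_url, catalogs):
    """Return the aggregate TAP record plus catalog records carrying a copy of its capability."""
    tap_capability = {"standardID": TAP_STANDARD_ID, "accessURL": access_url}
    service = {
        "id": service_id,
        "capabilities": [dict(tap_capability)],
        # One "service-for" link per catalog reachable through this service.
        "relationships": [("service-for", cat["id"]) for cat in catalogs],
    }
    updated = []
    for cat in catalogs:
        cat = copy.deepcopy(cat)  # leave the caller's records untouched
        cat["capabilities"].append(dict(tap_capability))
        updated.append(cat)
    return service, updated


def tap_pick_list(records):
    """Distinct TAP endpoints; replicated capabilities collapse on their accessURL."""
    return sorted({cap["accessURL"]
                   for rec in records for cap in rec["capabilities"]
                   if cap["standardID"] == TAP_STANDARD_ID})


catalogs = [
    {"id": "ivo://example/cat1", "capabilities": [], "relationships": []},
    {"id": "ivo://example/cat2", "capabilities": [], "relationships": []},
]
service, catalog_records = register_tap_service(
    "ivo://example/tap", "http://example.org/tap", catalogs)
```

Because every catalog record carries its own copy of the TAP capability, a topic-plus-protocol search still needs no joins, while the aggregate record keeps the list of catalogs that can be joined in ADQL.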

---+++ Archives Hosting Multiple Collections

There are many cases in which a single archive hosts multiple collections.  In this section, we describe how to apply the proposed convention to such archives.  We consider three cases:
   1.  A  data center that hosts archives for several missions or observatories (e.g. the IPAC case)
   2.  A repository containing many astronomer-published collections of processed data (e.g. the DataVerse case). 
   3.  An observatory archive containing raw or pipeline-processed products from many guest-observer observations (e.g. the Chandra or HST case)

In practice, these cases may overlap and so the publisher would use judgement as to how to combine these cases appropriately.  

How much detail is provided in the resource registrations is another choice of the provider.  Below, we describe the most detailed representations; however, in practice a publisher may start more simply and add more resources/detail as person-resources allow.  

---++++ Case 1: The Data Center

A canonical example of this case might be NASA IPAC which hosts archives for several NASA missions as well as the heterogeneous information system, NED.   In the spirit of the proposed convention, the center's assets would be registered as follows:

   * the data center itself would be registered as an =Organisation=; this resource would be referred to as the publisher of all of the data resources.
   * each mission archive would be registered as a separate =DataCollection2=.  It would include a =capability= for each standard and archive-specific service that can access that collection.
      * any NED- or SIMBAD-like resource would also be represented as a =DataCollection2=; its custom services could be enumerated as capabilities.
   * if there is some service that can search across all of the archives, it could be represented by either an "uber" =DataCollection2= resource (representing the union of all of the collections) or simply as a =CatalogService=.  The individual mission resources would have "part-of" =relationship= entries that point to this "uber" resource.
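The resulting set of records for a data center might be laid out as in the following Python sketch.  The dictionary fields, the function name and all identifiers are hypothetical; capabilities are reduced to protocol names for brevity.

```python
def register_data_center(org_id, mission_capabilities, uber_id=None):
    """Build simplified records for a data center and its mission archives."""
    # The data center itself: publisher of every data resource below.
    records = [{"id": org_id, "type": "Organisation"}]

    # Optional "uber" resource representing the union of all collections.
    if uber_id is not None:
        records.append({"id": uber_id, "type": "DataCollection2",
                        "publisher": org_id,
                        "capabilities": [], "relationships": []})

    # One DataCollection2 per mission archive, each listing its own services.
    for mission_id, capabilities in mission_capabilities.items():
        relationships = [("part-of", uber_id)] if uber_id is not None else []
        records.append({"id": mission_id, "type": "DataCollection2",
                        "publisher": org_id,
                        "capabilities": list(capabilities),
                        "relationships": relationships})
    return records


records = register_data_center(
    "ivo://example.center",
    {"ivo://example.center/mission-a": ["SIA", "TAP"],
     "ivo://example.center/mission-b": ["ConeSearch"]},
    uber_id="ivo://example.center/all",
)
```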

---++++ Case 2: The Data Repository

The canonical example of this case might be a DataVerse repository or the ADIL where individual astronomers or research groups might deposit data for publishing into the VO.  Their data would be related by some common science investigation we might refer to as a "project".  

This would be represented in a manner similar to the Data Center case where we substitute "mission" with "project":

   * The institution running the repository would have an =Organisation= record that would serve as the repository's publisher.  
   * The repository would be represented by a =DataCollection2= resource.  
      * It would include a =Type= element set to "Archive".  
      * Any service that is capable of searching across the entire repository would be listed as a capability.
   * Each collection within the repository would be represented by a =DataCollection2= resource.  
      * Each would contain science metadata specific to that collection.  
      * It would include a =capability= element for each service that accesses the collection (or part thereof)

---++++ Case 3: The Guest Observatory Archive

A "Guest Observer" facility is an observatory that grants observing time to astronomers via some proposal mechanism and carries out heterogeneous observing plans according to the approved proposals.  The associated archive may contain the raw data and/or pipeline-processed data.  If the data archive has access to the proposal database, the archive could be treated just like a data repository described in Case 2; the "proposal" would represent the "project" and would be the source of the science metadata.  

Reasons that a publisher would _not_ want to represent the archive like a data repository might include: 
   *  by policy or due to technical reasons, the proposal information is not available.
   *  the number of proposals is too large to effectively curate/maintain as separate resources
   *  the dynamic nature of the archive--the fact that new projects are being added all the time--makes it impractical to curate/maintain the proposal-projects as separate resources

Assuming that the publisher _does not_ want to register each proposal project separately, the recommended way to register would be as follows:

   *  The observatory is registered as an =Organisation= to serve as the publisher of the archive.
   *  The archive is represented as a =DataCollection2= resource.  
      * It would include a =Type= element set to "Archive".  
      * Each service for accessing data in the archive is listed as a =capability=.  

-- IVOA.RayPlante - 2013-10-22
--------

---+ Comments/Discussion

<!--
      * Set ALLOWTOPICRENAME = IVOA.TWikiAdminGroup
-->
Topic attachments
| *Attachment* | *Size* | *Date* | *Who* | *Comment* |
| DataCollection-v1.0.xsd | 2.2 K | 2013-10-23 - 09:40 | RayPlante | a possible VOResource extension defining DataCollection2 |
Topic revision: r4 - 2013-10-23 - RayPlante