Discussion on the IVOA WD on ObsTAP and the ObsCore data model

  • List of current releases during the WG review
    • release v1.0-20110228: docx pdf (current draft with pointers to topics to discuss)
    • release v1.0-20110305-2: docx pdf

The authors could not agree on some important points; others were still being worked on at the time the current draft of the specification was released. It was therefore decided to post them on the twiki (see pages below) for an open discussion. We need to finalize these soon and we ask for your help. Below is the list of topics, with the two different perspectives among the co-authors indicated for each.

Please provide your comments not in this page, but in each of the topic pages listed below:

  1. dataproduct_type: 'other' and subtype. Click here for the discussion page
    • Option 1.A: dataproduct_type should allow 7 well-defined types (the first seven entries in the list below) plus 'other', for extensibility, so that science data products which do not fit into any existing category can still be represented.
      • image
      • cube
      • spectrum
      • sed
      • timeseries
      • visibility
      • event
      • other
    • The additional and optional 'subtype' attribute would allow the data provider to specify in more detail, in data-provider specific terms, what each data product is (subtype would be free format). Without the 'other' category there is no way to expose archive science data products which do not fit into any of the predefined categories, e.g., to fully represent all archive science data products or to prototype extension of the predefined types to new types of data. This would not interfere with global data discovery where one-and-the-same-query is posed to many services (since the standardized categories are still there and would be used where applicable) but would allow more focused queries to be posed to specific archives, allowing the VO to be used for science archive data access as well as global data discovery. In addition, the 'subtype' field would allow users to understand in more collection-specific terms what each data product actually is.
    • Option 1.B: It shall contain the first 7 types, but shall not contain 'other' nor a free format subtype because (1) this was not part of the requirements for ObsTAP (not a single use case covered this) and (2) 'other', along with a free format subtype, would open up the possibility for data providers to come up with their own schemas, therefore defeating the main idea of ObsTAP, that is, the possibility to send one-and-the-same-query to all ObsTAP services. [Note: the proponents of having an 'other' category to permit extension do not think this possible abuse would actually be a problem since the subtype attribute already provides sufficient means to more fully specify the scientific type of a data product.] 'other' could correspond to different types in different archive implementations.
    • In either case if other types are needed we could eventually amend the document and extend the list of standard data product types without breaking existing services.
  2. Best UTYPE string-format: for humans (camelCase), for DB systems (lowercase) Click here for the discussion page
    • Option 2.A: UTYPEs shall be treated as case-insensitive strings, favouring camelCase for readability. This continues the current situation in VO where Utypes are already specified to be case-insensitive; doing otherwise would invalidate standards which have been in place for several years, including both existing standards documents and implementations. Scientifically it is clearly safer to consider (for example) DataID.CreatorDID and DataId.CreatorDid (or various other permutations) to refer to the same data model attribute. In no case would we actually have tokens such as these which differ only in case, so performing a case-insensitive comparison is clearly the safest approach. Given that Utypes are used all the way up into science applications code, and may pass through many layers of software before being used by a service, it is difficult to guarantee that case will be preserved, hence a case-insensitive comparison is wisest.
    • Option 2.B: UTYPEs shall be lowercase to avoid confusion and to avoid extra code to handle all possible cases in clients, DBMSes, etc. A case-insensitive string cannot be searched efficiently in all DBMSes. E.g. the clause WHERE LOWER(utype)='abc.def' forces a table scan (no index can generally be used). [Note: some of us do not agree with this assertion; performing efficient searches on case-insensitive strings is very doable, and in fact is done every day. Although the support for case-insensitive searches in DBMSes is variable, we have not yet found a case where it cannot be done efficiently, nor should this technical issue be the driver when scientific analysis will be at risk if data model elements cannot be identified merely due to user confusion over upper or lower case terms.]
  3. obs_title data model field: free format? Click here for the discussion page
    • Option 3.A: obs_title shall be an optional data model field, a free format string that allows data providers to describe any given dataset in more detail than otherwise possible, e.g.:
      • dataproduct_type='image'
      • obs_title='Stokes I continuum image at 1420 MHz'
    • Without this it could be very difficult for a human user to understand what a specific data product is. While the quantitative fields are essential for global discovery there is still a need for a human looking at the results of a search to understand what a specific data product is. Furthermore obs_title is already a mandatory field for all existing DAL interfaces, e.g., SIA, SSA. If not a mandatory field of ObsCore it should at least be an optional field. This field has long been free format in the other DAL interfaces (and in FITS before that) with no problem; it does not matter since it is merely a descriptive field intended for human consumption. Note also that even within the TAP_SCHEMA all schema elements (schemas, tables, columns, etc.) have a description field which is free format and this has never posed a problem.
    • Option 3.B: It is not clear what the requirements are, and a completely free format field could be an issue. There is no time now to study this. Let's leave it optional until we clarify requirements and usage. [Note: the proponents of obs_title already agreed that it could be optional. The main thing is to get it into the output metadata so that the user can understand what a specific data product is.]
  4. access_format Click here for the discussion page
    • This is still being defined.
  5. Usage of obs_publisher_did Click here for the discussion page
    • This attribute, and the general issue of dataset identifiers, is well defined and has already been in use in existing standards for several years (e.g., ADS dataset identifiers, IVOA dataset identifiers as defined by the Registry, publisher dataset identifiers and others as used in SSA and Spectrum). However, some of the authors felt that discussion of its usage was needed. One serious issue is that the standard currently specifies that obs_publisher_did cannot be used in a WHERE clause, which defeats the whole idea of a dataset identifier being used to persistently identify and later retrieve or access specific datasets.
  6. Discovery of Polarization data Click here for the discussion page
    • About the importance of polarization data
  7. em_min and em_max: in the current ObsTAP proposal they are allowed to be NULL. What does it entail? Click here for the discussion page
  8. Suggestions for the draft SugText
  9. Suggestions for new UCDs to support ObsTAP: Click here for the discussion page
  10. Discovery of radial-velocity/redshift data: Click here for the discussion page
  11. Solar System: Click here for the discussion page

and feel free to add a new item when needed. Thanks.
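
To make the two positions in topic 1 concrete, here is a minimal Python sketch of how a validator might check dataproduct_type under each option. The function name is_valid_dataproduct_type, the option parameter and the example subtype values are hypothetical and not part of the draft; only the vocabulary comes from the list above.

```python
# The seven standard types listed under topic 1.
STANDARD_TYPES = ("image", "cube", "spectrum", "sed",
                  "timeseries", "visibility", "event")


def is_valid_dataproduct_type(value, subtype=None, option="1.A"):
    """Check a dataproduct_type value under Option 1.A or Option 1.B.

    Hypothetical helper: 1.A accepts the seven standard types plus 'other'
    and any free format subtype; 1.B accepts only the seven standard types
    and no subtype at all.
    """
    if option == "1.A":
        return value in STANDARD_TYPES or value == "other"
    if option == "1.B":
        return value in STANDARD_TYPES and subtype is None
    raise ValueError("option must be '1.A' or '1.B'")


# Illustrative values only; 'pulsar-profile' is a made-up subtype.
print(is_valid_dataproduct_type("other", subtype="pulsar-profile"))  # Option 1.A
print(is_valid_dataproduct_type("other", option="1.B"))              # Option 1.B
```

Under either option a query restricted to one of the seven standard values behaves the same on every service; the options differ only in what else a provider may expose.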

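On topic 2, the following sketch shows one way a case-insensitive UTYPE lookup can be served from an index. It uses SQLite through Python's standard sqlite3 module purely as an example; it says nothing about other DBMSes, where support varies as noted above. The table name columns_meta, the index name idx_utype and the rows are assumptions made up for the illustration.

```python
import sqlite3


def utype_equal(a, b):
    """Client-side comparison of two UTYPE strings, ignoring case (Option 2.A)."""
    return a.casefold() == b.casefold()


conn = sqlite3.connect(":memory:")
# Hypothetical metadata table. COLLATE NOCASE makes comparisons on this
# column case-insensitive for ASCII letters, and the index inherits it.
conn.execute(
    "CREATE TABLE columns_meta (column_name TEXT, utype TEXT COLLATE NOCASE)")
conn.execute("CREATE INDEX idx_utype ON columns_meta (utype)")
conn.executemany(
    "INSERT INTO columns_meta VALUES (?, ?)",
    [("col_a", "DataID.CreatorDID"),   # illustrative rows only
     ("col_b", "DataID.Title")])

query = "SELECT column_name FROM columns_meta WHERE utype = ?"
rows = conn.execute(query, ("dataid.creatordid",)).fetchall()

# Ask SQLite how it would run the query; the last field of each plan row
# is a human-readable description of the step.
plan = conn.execute("EXPLAIN QUERY PLAN " + query,
                    ("dataid.creatordid",)).fetchall()
plan_text = " ".join(str(step[-1]) for step in plan)

print(rows)
print(plan_text)
```

In this setup the all-lowercase search string matches the camelCase stored value and the reported plan names idx_utype instead of a full table scan. The clause WHERE LOWER(utype)='abc.def' quoted in Option 2.B is a different formulation and would need its own treatment in each DBMS.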

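On topic 7, one consequence of allowing NULL in em_min and em_max follows directly from SQL three-valued logic: a row with NULL limits is returned neither by a range constraint nor by its negation. The sketch below demonstrates this with SQLite through Python's sqlite3 module. The table name obscore, the obs_id values and the numbers are made up for the illustration; limits are in metres, as in the current draft.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obscore (obs_id TEXT, em_min REAL, em_max REAL)")
conn.executemany(
    "INSERT INTO obscore VALUES (?, ?, ?)",
    [("optical", 4.0e-7, 7.0e-7),       # illustrative rows only
     ("radio", 0.20, 0.22),
     ("unknown_band", None, None)])     # spectral limits not provided

target = 5.5e-7  # wavelength of interest, in metres


def ids(where):
    """Return the obs_id values selected by a WHERE clause with two '?' slots."""
    sql = "SELECT obs_id FROM obscore WHERE " + where + " ORDER BY obs_id"
    return [row[0] for row in conn.execute(sql, (target, target))]


covering = ids("em_min <= ? AND em_max >= ?")
not_covering = ids("NOT (em_min <= ? AND em_max >= ?)")
unknown = [row[0] for row in conn.execute(
    "SELECT obs_id FROM obscore WHERE em_min IS NULL OR em_max IS NULL")]

print(covering, not_covering, unknown)
```

The row with NULL limits appears only in the third list, so a client that wants to see datasets with unknown spectral coverage has to ask for them explicitly with IS NULL.
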
I am a bit bothered by Collection and I don't know where it came from. If I read the draft correctly, this is an amalgam of datacenter, observatory, telescope, and instrument. This is not necessarily how repositories organize their data. It would be cleaner to keep the concepts separate.

I still object to forcing spectral limits in m. If we really (really) want to force a single unit, Hz makes much more sense. Wavelength is unnatural for frequency as well as energy-based data.

I am not sure that bibliographic references are helpful since they are likely to be incomplete and unevenly covered. Better leave that to bibliographic services.

Event lists are not by themselves multi-file. However, many datasets are by their nature multi-file. What do we do then with the access format, or even datatype?

It may be good to say explicitly that exposure time also takes into account deadtime (I mean, excludes it).

Release date does not explicitly say "public release" which is confusing. One has to read much further to find out that that is really what is meant.

-- ArnoldRots - 22 Mar 2011


Topic attachments
  • WD-ObsCore-v1.0-20110228-1pdf.docx (348.2 K, 2011-02-28 20:28, MireilleLouys): ObsTAP draft current version
  • WD-ObsCore-v1.0-20110228-1pdf.pdf (1716.2 K, 2011-02-28 20:29, MireilleLouys): pdf version
  • WD-ObsCore-v1.0-20110305-2.docx (330.7 K, 2011-03-08 09:48, MireilleLouys): ObsCoreDM WD -Word2010
  • WD-ObsCore-v1.0-20110305-2.pdf (1339.6 K, 2011-03-08 09:47, MireilleLouys): ObscoreDM WD
Topic revision: r19 - 2011-03-22 - ArnoldRots
 