DMTM: Data Models Technical Meeting, CfA 2002 Oct 11-12

Notes by Jonathan McDowell and Janet De Ponte Evans

These notes should be read in conjunction with the summary presentation
made by Jonathan McDowell to the IVOA Interop meeting on Oct 17, which
contains the meat of the meeting's results.

Attending:
   NVO/CfA:        Jonathan McDowell
                   Ian Evans
                   Arnold Rots
                   Janet DePonte Evans
                   Pepi Fabbiano (part of time)
                   Martin Elvis (part of time)
   NVO/UIUC:       Ray Plante
   NVO/NRAO:       Doug Tody
   AVO/Strasbourg: Mireille Louys
   Astrogrid/RAL:  Dave Giaretta
   Astrogrid/LUX:  Clive Page
   Astrogrid/LUX:  Patricio Ortiz

We met to discuss VO data modelling efforts - the abstract
representations of astronomical data - and to work on ways to ensure
that the systems developed by the IVOA members are interoperable. The
common idea was that we should carefully define and describe what our
data are at a conceptual level before pressing forward with software
implementations (otherwise, you can end up with a lot of missed
opportunities for commonality, for instance).

We tentatively agreed on a few things: a possible process for DM
development, what the minimum requirements on VO DMs should be, and
what will and won't be called an "image". I've outlined these bits in
asterisks. Nevertheless these should in no way be construed as final
decisions or standards, since no DMs have yet gone through such a
process.

The first morning was mostly dedicated to presentations.

Dave Giaretta reported on his Astrogrid model, presenting UML diagrams
of the model (see his presentation). He discussed the proposed HDX
object, which is an abstraction and generalization of Starlink's HDS
and FITS, containing metadata, structured objects, and the special NDX
object for an n-dimensional image stack with pixel array, quality
array, bad pixel mask, coordinates, history, title, units and
user-defined extensions.

Mireille Louys presented the IDHA image model that she and her
Strasbourg colleagues are working on. It is oriented to 2-dimensional
data but they are hoping it is extensible. She also commented on UCDs
and their current limitations (flat name tags; the concepts need
modelling to arrange them).

Ian Evans presented ideas based on the work done at CfA, describing a
framework for data models which allows extensibility and linkage
between objects.

Ray Plante presented on Metadata Frameworks.

General discussion followed until lunch. D. Tody noted a terminology
difference between the NVO DM draft paper and the proposed ISO 11179
standard: 'descriptor' in the DM draft is an enhanced version of
'attribute' in ISO 11179 (perhaps we could use 'science attribute'),
and 'descriptor' in ISO 11179-3 is used to mean what I would be
tempted to call an attribute of the descriptor - things like the name,
cardinality, definition, etc. [Well, my 'descriptor' terminology for
the CXC DM dates back to 1995; I can't help it if the more recent ISO
11179 is getting it wrong :-)]

Friday afternoon: Modelling exercise on 'spectral bandpass'.

 o Proposed attributes:
     min wavelength
     max wavelength
     effective wavelength
     identifier
     transmission as a function of wavelength (measured or modelled)
     name (specific: J-Arizona; generic: J; waveband: "Near IR")

 o Data quality needs to be associated with all data - a complex
   quality object needs to be modelled. Also uncertainties, etc.

 o Questions to ask:
   The list of questions (methods) we came up with were basically
   access functions to the attributes listed above.
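As a purely illustrative rendering of the attributes above, a J-band
bandpass instance might look like the following sketch; all tag names
are invented here, not agreed conventions, and the numbers are rough
J-band values for illustration only:

   <bandpass identifier="J-Arizona">
      <name specific="J-Arizona" generic="J" waveband="Near IR"/>
      <minWavelength unit="nm"> 1100 </minWavelength>
      <maxWavelength unit="nm"> 1400 </maxWavelength>
      <effectiveWavelength unit="nm"> 1250 </effectiveWavelength>
      <!-- transmission as a function of wavelength;
           origin is "measured" or "modelled" -->
      <transmission origin="measured">
         <point wavelength="1100" value="0.02"/>
         <point wavelength="1250" value="0.91"/>
         <point wavelength="1400" value="0.03"/>
      </transmission>
      <!-- the quality and uncertainty objects noted above would
           attach here once modelled -->
   </bandpass>

The access-function methods we listed would simply return these
attributes from an instance like this.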
We noted also a requirement for multiplication and addition of
bandpass transmissions.

 o Problems representing the bandpass associated with an observation:
   red leaks, periodic filters, wedge and band-stop filters, multiple
   filters, atmospheric transmission.

We grouped the attributes into three groups: (A) the filter response
(most detailed), (B) the min/max/eff. lambda (less detailed), and (C)
the waveband name (least detailed?). There was some argument about
whether these were different (and differently detailed)
representations of the same information, or whether they were
fundamentally different; i.e. whether we should have separate classes
(A), (B), (C) which were specializations of a generic bandpass class,
or whether we should have a single class which has all of (A), (B),
(C) as attributes.

We later (I think on Saturday afternoon) revisited this and started
the usual war on which of wavelength/frequency/energy to use. This
turned into a model for wavelength/etc. with attributes of mode (which
of wavelength, frequency, energy to use), value and unit.

Possible XML:

   <spectralQuantity>
      <mode> Wavelength </mode>
      <value> 512.02 </value>
      <unit> nm </unit>
   </spectralQuantity>

The same model might be rendered in XML by more convenient tags:

   <wavelength unit="nm"> 512.02 </wavelength>

although it would be nice if the tags clearly reflected the model
being used.

Friday afternoon: Towards a process for the IVOA

Decided to use 'Convention' instead of 'Standard': standard implies an
authoritative body; convention is less rigid and more democratic?

What does it mean to have a standard for data models??

RPlante -- have to describe the model in a way that a program can
verify that it has the things in it that a VO DM needs to have.

JCM -- we need real examples in XML.

What are the components of the DM that we need to have?

DTody -- a name, then a white paper that describes the DM and defines
its attributes. Then an XML schema that can be verified. Then we get
to representations in whatever language.

A conforming DM should have certain things in it:

*********************************************
 Proposal for IVOA Conforming DM
*********************************************

 unique URI
 unique name       -- defines name space
 version           -- (or use separate name space?)
 description
 white paper (url)
 curation metadata
 class descriptions:
   class: is a type and doesn't have a value
     . Unique name within the model (namespace)
     . Description of class
     . Properties - relationships between one class and another
         ?inherits -- child of, simplified version of, ...
         (allowed relationships to be determined)
     . Attributes
         - name and definition
         - type (UCD?)
         - allowed values, default, null
         - cardinality (0, 1 or more..)
     . Methods - list names and interfaces
     . Abstract vs. concrete class?

*********************************************

The process for turning a data model into software:

   UML (diagram) -> XMI (XML interchange format) -> Schema (XML software)

Many schemas can represent the same UML/XMI. There will be a default
schema implied by the conversion software, but this may not reflect
how astronomers want to deal with the objects. So, there may be human
intervention to design the implementation of each data model.
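To make that final step concrete, here is a minimal sketch of what a
hand-designed, verifiable XML Schema for the spectral quantity model
above might look like; the element names follow the illustrative ones
used earlier and are not agreed conventions:

   <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
     <!-- one class from the model: a spectral quantity with
          mode (wavelength/frequency/energy), value and unit -->
     <xsd:element name="spectralQuantity">
       <xsd:complexType>
         <xsd:sequence>
           <xsd:element name="mode">
             <xsd:simpleType>
               <xsd:restriction base="xsd:string">
                 <xsd:enumeration value="Wavelength"/>
                 <xsd:enumeration value="Frequency"/>
                 <xsd:enumeration value="Energy"/>
               </xsd:restriction>
             </xsd:simpleType>
           </xsd:element>
           <xsd:element name="value" type="xsd:double"/>
           <xsd:element name="unit"  type="xsd:string"/>
         </xsd:sequence>
       </xsd:complexType>
     </xsd:element>
   </xsd:schema>

A validating parser run against reference instances would then be the
program-level verification RPlante asked for.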
Agreed process of deploying a data model:

****************************************************************************
 Proposal for IVOA Data Model Development Process
****************************************************************************

 1) Make text white paper

 2) Make UML diagram

 3) Implement a VO conforming (compliant) data model
    - must satisfy above description
    - groups involved should coordinate
    - doesn't have to be "approved" by IVOA

 4) Optional step for DMs considered to have good extensibility,
    generality, commonality with other VO-recommended DMs, etc:
    adopt as 'VO conventional' or 'VO recommended' DMs
    - IVOA Interop (?) to approve
    - Must have reference XML Schema implementation
    - Must have reference XML instances (to make it clear what the
      intent is)

****************************************************************************

We didn't make this explicit, but pending a software registry I would
encourage developers to register each stage of this process by sending
email to dm@us-vo.org.

We decided not to constitute an approval committee or standards body,
at least at this time. Suggested the W3C RFC model as preferable to
the FITS standardization process. Suggested we need to define a common
(consistent) language for attribute names - a vocabulary registry?

===== Start of Saturday ======

 o Ideas for today:
     Comparison of model work -- 1 hr
     -- break -- 10:45
     Interoperability report
     -- lunch --
     Pursue the bandpass model

We revisited the text of the 'Compliant DM' agreement. Suggested that
'adopted model' would be a good form of words for those preferred by
the collaboration.

 o DM discussion: What is data??
     - hypermaps (images, tables, ....)
     - documents
     - code
     - plots
     - links

   We'll concentrate on the usual image/table/structure data, but
   recall that a true data bundle may contain these other kinds of
   thing.

 o Ray's rule: Never use 'is-a' when 'has-a' will do.

 o Clive's pithy quote: Any software problem can be solved by adding a
   layer of abstraction, except the problem of having too many layers
   of abstraction.

 o DTody: a coordinate system should stand alone by itself.

 o RPlante: need for a property 'derived from'; the general thing is a
   dataset.

   There are 2 classes for observation in Mireille's diagram, raw and
   processed - do we want these to be more the same? (We want to use
   general image methods on raw data: possibly "raw observation" and
   "processed observation" are really "metadata for raw observation"
   and "observation (raw or processed)".)

   Ray argued strongly for a variety of named relationships in the
   modelling process.

 o Compared Dave's and Mireille's models.

   Key object mappings:

     ------------------   ------------------------
     Dave                 Mireille
     ------------------   ------------------------
     data object          raw obs / processed obs
     ndim data            stored data
                          image data
     ------------------   ------------------------

   - commonality:
     JCM argued for the telescope, instrument and detector objects to
     be modelled in a way that reflects their role in the progressive
     journey of the photon through the observation process.

   - differences:
     processing section: describe what has been done with the data
     actual vs. planned metadata
     distinction between processing tools / analysis tools?
     level of processing

   - problems:
     Mireille's model emphasizes the process of stitching together raw
     observations to make a processed observation, and in that model
     the telescope/instrument metadata are kept with the raw
     observation object, rather than being directly a property of the
     processed observation. Although I might claim this doesn't
     represent most current practice (processed observation FITS
     headers contain a lot of telescope metadata), it solves a generic
     problem with merging headers: what do you put for the INSTRUME
     keyword when making a single combined image from, say, HST and
     Chandra data? In general, you have the choice that every keyword
     can turn into an array of values, or that you use a processing
     history to follow links back. This also reminds us to be careful
     to distinguish in our models between things like the software
     pixel in our images and the hardware pixel of the instrument -
     they are logically distinct even when notionally the same size.
     In general, deciding whether to propagate information with a link
     or a copy is a common problem in this kind of work.
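To illustrate the link-back option using Ray's proposed 'derived from'
relationship (the element names and observation identifiers here are
invented for the sketch):

   <processedObservation id="combinedImage01">
      <!-- instead of forcing a single INSTRUME value, follow links
           back to the contributing raw observations -->
      <derivedFrom ref="hstObs"/>
      <derivedFrom ref="chandraObs"/>
   </processedObservation>

   <rawObservation id="hstObs">
      <telescope> HST </telescope>
      <instrument> WFPC2 </instrument>
   </rawObservation>

   <rawObservation id="chandraObs">
      <telescope> Chandra </telescope>
      <instrument> ACIS </instrument>
   </rawObservation>

The copy option would instead replicate (and pluralize) the
telescope/instrument values inside the processed observation itself.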
 o Key issues for images - components of an image

   We discussed what we meant by an 'image'. There are various kinds
   of n-dimensional data which may be considered images or may be
   considered more complicated kinds of dataset.
     - sparse?
     - pixelated? regular?
     - allowed pixel data type?
     - n dimensions?
     - must have RA/Dec?
     - must have >= 1 coord system?
     - spectrum?
     . an image is arbitrary, irregular, real-valued?
     . not a mosaic?
     . need to recognize there are things we aren't experts in and we
       need to model later on.
     . extensible scheme -- can't overspecify or overspecialize

****************************************************************************
 Proposal for IVOA Image Definition
****************************************************************************

 We ended up with the idea that a regular pixel array is key to the
 definition of an image: so, we'll call something an image if it has
   - a regular pixelated array
   - possibly sparsely filled (e.g. with a bad pixel map attached, or
     encoded in a compact pixel list, etc.)
   - whose pixel values are simple scalar data types (integer, real,
     complex; possibly character?; array-valued and object-valued
     pixels are not allowed)
   - n-dimensional, n arbitrary (and in particular not restricted to
     be 2)

 Things that do not satisfy this may be important VO data objects, but
 we won't call them images. Images may have a lot of other things as
 well (particularly coordinate systems, observation metadata, attached
 exposure maps, etc.) and we should model what at least a standard
 subset of these can be, but this is the minimum requirement for
 something to be an image. (A hypothetical XML instance of such a
 minimal image is sketched at the end of these notes.)
****************************************************************************

 o Missing so far:
     -- data quality
     -- uncertainties
     -- calibration

 o Different hierarchies: big-picture diagrams which give object
   hierarchies are useful, but there is no one hierarchy we should
   bless. It's the individual boxes we should have in common, plus a
   common framework within which we can define different hierarchies.

 o Next steps:
     - Work further on commonality of existing models.
     - Flesh out details of individual objects.
     - Eventually, interoperability tests of XML generated from such
       objects.
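Finally, the promised sketch of a minimal image instance satisfying
the definition above - the kind of XML instance those interoperability
tests would exercise. All tag names and values are invented for the
sketch:

   <image id="exampleImage">
      <!-- regular pixelated array: n-dimensional, n arbitrary -->
      <pixelArray ndim="2" type="real">
         <axisLengths> 3 2 </axisLengths>
         <!-- simple scalar pixel values, row by row -->
         <pixels> 1.0 2.5 0.0   4.2 0.0 3.3 </pixels>
      </pixelArray>
      <!-- optional: sparseness expressed via an attached bad pixel
           map (1 = bad) -->
      <badPixelMap> 0 0 1   0 1 0 </badPixelMap>
      <!-- other optional components - coordinate systems,
           observation metadata, exposure maps - would follow -->
   </image>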