DMTM: Data Models Technical Meeting, CfA 2002 Oct 11-12

Notes by Jonathan McDowell and Janet De Ponte Evans

These notes should be read in conjunction with the summary presentation
made by Jonathan McDowell to the IVOA Interop meeting on Oct 17, which
contains the meat of the meeting's results.

Attending:
   NVO/CfA:        Jonathan McDowell
                   Ian Evans
                   Arnold Rots
                   Janet DePonte Evans
                   Pepi Fabbiano (part of time)
                   Martin Elvis (part of time)
   NVO/UIUC:       Ray Plante
   NVO/NRAO:       Doug Tody
   AVO/Strasbourg: Mireille Louys
   Astrogrid/RAL:  Dave Giaretta
   Astrogrid/LUX:  Clive Page
   Astrogrid/LUX:  Patricio Ortiz

We met to discuss VO data modelling efforts - the abstract
representations of astronomical data - and to work on ways to ensure
that the systems developed by the IVOA members are interoperable. The
common idea was that we should carefully define and describe what our
data are at a conceptual level before pressing forward with software
implementations (otherwise, you can end up with a lot of missed
opportunities for commonality, for instance).

We tentatively agreed on a few things: a possible process for DM
development, what the minimum requirements on VO DMs should be, and
what will and won't be called an "image". I've outlined these bits in
asterisks. Nevertheless these should in no way be construed as final
decisions or standards, since no DMs have yet gone through such a
process.

The first morning was mostly dedicated to presentations.

Dave Giaretta reported on his Astrogrid model, presenting UML diagrams
of the model (see his presentation). He discussed the proposed HDX
object, which is an abstraction and generalization of Starlink's HDS
and FITS, containing metadata, structured objects, and the special NDX
object for an n-dimensional image stack with pixel array, quality
array, bad pixel mask, coordinates, history, title, units and
user-defined extensions.

Mireille Louys presented the IDHA image model that she and her
Strasbourg colleagues are working on. It is oriented to 2-dimensional
data but they are hoping it is extensible. She also commented on UCDs
and their current limitations (flat name tags; the concepts need
modelling to arrange them).

Ian Evans presented ideas based on the work done at CfA, describing a
framework for data models which allows extensibility and linkage
between objects.

Ray Plante presented on Metadata Frameworks.

General discussion followed until lunch. D. Tody noted a terminology
difference between the NVO DM draft paper and the proposed ISO 11179
standard: 'descriptor' in the DM draft is an enhanced version of
'attribute' in ISO 11179 (perhaps we could use 'science attribute'),
and 'descriptor' in ISO 11179-3 is used to mean what I would be
tempted to call an attribute of the descriptor - things like the name,
cardinality, definition, etc. [Well, my 'descriptor' terminology for
the CXC DM dates back to 1995; I can't help it if the more recent ISO
11179 is getting it wrong :-)]

Friday afternoon: Modelling exercise on 'spectral bandpass'.

 o Proposed attributes:
     min wavelength
     max wavelength
     effective wavelength
     identifier
     transmission as a function of wavelength (measured or modelled)
     name (specific: J-Arizona; generic: J; waveband: "Near IR")

 o Data quality needs to be associated with all data - a complex
   quality object needs to be modelled. Also uncertainties, etc.

 o Questions to ask:
   The list of questions (methods) we came up with were basically
   access functions to the attributes listed above.
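As a purely illustrative rendering of the attributes above, a J-band
bandpass instance might look like the following sketch; all tag names
are invented here, not agreed conventions, and the numbers are rough
J-band values for illustration only:

   <bandpass identifier="J-Arizona">
      <name specific="J-Arizona" generic="J" waveband="Near IR"/>
      <minWavelength unit="nm"> 1100 </minWavelength>
      <maxWavelength unit="nm"> 1400 </maxWavelength>
      <effectiveWavelength unit="nm"> 1250 </effectiveWavelength>
      <!-- transmission as a function of wavelength;
           origin is "measured" or "modelled" -->
      <transmission origin="measured">
         <point wavelength="1100" value="0.02"/>
         <point wavelength="1250" value="0.91"/>
         <point wavelength="1400" value="0.03"/>
      </transmission>
      <!-- the quality and uncertainty objects noted above would
           attach here once modelled -->
   </bandpass>

The access-function methods we listed would simply return these
attributes from an instance like this.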
We noted also a requirement for multiplication and addition of
bandpass transmissions.

 o Problems representing the bandpass associated with an observation:
   red leaks, periodic filters, wedge and band-stop filters, multiple
   filters, atmospheric transmission.

We grouped the attributes into three groups: (A) the filter response
(most detailed), (B) the min/max/eff. lambda (less detailed), and (C)
the waveband name (least detailed?). There was some argument about
whether these were different (and differently detailed)
representations of the same information, or whether they were
fundamentally different; i.e. whether we should have separate classes
(A), (B), (C) which were specializations of a generic bandpass class,
or whether we should have a single class which has all of (A), (B),
(C) as attributes.

We later (I think on Saturday afternoon) revisited this and started
the usual war on which of wavelength/frequency/energy to use. This
turned into a model for wavelength/etc. with attributes of mode (which
of wavelength, frequency, energy to use), value and unit.

Possible XML:

   <spectralQuantity>
      <mode> Wavelength </mode>
      <value> 512.02 </value>
      <unit> nm </unit>
   </spectralQuantity>

The same model might be rendered in XML by more convenient tags:

   <wavelength unit="nm"> 512.02 </wavelength>

although it would be nice if the tags clearly reflected the model
being used.

Friday afternoon: Towards a process for the IVOA

Decided to use 'Convention' instead of 'Standard': standard implies an
authoritative body; convention is less rigid and more democratic?

What does it mean to have a standard for data models??

RPlante -- have to describe the model in a way that a program can
verify that it has the things in it that a VO DM needs to have.

JCM -- we need real examples in XML.

What are the components of the DM that we need to have?

DTody -- a name, then a white paper that describes the DM and defines
its attributes. Then an XML schema that can be verified. Then we get
to representations in whatever language.

A conforming DM should have certain things in it:

*********************************************
 Proposal for IVOA Conforming DM
*********************************************

 unique URI
 unique name       -- defines name space
 version           -- (or use separate name space?)
 description
 white paper (url)
 curation metadata
 class descriptions:
   class: is a type and doesn't have a value
     . Unique name within the model (namespace)
     . Description of class
     . Properties - relationships between one class and another
         ?inherits -- child of, simplified version of, ...
         (allowed relationships to be determined)
     . Attributes
         - name and definition
         - type (UCD?)
         - allowed values, default, null
         - cardinality (0, 1 or more..)
     . Methods - list names and interfaces
     . Abstract vs. concrete class?

*********************************************

The process for turning a data model into software:

   UML (diagram) -> XMI (XML interchange format) -> Schema (XML software)

Many schemas can represent the same UML/XMI. There will be a default
schema implied by the conversion software, but this may not reflect
how astronomers want to deal with the objects. So, there may be human
intervention to design the implementation of each data model.
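To make that final step concrete, here is a minimal sketch of what a
hand-designed, verifiable XML Schema for the spectral quantity model
above might look like; the element names follow the illustrative ones
used earlier and are not agreed conventions:

   <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
     <!-- one class from the model: a spectral quantity with
          mode (wavelength/frequency/energy), value and unit -->
     <xsd:element name="spectralQuantity">
       <xsd:complexType>
         <xsd:sequence>
           <xsd:element name="mode">
             <xsd:simpleType>
               <xsd:restriction base="xsd:string">
                 <xsd:enumeration value="Wavelength"/>
                 <xsd:enumeration value="Frequency"/>
                 <xsd:enumeration value="Energy"/>
               </xsd:restriction>
             </xsd:simpleType>
           </xsd:element>
           <xsd:element name="value" type="xsd:double"/>
           <xsd:element name="unit"  type="xsd:string"/>
         </xsd:sequence>
       </xsd:complexType>
     </xsd:element>
   </xsd:schema>

A validating parser run against reference instances would then be the
program-level verification RPlante asked for.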
Agreed process of deploying a data model:

****************************************************************************
 Proposal for IVOA Data Model Development Process
****************************************************************************

 1) Make text white paper

 2) Make UML diagram

 3) Implement a VO conforming (compliant) data model
    - must satisfy above description
    - groups involved should coordinate
    - doesn't have to be "approved" by IVOA

 4) Optional step for DMs considered to have good extensibility,
    generality, commonality with other VO-recommended DMs, etc:
    adopt as 'VO conventional' or 'VO recommended' DMs
    - IVOA Interop (?) to approve
    - Must have reference XML Schema implementation
    - Must have reference XML instances (to make it clear what the
      intent is)

****************************************************************************

We didn't make this explicit, but pending a software registry I would
encourage developers to register each stage of this process by sending
email to dm@us-vo.org.

We decided not to constitute an approval committee or standards body,
at least at this time. Suggested the W3C RFC model as preferable to
the FITS standardization process. Suggested we need to define a common
(consistent) language for attribute names - a vocabulary registry?

===== Start of Saturday ======

 o Ideas for today:
     Comparison of model work -- 1 hr
     -- break -- 10:45
     Interoperability report
     -- lunch --
     Pursue the bandpass model

We revisited the text of the 'Compliant DM' agreement. Suggested that
'adopted model' would be a good form of words for those preferred by
the collaboration.

 o DM discussion: What is data??
     - hypermaps (images, tables, ....)
     - documents
     - code
     - plots
     - links

   We'll concentrate on the usual image/table/structure data, but
   recall that a true data bundle may contain these other kinds of
   thing.

 o Ray's rule: Never use 'is-a' when 'has-a' will do.

 o Clive's pithy quote: Any software problem can be solved by adding a
   layer of abstraction, except the problem of having too many layers
   of abstraction.

 o DTody: a coordinate system should stand alone by itself.

 o RPlante: need for a property 'derived from'; the general thing is a
   dataset.

   There are 2 classes for observation in Mireille's diagram, raw and
   processed - do we want these to be more the same? (We want to use
   general image methods on raw data: possibly "raw observation" and
   "processed observation" are really "metadata for raw observation"
   and "observation (raw or processed)".)

   Ray argued strongly for a variety of named relationships in the
   modelling process.

 o Compared Dave's and Mireille's models.

   Key object mappings:

     ------------------   ------------------------
     Dave                 Mireille
     ------------------   ------------------------
     data object          raw obs / processed obs
     ndim data            stored data
                          image data
     ------------------   ------------------------

   - commonality:
     JCM argued for the telescope, instrument and detector objects to
     be modelled in a way that reflects their role in the progressive
     journey of the photon through the observation process.

   - differences:
     processing section: describe what has been done with the data
     actual vs. planned metadata
     distinction between processing tools / analysis tools?
     level of processing

   - problems:
     Mireille's model emphasizes the process of stitching together raw
     observations to make a processed observation, and in that model
     the telescope/instrument metadata are kept with the raw
     observation object, rather than being directly a property of the
     processed observation. Although I might claim this doesn't
     represent most current practice (processed observation FITS
     headers contain a lot of telescope metadata), it solves a generic
     problem with merging headers: what do you put for the INSTRUME
     keyword when making a single combined image from, say, HST and
     Chandra data? In general, you have the choice that every keyword
     can turn into an array of values, or that you use a processing
     history to follow links back. This also reminds us to be careful
     to distinguish in our models between things like the software
     pixel in our images and the hardware pixel of the instrument -
     they are logically distinct even when notionally the same size.
     In general, deciding whether to propagate information with a link
     or a copy is a common problem in this kind of work.
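To illustrate the link-back option using Ray's proposed 'derived from'
relationship (the element names and observation identifiers here are
invented for the sketch):

   <processedObservation id="combinedImage01">
      <!-- instead of forcing a single INSTRUME value, follow links
           back to the contributing raw observations -->
      <derivedFrom ref="hstObs"/>
      <derivedFrom ref="chandraObs"/>
   </processedObservation>

   <rawObservation id="hstObs">
      <telescope> HST </telescope>
      <instrument> WFPC2 </instrument>
   </rawObservation>

   <rawObservation id="chandraObs">
      <telescope> Chandra </telescope>
      <instrument> ACIS </instrument>
   </rawObservation>

The copy option would instead replicate (and pluralize) the
telescope/instrument values inside the processed observation itself.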
 o Key issues for images - components of an image

   We discussed what we meant by an 'image'. There are various kinds
   of n-dimensional data which may be considered images or may be
   considered more complicated kinds of dataset.
     - sparse?
     - pixelated? regular?
     - allowed pixel data type?
     - n dimensions?
     - must have RA/Dec?
     - must have >= 1 coord system?
     - spectrum?
     . an image is arbitrary, irregular, real-valued?
     . not a mosaic?
     . need to recognize there are things we aren't experts in and we
       need to model later on.
     . extensible scheme -- can't overspecify or overspecialize

****************************************************************************
 Proposal for IVOA Image Definition
****************************************************************************

 We ended up with the idea that a regular pixel array is key to the
 definition of an image: so, we'll call something an image if it has
   - a regular pixelated array
   - possibly sparsely filled (e.g. with a bad pixel map attached, or
     encoded in a compact pixel list, etc.)
   - whose pixel values are simple scalar data types (integer, real,
     complex; possibly character?; array-valued and object-valued
     pixels are not allowed)
   - n-dimensional, n arbitrary (and in particular not restricted to
     be 2)

 Things that do not satisfy this may be important VO data objects, but
 we won't call them images. Images may have a lot of other things as
 well (particularly coordinate systems, observation metadata, attached
 exposure maps, etc.) and we should model what at least a standard
 subset of these can be, but this is the minimum requirement for
 something to be an image. (A hypothetical XML instance of such a
 minimal image is sketched at the end of these notes.)
****************************************************************************

 o Missing so far:
     -- data quality
     -- uncertainties
     -- calibration

 o Different hierarchies: big-picture diagrams which give object
   hierarchies are useful, but there is no one hierarchy we should
   bless. It's the individual boxes we should have in common, plus a
   common framework within which we can define different hierarchies.

 o Next steps:
     - Work further on commonality of existing models.
     - Flesh out details of individual objects.
     - Eventually, interoperability tests of XML generated from such
       objects.
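Finally, the promised sketch of a minimal image instance satisfying
the definition above - the kind of XML instance those interoperability
tests would exercise. All tag names and values are invented for the
sketch:

   <image id="exampleImage">
      <!-- regular pixelated array: n-dimensional, n arbitrary -->
      <pixelArray ndim="2" type="real">
         <axisLengths> 3 2 </axisLengths>
         <!-- simple scalar pixel values, row by row -->
         <pixels> 1.0 2.5 0.0   4.2 0.0 3.3 </pixels>
      </pixelArray>
      <!-- optional: sparseness expressed via an attached bad pixel
           map (1 = bad) -->
      <badPixelMap> 0 0 1   0 1 0 </badPixelMap>
      <!-- other optional components - coordinate systems,
           observation metadata, exposure maps - would follow -->
   </image>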