Rationale and charter of the WG

Early in the work of the NVO and AVO/Interop teams a group of us emphasized the need for standards describing the structure and semantics of astronomical data to permit dataset interoperability. The VO will enable astronomers, and writers of astronomical software, to more easily locate public archival data, but that is only of limited use if all of the datasets are in different, mutually incompatible, and poorly described formats. The DM WG was established to attack this problem.

A data model, in the sense used by the WG, is an abstract description of concepts and their interrelationships, used to fix both the names and meanings of concepts in the VO context and also their internal structure and cross-connections.

The WG was created after the AVO demo meeting at Jodrell Bank in Jan 2003, following collaboration between the NVO Data Model WBS in the USA and interested parties in the AVO projects in the UK and France which led to a Data Model Technical Meeting in Cambridge (USA) in Oct 2002. Jonathan McDowell (SAO/Cambridge, USA) was appointed chair of the WG. At the Cambridge (UK) Interoperability Meeting in April 2003, the WG discussed issues of scope and process. It was decided that an IVOA data model would consist of:

(1) A white paper discussing and naming the concepts involved and describing the model and its intent in text form.

(2) A set of UML diagrams (including at least class diagrams) constituting a formal definition of the model.

(3) An XML schema file providing a reference serialization of the model. The schema does not define the model, but validates it and clarifies the intent of the model. XML document instances should also be provided as examples.

A combination document will then be circulated to the IVOA as an IVOA Working Draft.

Although other technologies could have been chosen (OWL or other ontology tools rather than UML, FITS serializations rather than XML), the above choices reflect a reasonable combination of simplicity, familiarity to most WG members, and consistency with the rest of VO work. Of course the approach is subject to future review by the WG, but so far the object-oriented rather than ontological methodology is what has been agreed on.

Although almost all aspects of the VO could benefit from a standard data model, we decided in Cambridge that the WG itself would focus on data models for datasets themselves, to solve the problem of dataset description and dataset interoperability. Other data models may be developed by other WGs and submitted to the DM WG for verification. For instance, the Registry WG is developing a data model for registries.


Interface of the WG with other parts of the IVOA

Data models will be used:

- by data providers to describe their data to the VO in a standard way
- by service and application developers to read those standard descriptions and work in terms of their concepts
- by later versions of the query language to express complex queries in ways that can be reliably matched against data.

This means that data models will be serialized as XML and FITS in VO-published datasets, serialized as XML in the VO query language, and implemented as software classes (bindings) in VO-aware applications. These serializations will probably be specified by the DAL and VOQL working groups in collaboration with the DM WG, but this is still to be worked out.

Clearly the strongest connections of the DM WG are with the DAL (Data Access Layer) WG and the VO Query Language working groups, although all WGs will make use of the concepts worked out in the DM.

Another strong interaction is with the UCD group. UCDs provide simple descriptions of different physical concepts, without the precision and structure of a data model. We anticipate that we will use the UCD semantic vocabulary to tag data model objects with meanings.
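As a rough illustration of tagging data model objects with UCD meanings, the following Python sketch attaches a semantic tag to each attribute of a model. The class `TaggedField`, the list `SPECTRUM_FIELDS`, the helper `fields_with_ucd`, and the tag strings themselves are all assumptions made for this example; none of them is defined by the WG or the UCD group.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedField:
    """A data model attribute annotated with a UCD-style semantic tag."""
    name: str  # attribute name within the (hypothetical) data model
    unit: str  # physical unit as free text
    ucd: str   # semantic tag; the values used below are illustrative only


# Hypothetical attributes of a 1-dimensional spectrum model.
SPECTRUM_FIELDS = [
    TaggedField("wavelength", "Angstrom", "em.wl"),
    TaggedField("flux", "erg/s/cm2/Angstrom", "phot.flux.density"),
]


def fields_with_ucd(fields, prefix):
    """Return the fields whose tag starts with the given prefix."""
    return [f for f in fields if f.ucd.startswith(prefix)]
```

The data model keeps the structure (which attributes exist and how they relate), while the tag only carries the meaning, which matches the division of labour described above.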


Requirements from other WGs

Requirements from other WGs will be provided both informally and in terms of use cases. Currently we have a requirement from the DAL WG to provide a data model that will let data providers describe 1-dimensional spectra in sufficient detail to permit their selection in a query and their interpretation at a simple level after retrieval.

Currently the process to provide requirements is by email to the IVOA DM mailing list.


Data Models from other WGs

We recommend that data models created by other WGs should be submitted first to the DM mailing list in the form of an IVOA Working Draft. The DM WG will comment on the model's completeness and its consistency with other models in the IVOA, but will in general refrain from second-guessing the technical details specific to the other WG's area of specialty. The model will then be circulated as an IVOA Working Draft in the usual way.


Connecting Different Models

In principle one could generate a single massive data model for the whole of the IVOA. However, such a model would likely never converge. Part of the WG charter is to ensure that the different data models for different parts of the VO domain are as consistent with one another as possible, and in particular that objects are reused where appropriate. The WG should also ensure that models are as generic as possible - in particular, that the generalization of a given model to all domains of astronomy is discussed. This does not rule out the existence of models which are specialized to particular subdisciplines, but such specialization should be shown to be necessary (due to true uniqueness or to considerations of efficiency or complexity).


Namespaces

At this stage we anticipate that different models will exist for the same part of the problem domain. It has been suggested that models used by particular projects within the VO be prefixed by namespaces, e.g. CfA:Quantity, while models adopted as the preferred description of a concept by the WG be identified by the IVOA namespace, e.g. IVOA:Quantity. This proposal should be discussed further.

We also want to define interfaces with individual non-VO-specific projects. Specific data models tuned to a particular set of data have been developed or are in progress in projects such as SDSS, ALMA, Chandra and Planck. Typically these models will hide implicit assumptions specific to their domain, and only make explicit those things which are not constant within their problem. On the other hand, they may elaborate some areas of the problem important to them in much greater detail than the IVOA generic models. We should be able to provide partial mappings between these specific models and our more general approach.
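The namespace prefixes and partial mappings discussed above could be sketched as follows. This is only one possible reading of the proposal: the functions `split_qualified` and `to_ivoa`, the table `PARTIAL_MAP`, and the name `Proj:LocalThing` are hypothetical, and treating an unprefixed name as belonging to the IVOA namespace is an assumption of this example.

```python
def split_qualified(name, default_ns="IVOA"):
    """Split 'NS:Concept' into (namespace, concept).

    Names without a prefix are assumed to live in the default namespace.
    """
    ns, sep, concept = name.partition(":")
    if not sep:
        return default_ns, name
    return ns, concept


# Hypothetical partial mapping from project-specific names to IVOA names.
PARTIAL_MAP = {
    "CfA:Quantity": "IVOA:Quantity",
}


def to_ivoa(name):
    """Return the IVOA-preferred name, or None when no mapping exists."""
    ns, concept = split_qualified(name)
    if ns == "IVOA":
        return f"IVOA:{concept}"
    return PARTIAL_MAP.get(name)
```

Returning `None` for unmapped names makes the partial nature of the mapping explicit: a project model may contain concepts with no counterpart in the generic model.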


Serializations

The assumption in the IVOA is that the serialization language of choice is XML. We still have to work out the details of the interaction between XML-serialized data models and the VOTABLE XML format.
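A minimal sketch of what an XML serialization of a single data model object might involve is given below, using Python's standard `xml.etree.ElementTree` module. The element name `Quantity`, its `name` and `unit` attributes, and the two helper functions are assumptions made for illustration; they do not come from any IVOA schema and say nothing about how VOTABLE would carry the same information.

```python
import xml.etree.ElementTree as ET


def quantity_to_xml(name, value, unit):
    """Serialize a (hypothetical) Quantity object to an XML string."""
    elem = ET.Element("Quantity", {"name": name, "unit": unit})
    elem.text = repr(float(value))
    return ET.tostring(elem, encoding="unicode")


def quantity_from_xml(text):
    """Parse the XML string back into (name, value, unit)."""
    elem = ET.fromstring(text)
    return elem.get("name"), float(elem.text), elem.get("unit")
```

A serialization is only useful for interoperability if reading it back recovers the same object, so a round trip such as `quantity_from_xml(quantity_to_xml("wavelength", 5500.0, "Angstrom"))` is a natural first check.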

The FITS format is likely to remain the underlying binary data format for large datasets for the foreseeable future. While the FITS format is limited, in particular due to the 8-character keyword limit, it is possible that it will sometimes be convenient to serialize data models fully in FITS, to help analysis systems which do not conveniently interact with XML parsers.
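To make the 8-character keyword limit concrete, the sketch below derives FITS-style keywords from longer data model attribute names. The truncate-and-number scheme, the function `fits_keyword`, and the attribute names in the usage note are assumptions of this example and not a convention adopted by the WG.

```python
import re


def fits_keyword(name, taken):
    """Derive a keyword of at most 8 characters from an attribute name.

    Keeps only upper-case letters, digits, hyphen and underscore, truncates
    to 8 characters, and appends a number when the keyword is already used.
    `taken` is a set of keywords in use and is updated in place.
    """
    base = re.sub(r"[^A-Z0-9_-]", "", name.upper())[:8]
    if not base:
        raise ValueError(f"no usable characters in {name!r}")
    candidate = base
    n = 1
    while candidate in taken:
        suffix = str(n)
        candidate = base[:8 - len(suffix)] + suffix
        n += 1
    taken.add(candidate)
    return candidate
```

With an empty set, the hypothetical attributes `spectralCoordinate` and `spectralResolution` both truncate to the same 8 characters, so the second one receives a numbered keyword. Collisions of this kind are one reason a full FITS serialization needs an agreed mapping rather than ad hoc truncation.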


Bindings

The first priority in interoperability is defining the serialization (byte representation as disk file or data stream) of the data model and thus the interpretation of the serialized data. Without this not even a human can interoperably interpret data from different sources. Also important is the definition of the binding of the data model class methods, i.e. the subroutine interface to the data. These two definitions interact, to the extent that use cases first imply software methods, and thinking about what these methods need in order to work tells us what attributes and metadata must be present in the data files. However, it is not necessary to elaborate all possible operations one might want to do with a data model to have a working implementation. We should therefore err on the side of defining simple interfaces which expose the data concepts as directly as possible while still hiding abstractions which unify different flavors of those concepts.
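The idea of a simple interface that exposes a concept directly while unifying its different flavors can be sketched as below. Python is used here only for brevity and is not one of the binding languages under discussion; the class `SpectralCoord`, its two flavors, and its method names are hypothetical.

```python
from dataclasses import dataclass

SPEED_OF_LIGHT = 299_792_458.0  # metres per second


@dataclass(frozen=True)
class SpectralCoord:
    """A spectral coordinate stored in one of two flavors."""
    value: float
    flavor: str  # "wavelength" (metres) or "frequency" (hertz)

    def wavelength_m(self):
        """Return the coordinate as a wavelength in metres."""
        if self.flavor == "wavelength":
            return self.value
        if self.flavor == "frequency":
            return SPEED_OF_LIGHT / self.value
        raise ValueError(f"unknown flavor: {self.flavor!r}")

    def frequency_hz(self):
        """Return the coordinate as a frequency in hertz."""
        if self.flavor == "frequency":
            return self.value
        if self.flavor == "wavelength":
            return SPEED_OF_LIGHT / self.value
        raise ValueError(f"unknown flavor: {self.flavor!r}")
```

The caller asks for the quantity it needs and does not have to know how the data provider stored it. The set of attributes the methods rely on (a value and its flavor) is also exactly what a serialization of this object would have to carry.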

Proposal by the DM WG chair: The DM WG should define bindings in languages which are ANSI standard and implemented portably on most platforms in common use in the astronomical community. In particular, since user surveys show that Linux is becoming the most common operating system used by astronomers, bindings should be implementable on some flavor of Linux. This in no way prevents IVOA member institutions from implementing non-IVOA-standardized bindings in other non-portable or proprietary languages only available on other operating systems.

Informally, Java seems the most likely language for initial implementations. However, since ANSI C and Fortran remain the most common languages used by developers of astronomical analysis software and by actual astronomers, we may wish to explore bindings in these traditional but still widely used languages. Note that the DM work straddles the interface between the VO web portals and services - firmly in the computer science world where fashionable XML/Java/WSDL technologies are appropriate - and the end-user astronomer analysing data on their desktop, whom we must be careful not to alienate by forcing them to use technologies which are both unfamiliar and not as well adapted to their problem.

Topic revision: r3 - 2007-01-19 - BrunoRino
 